The Retriever class is the entry point for loading your conversation data into Fair Forge. Every evaluation requires a custom retriever that implements load_dataset(). Fair Forge supports three iteration strategies that control how data is consumed — from loading everything upfront to yielding individual QA pairs on demand.
The return type of load_dataset() must be consistent with iteration_level. Returning an Iterator with the default FULL_DATASET level will raise a ValueError at runtime.
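The three consumption patterns can be illustrated in plain Python. This is a stdlib-only sketch of the semantics, not the fair_forge API: the helper names below are ours, and they only mirror what the iteration levels imply (a list for full upfront loading, generators for lazy dataset-level and QA-level iteration).

```python
from typing import Iterator

# Illustrative sketch only: these function names are assumptions,
# not part of fair_forge. Each one mirrors an iteration strategy.

def load_full(records: list[dict]) -> list[dict]:
    # FULL_DATASET: materialize everything upfront as a single list.
    return list(records)

def load_per_dataset(records: list[dict]) -> Iterator[dict]:
    # Dataset-level iteration: yield one dataset at a time, lazily.
    yield from records

def load_per_qa(records: list[dict]) -> Iterator[dict]:
    # QA-level iteration: yield individual QA pairs on demand.
    for record in records:
        yield from record["conversation"]

records = [
    {"conversation": [{"qa_id": "q1"}, {"qa_id": "q2"}]},
    {"conversation": [{"qa_id": "q3"}]},
]
```

The key point the ValueError enforces: the first pattern returns a fully built list, while the other two return lazy iterators, so the framework must know which shape to expect before it starts consuming.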
BestOf expects one Dataset per assistant, all answering the same questions. The retriever loads them — BestOf handles the tournament logic automatically.
```python
import json
from pathlib import Path

from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset


class MultiAssistantRetriever(Retriever):
    """Each Dataset in the file represents one assistant answering the same questions."""

    def __init__(self, file_path: str = "dataset_bestof.json", **kwargs):
        super().__init__(**kwargs)
        self.file_path = Path(file_path)

    def load_dataset(self) -> list[Dataset]:
        with open(self.file_path) as f:
            return [Dataset.model_validate(item) for item in json.load(f)]


# Usage — BestOf only needs model and criteria; the retriever provides the assistants
metrics = BestOf.run(
    MultiAssistantRetriever,
    model=judge_model,
    use_structured_output=True,
    criteria="Overall response quality, helpfulness, and clarity",
)
```
The JSON file must contain one entry per assistant, all sharing the same qa_id values so BestOf can pair their responses:
```json
[
  {
    "session_id": "eval-session",
    "assistant_id": "assistant_alpha",
    "language": "english",
    "context": "",
    "conversation": [
      {
        "qa_id": "q1",
        "query": "What are the benefits of renewable energy?",
        "assistant": "...",
        "ground_truth_assistant": ""
      }
    ]
  },
  {
    "session_id": "eval-session",
    "assistant_id": "assistant_beta",
    "language": "english",
    "context": "",
    "conversation": [
      {
        "qa_id": "q1",
        "query": "What are the benefits of renewable energy?",
        "assistant": "...",
        "ground_truth_assistant": ""
      }
    ]
  }
]
```
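Since BestOf can only compare responses whose qa_id appears in every assistant's conversation, it can be worth validating the file before running an evaluation. A minimal stdlib-only check (the `shared_qa_ids` helper is our own illustration, not a fair_forge utility) might look like:

```python
import json

def shared_qa_ids(datasets: list[dict]) -> set[str]:
    # Intersect the qa_id sets of every assistant's conversation;
    # only ids present in all datasets are pairable by BestOf.
    id_sets = [{turn["qa_id"] for turn in d["conversation"]} for d in datasets]
    return set.intersection(*id_sets) if id_sets else set()

raw = """
[
  {"assistant_id": "assistant_alpha",
   "conversation": [{"qa_id": "q1"}, {"qa_id": "q2"}]},
  {"assistant_id": "assistant_beta",
   "conversation": [{"qa_id": "q1"}]}
]
"""
# Only q1 is answered by both assistants, so only q1 is comparable.
print(shared_qa_ids(json.loads(raw)))  # → {'q1'}
```

If the result is smaller than expected, some assistants are missing questions and their unmatched responses would have nothing to be paired against.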