# BestOf Metric
The BestOf metric runs tournament-style comparisons between multiple AI assistants to determine which one performs best.

## Overview
The metric:

- Pairs assistants in elimination rounds
- Uses an LLM judge to evaluate each matchup
- Advances winners until a final champion is determined
- Handles ties (both advance) and byes (odd number of contestants)
## Installation
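The distribution name isn't shown here; the line below uses a placeholder, so substitute the actual package that ships this metric:

```bash
pip install your-eval-library  # placeholder -- replace with the real package name
```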
## Basic Usage
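A minimal sketch of constructing and running the metric. The import paths, the `MyRetriever` class, and the `compute()` entry point are illustrative assumptions, not the library's confirmed API; only the constructor parameters come from the tables below.

```python
from langchain_openai import ChatOpenAI  # any LangChain-compatible chat model works

# Hypothetical import paths -- adjust to the actual package layout.
from your_eval_library.metrics import BestOf
from your_eval_library.retrievers import MyRetriever  # your Retriever subclass

metric = BestOf(
    retriever=MyRetriever,             # note: the Retriever *class*, per Type[Retriever]
    model=ChatOpenAI(model="gpt-4o"),  # the judge model
)

result = metric.compute()              # hypothetical entry point name
print(result)
```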
## Parameters
### Required Parameters
| Parameter | Type | Description |
|---|---|---|
| `retriever` | `Type[Retriever]` | Data source with multiple assistants |
| `model` | `BaseChatModel` | LangChain-compatible judge model |
### Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `criteria` | `str` | `"BestOf"` | Evaluation criteria for judging |
| `use_structured_output` | `bool` | `False` | Use LangChain structured output |
| `bos_json_clause` | `str` | `` "```json" `` | JSON block start marker |
| `eos_json_clause` | `str` | `` "```" `` | JSON block end marker |
| `verbose` | `bool` | `False` | Enable verbose logging |
## Data Requirements
BestOf requires datasets from multiple assistants answering the same questions:
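The exact input format depends on your `Retriever` implementation; the shape below is only a conceptual illustration of the key requirement, namely that every assistant answers the same question set.

```python
# Illustrative only -- the real input schema is defined by your Retriever.
datasets = {
    "assistant_a": [
        {"question": "What is the capital of France?", "answer": "Paris."},
        {"question": "Explain recursion briefly.", "answer": "A function calling itself..."},
    ],
    "assistant_b": [
        {"question": "What is the capital of France?", "answer": "The capital is Paris."},
        {"question": "Explain recursion briefly.", "answer": "Recursion is when..."},
    ],
}
```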
## Output Schema

### BestOfMetric
### BestOfContest
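The library's exact fields for these two objects aren't reproduced here. A plausible Pydantic sketch of both, with every field name an assumption for illustration:

```python
from typing import List, Optional
from pydantic import BaseModel

class BestOfContest(BaseModel):
    """One judged matchup between two assistants (field names are assumptions)."""
    contestant_a: str
    contestant_b: str
    winner: Optional[str] = None  # None for a tie, in which case both advance
    reasoning: str                # the judge's explanation

class BestOfMetric(BaseModel):
    """The overall tournament result (field names are assumptions)."""
    champion: str
    rounds: List[List[BestOfContest]]  # contests grouped by elimination round
```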
## Tournament Structure
### Example: 4 Contestants
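With four contestants the tournament runs two rounds: two first-round matchups, then a final between the winners.

```text
Round 1              Final

A ──┐
    ├── A/B winner ──┐
B ──┘                │
                     ├── Champion
C ──┐                │
    ├── C/D winner ──┘
D ──┘
```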
### Special Cases
- **Ties:** Both assistants advance to the next round
- **Byes:** With an odd number of contestants, one assistant gets a free pass

## Complete Example
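An end-to-end sketch combining the parameters above. As in Basic Usage, the import paths, `MyRetriever`, the `compute()` method, and the `champion` field are assumptions for illustration.

```python
from langchain_openai import ChatOpenAI

# Hypothetical imports -- adjust to the actual package layout.
from your_eval_library.metrics import BestOf
from your_eval_library.retrievers import MyRetriever  # points at your multi-assistant data

# A strong judge model keeps pairwise verdicts reliable (see Best Practices below).
judge = ChatOpenAI(model="gpt-4o", temperature=0)

metric = BestOf(
    retriever=MyRetriever,
    model=judge,
    criteria="Accuracy, helpfulness, and clarity of the final answer",
    use_structured_output=True,  # parse the judge's verdict via LangChain structured output
    verbose=True,                # log each matchup as the bracket progresses
)

result = metric.compute()        # hypothetical entry point name
print(result.champion)           # field name assumed; see Output Schema above
```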
## Visualization
### Tournament Bracket
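Whether the library ships its own bracket renderer isn't confirmed here; a plain-text rendering of the result object sketched under Output Schema is easy to hand-roll:

```python
def print_bracket(result) -> None:
    """Render the tournament round by round (assumes the sketched result schema)."""
    for round_idx, contests in enumerate(result.rounds, start=1):
        print(f"Round {round_idx}")
        for contest in contests:
            outcome = contest.winner or "tie (both advance)"
            print(f"  {contest.contestant_a} vs {contest.contestant_b} -> {outcome}")
    print(f"Champion: {result.champion}")
```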
## Evaluation Criteria
Customize the `criteria` parameter to focus on specific aspects:
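The value is a free-form string (the default is simply `"BestOf"`). For example, reusing the hypothetical names from the sketches above:

```python
# Focus the judge on whatever matters for your application.
metric = BestOf(
    retriever=MyRetriever,
    model=judge,
    criteria=(
        "Prefer the answer that is factually correct, cites its sources, "
        "and stays concise. Penalize hallucinated details."
    ),
)
```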
## Use Cases
### Model Selection

Compare multiple LLMs to find the best one for your use case.

### A/B Testing

Evaluate different prompt strategies or configurations.

### Quality Benchmarking

Establish a quality baseline across assistant versions.

### Continuous Improvement

Track improvements between model versions.
## Best Practices
### Use Consistent Questions

Ensure all assistants answer exactly the same questions so the comparison is fair.
### Include Diverse Queries

Test different types of interactions:
- Factual questions
- Creative tasks
- Problem-solving
- Multi-turn conversations
### Define Clear Criteria

Be specific about what matters for your use case in the `criteria` parameter.

### Use Strong Judge Models
Larger models (GPT-4, Claude-3, Llama-3-70B) provide more reliable judgments.