Agentic Metric
The Agentic metric evaluates AI agent performance by measuring complete conversation correctness. A conversation is correct only if ALL of its interactions are correct. It supports pluggable statistical modes: frequentist returns point estimates for pass@K, while Bayesian propagates the uncertainty in the estimated success rate through the pass@K formula to produce credible intervals.
Overview
- Conversation Correctness: A conversation is correct only if ALL interactions are correct
- pass@K: Probability of ≥1 correct conversation when attempting k conversations (0.0–1.0)
- pass^K: Probability of all k conversations being correct (0.0–1.0)
- Tool Correctness: Evaluates tool selection, parameter accuracy, execution sequence, and result utilization per interaction
Formulas
Frequentist: p = c/n is used as a point estimate and plugged directly into the pass@K formulas
Bayesian: p is a Beta-Binomial posterior distribution — the pass@K formula is applied across all posterior samples, yielding a credible interval for pass@K and pass^K
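The frequentist formulas can be sketched in plain Python. This is an illustrative reimplementation, not the library's own code; the function names are ours:

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one correct conversation in k attempts."""
    return 1.0 - (1.0 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k conversations are correct."""
    return p ** k

# Frequentist point estimate: p = c / n
c, n, k = 7, 10, 3            # correct interactions, total interactions, attempts
p = c / n                     # 0.7
print(round(pass_at_k(p, k), 3))   # 1 - 0.3**3 = 0.973
print(round(pass_pow_k(p, k), 3))  # 0.7**3 = 0.343
```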
k is a required parameter. pass@K and pass^K are computed per conversation using n = total_interactions and c = correct_interactions. The default tool_threshold=1.0 requires perfect tool usage; lower it (e.g. 0.75) to allow minor deviations.
Installation
Basic Usage
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class — each Dataset = 1 conversation |
| model | BaseChatModel | LangChain-compatible model for LLM-as-judge evaluation |
| k | int | Number of independent attempts for pass@K/pass^K computation |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| threshold | float | 0.7 | Answer correctness threshold (0.0–1.0) |
| tool_threshold | float | 1.0 | Tool correctness threshold (0.0–1.0) |
| tool_weights | dict[str, float] | 0.25 each | Weights for tool aspects (selection, parameters, sequence, utilization) |
| use_structured_output | bool | True | Use LangChain structured output |
| bos_json_clause | str | "```json" | JSON block start marker |
| eos_json_clause | str | "```" | JSON block end marker |
| verbose | bool | False | Enable verbose logging |
Statistical Modes
- Frequentist: computes p = c/n as a point estimate and plugs it directly into the pass@K formulas. Simple and fast. pass_at_k_ci_low, pass_at_k_ci_high, pass_pow_k_ci_low, and pass_pow_k_ci_high are all None.
- Bayesian: models p as a Beta-Binomial posterior and applies the pass@K formula across posterior samples, yielding credible intervals for pass@K and pass^K.

Why Bayesian matters for agentic evaluation: a pass@3 of 0.90 sounds great, but if it comes from only 5 conversations, the 95% CI might be [0.55, 0.99]. With 100 conversations, the same rate gives [0.84, 0.95], which is much more trustworthy. Use Bayesian mode when you have few test conversations and need to communicate reliability honestly.
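A minimal sketch of how such a credible interval can be obtained by Monte-Carlo sampling, assuming a Beta(1, 1) prior (an illustration of the idea, not the library's implementation):

```python
import random

def bayesian_pass_at_k_ci(c: int, n: int, k: int,
                          samples: int = 20_000, level: float = 0.95):
    """Credible interval for pass@K under a Beta(1+c, 1+n-c) posterior."""
    rng = random.Random(0)  # fixed seed for reproducibility
    draws = sorted(1.0 - (1.0 - rng.betavariate(1 + c, 1 + n - c)) ** k
                   for _ in range(samples))
    lo = draws[int((1 - level) / 2 * samples)]
    hi = draws[int((1 + level) / 2 * samples) - 1]
    return lo, hi

# Same 0.8 success rate, very different certainty:
print(bayesian_pass_at_k_ci(4, 5, 3))     # few conversations -> wide interval
print(bayesian_pass_at_k_ci(80, 100, 3))  # more data -> narrow interval
```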
Data Requirements
Each Dataset represents one complete conversation. A conversation is correct only if ALL interactions are correct.
Agentic Data Structure
agentic (actual tool usage):
ground_truth_agentic (expected tool usage):
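A plausible shape for these two fields, inferred from the descriptions on this page; the key names and the example tool are assumptions, not the library's actual schema:

```python
# Hypothetical shapes — key names are illustrative assumptions.
agentic = {                      # actual tool usage recorded for an interaction
    "tool_calls": [
        {"name": "search_flights",  # hypothetical tool
         "parameters": {"origin": "SFO", "dest": "JFK"}},
    ],
}

ground_truth_agentic = {         # expected tool usage for the same interaction
    "expected_tools": ["search_flights"],
    "required_parameters": {"search_flights": ["origin", "dest"]},
    "sequence_matters": False,   # does call order matter?
    "results_should_influence_answer": True,
}

# Tool names must match exactly (case-sensitive).
assert agentic["tool_calls"][0]["name"] in ground_truth_agentic["expected_tools"]
```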
Output Schema
AgenticMetric
ToolCorrectnessScore
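The schema bodies are not reproduced in this excerpt. As a rough sketch, the fields referenced elsewhere on this page suggest a shape like the following; anything beyond the field names mentioned in the text is an assumption:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ToolCorrectnessScore:
    # The four weighted aspects described under tool_weights.
    selection: float = 0.0
    parameters: float = 0.0
    sequence: float = 0.0
    utilization: float = 0.0

@dataclass
class AgenticMetric:
    pass_at_k: float = 0.0
    pass_pow_k: float = 0.0
    # None in frequentist mode; populated by Bayesian mode.
    pass_at_k_ci_low: Optional[float] = None
    pass_at_k_ci_high: Optional[float] = None
    pass_pow_k_ci_low: Optional[float] = None
    pass_pow_k_ci_high: Optional[float] = None
    # One entry per interaction; None when the interaction used no tools.
    tool_correctness_scores: list = field(default_factory=list)
```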
Interpretation
pass@K vs pass^K
| Metric | Formula | Meaning |
|---|---|---|
| pass_at_k | 1 - (1-p)^k | Probability of ≥1 correct conversation in k attempts |
| pass_pow_k | p^k | Probability of ALL k attempts being correct |
- pass@3 = 0.92 → 92% chance of getting ≥1 fully correct conversation in 3 attempts
- pass^3 = 0.15 → 15% chance all 3 conversations are fully correct
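Both numbers follow from the per-conversation success rate p, and the pass@K formula can also be inverted to find the p required for a target. A quick sanity check with illustrative values:

```python
p, k = 0.57, 3
print(round(1 - (1 - p) ** k, 2))   # pass@3 ~= 0.92: at least one success is likely
print(round(p ** k, 2))             # pass^3 ~= 0.19: all three succeeding is not

# Inverting: what per-conversation rate p is needed for pass@3 = 0.99?
target = 0.99
print(round(1 - (1 - target) ** (1 / k), 2))  # ~= 0.78
```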
Agent Quality Assessment
| pass@K | pass^K | Assessment |
|---|---|---|
| >0.95 | >0.70 | ✅ Reliable — High success and consistency |
| >0.95 | <0.50 | ⚠️ Inconsistent — Can succeed but unreliable |
| <0.70 | any | ❌ Needs Improvement — Low success rate |
Tool Correctness Scores
| Score Range | Interpretation |
|---|---|
| 1.0 | Perfect — all aspects correct |
| 0.75–0.99 | Good — minor issues |
| 0.50–0.74 | Moderate — several issues |
| < 0.50 | Poor — significant problems |
Aggregation Functions
pass_at_k and pass_pow_k are embedded in each AgenticMetric. The standalone functions are available for computing additional K values after the fact:
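The standalone functions themselves are not shown in this excerpt. As an illustration of the idea, recomputing both metrics for a new K from stored per-conversation counts looks like this (function and variable names are ours, not the library's):

```python
def recompute_pass_metrics(correct: int, total: int, k: int):
    """Recompute pass@K and pass^K for a new K from stored interaction counts."""
    p = correct / total
    return 1 - (1 - p) ** k, p ** k

# Per-conversation (correct, total) interaction counts — illustrative data.
conversations = [(9, 10), (7, 10), (10, 10)]
for c, n in conversations:
    at_k, pow_k = recompute_pass_metrics(c, n, k=5)
    print(f"c={c}/{n}: pass@5={at_k:.3f}, pass^5={pow_k:.3f}")
```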
LLM Provider Options
Custom Tool Weights
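The example configuration is not reproduced here. A plausible sketch, assuming the four aspect keys named in the parameter table (selection, parameters, sequence, utilization) and that the weights should sum to 1.0:

```python
# Hypothetical weighting: emphasize choosing the right tool and using its result.
tool_weights = {
    "selection": 0.40,    # did the agent choose the right tool?
    "parameters": 0.20,   # were the arguments correct?
    "sequence": 0.10,     # were calls made in the right order?
    "utilization": 0.30,  # did the result inform the final answer?
}
assert abs(sum(tool_weights.values()) - 1.0) < 1e-9
```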
Best Practices
Use Bayesian Mode for Small Test Suites
If you have fewer than 30 conversations, frequentist pass@K estimates can be misleading. Bayesian mode shows you the credible interval, making it clear when more data is needed before drawing conclusions.
Choose Appropriate K Values
- K=1: Evaluate single conversation success rate
- K=3–5: Balance between reliability and cost (recommended)
- K=10+: High-stakes scenarios requiring high confidence
Set Meaningful Thresholds
- Strict (0.8–0.9): Factual accuracy matters (medical, legal)
- Moderate (0.7): General purpose — recommended default
- Lenient (0.6): Creative or subjective tasks
Define Clear Tool Expectations
Provide complete ground_truth_agentic per interaction with expected tool names, required parameters, whether sequence matters, and whether tool results should influence the final answer.
Use Strong Judge Models
Larger models (GPT-4, Claude-3, Llama-3-70B+) provide more reliable correctness evaluations.
Troubleshooting
Judge Returns Low Scores for Correct Answers
Lower the threshold parameter (try 0.6–0.65), use a more capable judge model, or ensure ground truth is clear and unambiguous. Check verbose logs to see judge reasoning.
Tool Correctness Always Fails
The default tool_threshold=1.0 requires perfect tool correctness. Lower it with tool_threshold=0.75 to allow minor deviations. Verify tool names match exactly (case-sensitive) and check parameter structure.
Some Interactions Have None for Tool Correctness
This is expected — tool_correctness_scores[i] is None when an interaction did not use tools.
Bayesian CI is Very Wide
A wide CI means there is not enough data to estimate the true success rate precisely. This is intentional — collect more test conversations to narrow the interval.
Next Steps
Statistical Modes
Deep dive into Frequentist vs Bayesian approaches
BestOf Metric
Compare multiple agents in tournament-style evaluation
AWS Lambda
Deploy Agentic as a serverless function