
Agentic Metric

The Agentic metric evaluates AI agent performance by measuring complete conversation correctness. A conversation is correct only if ALL its interactions are correct. It supports pluggable statistical modes: the frequentist mode returns point estimates for pass@K, while the Bayesian mode propagates the uncertainty in the estimated success rate through the pass@K formula to produce credible intervals.

Overview

  • Conversation Correctness: A conversation is correct only if ALL interactions are correct
  • pass@K: Probability of ≥1 correct conversation when attempting k conversations (0.0–1.0)
  • pass^K: Probability of all k conversations being correct (0.0–1.0)
  • Tool Correctness: Evaluates tool selection, parameter accuracy, execution sequence, and result utilization per interaction

Formulas

pass@k = 1 - (1 - p)^k   # Probability of ≥1 correct in k independent attempts
pass^k = p^k              # Probability of all k attempts correct

Where p = estimated success rate from evaluation
  • Frequentist: p = c/n — a point estimate
  • Bayesian: p is a Beta-Binomial posterior distribution — the pass@K formula is applied across all posterior samples, yielding a credible interval for pass@K and pass^K
k is a required parameter. pass@K and pass^K are computed per conversation using n = total_interactions and c = correct_interactions. The default tool_threshold=1.0 requires perfect tool usage — lower it (e.g. 0.75) to allow minor deviations.
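As a sketch, the two formulas can be written directly in Python. The helper names below are illustrative, not part of the library:

```python
# Minimal sketch of the pass@K / pass^K formulas in frequentist mode.
# pass_at_k_point / pass_pow_k_point are illustrative names, not library functions.

def pass_at_k_point(n: int, c: int, k: int) -> float:
    """P(at least 1 fully correct conversation in k independent attempts)."""
    p = c / n  # point-estimate success rate
    return 1 - (1 - p) ** k

def pass_pow_k_point(n: int, c: int, k: int) -> float:
    """P(all k independent attempts fully correct)."""
    p = c / n
    return p ** k

# With 7 correct out of 10 interactions and k=3:
print(round(pass_at_k_point(10, 7, 3), 3))   # 0.973
print(round(pass_pow_k_point(10, 7, 3), 3))  # 0.343
```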

Installation

uv add "alquimia-fair-forge[agentic]"
uv add langchain-groq  # Or your preferred LLM provider

Basic Usage

from fair_forge.metrics.agentic import Agentic
from langchain_groq import ChatGroq
from your_retriever import AgenticRetriever

judge_model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    threshold=0.7,
)

for metric in metrics:
    print(f"{metric.session_id}:")
    print(f"  pass@{metric.k} = {metric.pass_at_k:.3f}")
    print(f"  pass^{metric.k} = {metric.pass_pow_k:.3f}")
    # metric.pass_at_k_ci_low / ci_high are None in frequentist mode

Parameters

Required Parameters

| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class — each Dataset = 1 conversation |
| model | BaseChatModel | LangChain-compatible model for LLM-as-judge evaluation |
| k | int | Number of independent attempts for pass@K/pass^K computation |

Optional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| threshold | float | 0.7 | Answer correctness threshold (0.0–1.0) |
| tool_threshold | float | 1.0 | Tool correctness threshold (0.0–1.0) |
| tool_weights | dict[str, float] | 0.25 each | Weights for tool aspects (selection, parameters, sequence, utilization) |
| use_structured_output | bool | True | Use LangChain structured output |
| bos_json_clause | str | "```json" | JSON block start marker |
| eos_json_clause | str | "```" | JSON block end marker |
| verbose | bool | False | Enable verbose logging |

Statistical Modes

Frequentist Mode (default)

Computes p = c/n as a point estimate and plugs it directly into the pass@K formulas. Simple and fast.
# With 7 correct out of 10 interactions, k=3:
# p = 7/10 = 0.70
# pass@3 = 1 - (1 - 0.70)^3 = 0.973
# pass^3 = 0.70^3 = 0.343
In frequentist mode, pass_at_k_ci_low, pass_at_k_ci_high, pass_pow_k_ci_low, and pass_pow_k_ci_high are all None.
Bayesian Mode

Samples p from a Beta-Binomial posterior and applies the pass@K formulas to every posterior sample, yielding credible intervals instead of a single point estimate.

Why Bayesian matters for agentic evaluation: A pass@3 of 0.90 sounds great — but if it comes from only 5 conversations, the 95% CI might be [0.55, 0.99]. With 100 conversations, the same rate gives [0.84, 0.95], which is much more trustworthy. Use Bayesian mode when you have few test conversations and need to communicate reliability honestly.
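As an illustration of the idea (not the library's internals), a Beta posterior can be sampled and pushed through the pass@K formula to obtain a credible interval. The uniform Beta(1,1) prior and the function name below are assumptions for the sketch:

```python
import random

def pass_at_k_credible_interval(n, c, k, samples=20_000, seed=0):
    """Approximate a 95% credible interval for pass@K via posterior sampling.

    Assumes a uniform Beta(1,1) prior, so the posterior on the success
    rate p is Beta(c+1, n-c+1). Each posterior draw of p is pushed
    through the pass@K formula 1 - (1-p)^k, then we read off the
    2.5th and 97.5th percentiles of the resulting samples.
    """
    rng = random.Random(seed)
    draws = sorted(
        1 - (1 - rng.betavariate(c + 1, n - c + 1)) ** k
        for _ in range(samples)
    )
    return draws[int(0.025 * samples)], draws[int(0.975 * samples)]

# Few interactions -> wide interval; more data -> narrower interval.
lo, hi = pass_at_k_credible_interval(n=10, c=7, k=3)
print(f"pass@3 95% CI (n=10):  [{lo:.3f}, {hi:.3f}]")
lo2, hi2 = pass_at_k_credible_interval(n=100, c=70, k=3)
print(f"pass@3 95% CI (n=100): [{lo2:.3f}, {hi2:.3f}]")
```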

Data Requirements

Each Dataset represents one complete conversation. A conversation is correct only if ALL interactions are correct:
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch

class AgenticRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="conversation_001",
                assistant_id="agent_v1",
                language="english",
                context="Math calculator conversation",
                conversation=[
                    Batch(
                        qa_id="q1_interaction1",
                        query="What is 5 + 3?",
                        assistant="The result is 8.",
                        ground_truth_assistant="8",
                        agentic={
                            "tools_used": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "result": 8,
                                "step": 1
                            }],
                            "final_answer_uses_tools": True
                        },
                        ground_truth_agentic={
                            "expected_tools": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "step": 1
                            }],
                            "tool_sequence_matters": False
                        }
                    ),
                    Batch(
                        qa_id="q1_interaction2",
                        query="What is 100 / 4?",
                        assistant="100 divided by 4 is 25.",
                        ground_truth_assistant="25"
                    ),
                ],
            ),
        ]

Agentic Data Structure

agentic (actual tool usage):
{
    "tools_used": [{"tool_name": "calculator", "parameters": {"a": 5, "b": 3}, "result": 8, "step": 1}],
    "final_answer_uses_tools": True
}
ground_truth_agentic (expected tool usage):
{
    "expected_tools": [{"tool_name": "calculator", "parameters": {"a": 5, "b": 3}, "step": 1}],
    "tool_sequence_matters": False
}
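For intuition, a simplified rule-based comparison of these two structures might look like the sketch below. The real evaluation is performed by the LLM judge with partial per-aspect scoring; this only checks exact name and parameter matches:

```python
# Simplified, rule-based comparison of tools_used vs expected_tools.
# Illustrates the data shapes only; the library's judge scores each
# aspect (selection, parameters, sequence, utilization) separately.

def tools_match(agentic: dict, ground_truth: dict) -> bool:
    used = agentic.get("tools_used", [])
    expected = ground_truth.get("expected_tools", [])
    if len(used) != len(expected):
        return False
    if not ground_truth.get("tool_sequence_matters", False):
        # Order-insensitive comparison: sort both sides by (name, params)
        key = lambda t: (t["tool_name"], sorted(t["parameters"].items()))
        used, expected = sorted(used, key=key), sorted(expected, key=key)
    return all(
        u["tool_name"] == e["tool_name"] and u["parameters"] == e["parameters"]
        for u, e in zip(used, expected)
    )

agentic = {
    "tools_used": [{"tool_name": "calculator", "parameters": {"a": 5, "b": 3},
                    "result": 8, "step": 1}],
    "final_answer_uses_tools": True,
}
gt = {
    "expected_tools": [{"tool_name": "calculator",
                        "parameters": {"a": 5, "b": 3}, "step": 1}],
    "tool_sequence_matters": False,
}
print(tools_match(agentic, gt))  # True
```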

Output Schema

AgenticMetric

class AgenticMetric(BaseMetric):
    session_id: str                                             # Unique conversation ID
    total_interactions: int                                     # Interactions in conversation
    correct_interactions: int                                   # Correct interactions
    is_fully_correct: bool                                      # True if ALL interactions correct
    threshold: float                                            # Answer correctness threshold
    correctness_scores: list[float]                             # Score per interaction
    correct_indices: list[int]                                  # Indices of correct interactions
    tool_correctness_scores: list[ToolCorrectnessScore | None]  # Tool scores per interaction
    k: int                                                      # Attempts used for pass@K
    pass_at_k: float                                            # P(≥1 fully correct in k attempts)
    pass_at_k_ci_low: float | None                              # Lower CI — Bayesian only
    pass_at_k_ci_high: float | None                             # Upper CI — Bayesian only
    pass_pow_k: float                                           # P(all k attempts fully correct)
    pass_pow_k_ci_low: float | None                             # Lower CI — Bayesian only
    pass_pow_k_ci_high: float | None                            # Upper CI — Bayesian only

ToolCorrectnessScore

class ToolCorrectnessScore(BaseModel):
    tool_selection_correct: float   # 0-1: Correct tools chosen
    parameter_accuracy: float       # 0-1: Correct parameters passed
    sequence_correct: float         # 0-1: Correct order (if required)
    result_utilization: float       # 0-1: Tool results used in answer
    overall_correctness: float      # Weighted average
    is_correct: bool                # overall >= tool_threshold
    reasoning: str | None           # Explanation

Interpretation

pass@K vs pass^K

| Metric | Formula | Meaning |
|---|---|---|
| pass_at_k | 1 - (1-p)^k | Probability of ≥1 correct conversation in k attempts |
| pass_pow_k | p^k | Probability of ALL k attempts being correct |
Examples with k=3:
  • pass@3 = 0.92 → 92% chance of getting ≥1 fully correct conversation in 3 attempts
  • pass^3 = 0.15 → 15% chance all 3 conversations are fully correct

Agent Quality Assessment

| pass@K | pass^K | Assessment |
|---|---|---|
| >0.95 | >0.70 | Reliable — High success and consistency |
| >0.95 | <0.50 | ⚠️ Inconsistent — Can succeed but unreliable |
| <0.70 | any | Needs Improvement — Low success rate |
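The thresholds above can be turned into a small helper. `assess_agent` is a hypothetical name, and the "Mixed" label covers combinations the table leaves implicit:

```python
# Hypothetical helper mapping the assessment thresholds to labels.
# "Mixed" is an added catch-all for combinations outside the table.

def assess_agent(pass_at_k: float, pass_pow_k: float) -> str:
    if pass_at_k > 0.95 and pass_pow_k > 0.70:
        return "Reliable"
    if pass_at_k > 0.95 and pass_pow_k < 0.50:
        return "Inconsistent"
    if pass_at_k < 0.70:
        return "Needs Improvement"
    return "Mixed"

print(assess_agent(0.97, 0.80))  # Reliable
print(assess_agent(0.60, 0.10))  # Needs Improvement
```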

Tool Correctness Scores

| Score Range | Interpretation |
|---|---|
| 1.0 | Perfect — all aspects correct |
| 0.75–0.99 | Good — minor issues |
| 0.50–0.74 | Moderate — several issues |
| < 0.50 | Poor — significant problems |

Aggregation Functions

pass_at_k and pass_pow_k are embedded in each AgenticMetric. The standalone functions are available for computing additional K values after the fact:
from fair_forge.metrics.agentic import pass_at_k, pass_pow_k

for metric in metrics:
    for k in [1, 3, 5, 10]:
        pak = pass_at_k(metric.total_interactions, metric.correct_interactions, k)
        ppk = pass_pow_k(metric.total_interactions, metric.correct_interactions, k)
        print(f"  K={k}: pass@K={pak:.3f}  pass^K={ppk:.3f}")

LLM Provider Options

from langchain_groq import ChatGroq

model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

Custom Tool Weights

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    tool_weights={
        "selection": 0.4,
        "parameters": 0.2,
        "sequence": 0.1,
        "utilization": 0.3,
    },
)
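Assuming overall_correctness is a simple weighted average of the four aspect scores (as the Parameters table suggests), the effect of custom weights can be sketched as follows. The aspect scores here are made-up illustration values:

```python
# Illustration of combining per-aspect tool scores under custom weights.
# Assumes overall_correctness is a plain weighted average (per the
# Parameters table); aspect_scores are invented example values.

tool_weights = {"selection": 0.4, "parameters": 0.2, "sequence": 0.1, "utilization": 0.3}
aspect_scores = {"selection": 1.0, "parameters": 0.5, "sequence": 1.0, "utilization": 1.0}

# Weights should cover all four aspects and sum to 1.0.
assert abs(sum(tool_weights.values()) - 1.0) < 1e-9

overall = sum(tool_weights[a] * aspect_scores[a] for a in tool_weights)
print(round(overall, 3))  # 0.9
# With tool_threshold=1.0 this interaction fails; with tool_threshold=0.75 it passes.
```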

Best Practices

If you have fewer than 30 conversations, frequentist pass@K estimates can be misleading. Bayesian mode shows you the credible interval, making it clear when more data is needed before drawing conclusions.
Choosing K:
  • K=1: Evaluate single conversation success rate
  • K=3–5: Balance between reliability and cost (recommended)
  • K=10+: High-stakes scenarios requiring high confidence
Choosing threshold:
  • Strict (0.8–0.9): Factual accuracy matters (medical, legal)
  • Moderate (0.7): General purpose — recommended default
  • Lenient (0.6): Creative or subjective tasks
Provide complete ground_truth_agentic per interaction with expected tool names, required parameters, whether sequence matters, and whether tool results should influence the final answer.
Larger models (GPT-4, Claude-3, Llama-3-70B+) provide more reliable correctness evaluations.

Troubleshooting

Answers scored incorrect too often: Lower the threshold parameter (try 0.6–0.65), use a more capable judge model, or ensure ground truth is clear and unambiguous. Check verbose logs to see judge reasoning.

Tool correctness always failing: The default tool_threshold=1.0 requires perfect tool correctness. Lower it with tool_threshold=0.75 to allow minor deviations. Verify tool names match exactly (case-sensitive) and check parameter structure.

None entries in tool_correctness_scores: This is expected — tool_correctness_scores[i] is None when an interaction did not use tools. Filter them out before aggregating:

valid = [tc for tc in metric.tool_correctness_scores if tc is not None]
avg = sum(tc.overall_correctness for tc in valid) / len(valid) if valid else 0.0

Wide credible intervals in Bayesian mode: A wide CI means there is not enough data to estimate the true success rate precisely. This is intentional — collect more test conversations to narrow the interval.

Next Steps

Statistical Modes

Deep dive into Frequentist vs Bayesian approaches

BestOf Metric

Compare multiple agents in tournament-style evaluation

AWS Lambda

Deploy Agentic as a serverless function