
Agentic Metric

The Agentic metric evaluates AI agent performance by measuring complete conversation correctness. A conversation is correct only if ALL its interactions are correct. It supports pluggable statistical modes: the frequentist mode returns point estimates for pass@K, while the Bayesian mode propagates the uncertainty in the estimated success rate through the pass@K formula to produce credible intervals.

Overview

  • Conversation Correctness: A conversation is correct only if ALL interactions are correct
  • pass@K: Probability of ≥1 correct conversation when attempting k conversations (0.0–1.0)
  • pass^K: Probability of all k conversations being correct (0.0–1.0)
  • Tool Correctness: Evaluates tool selection, parameter accuracy, execution sequence, and result utilization per interaction

Formulas

pass@k = 1 - (1 - p)^k   # Probability of ≥1 correct in k independent attempts
pass^k = p^k              # Probability of all k attempts correct

Where p = estimated success rate from evaluation
  • Frequentist: p = c/n — a point estimate
  • Bayesian: p is a Beta-Binomial posterior distribution — the pass@K formula is applied across all posterior samples, yielding a credible interval for pass@K and pass^K
k is a required parameter. pass@K and pass^K are computed per conversation using n = total_interactions and c = correct_interactions. The default tool_threshold=1.0 requires perfect tool usage — lower it (e.g. 0.75) to allow minor deviations.
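As a sketch, the two formulas can be written directly in Python. The helper names below are illustrative, not part of the library:

```python
# Minimal sketch of the pass@K / pass^K formulas in frequentist mode.
# pass_at_k_point / pass_pow_k_point are illustrative names, not library functions.

def pass_at_k_point(n: int, c: int, k: int) -> float:
    """P(at least 1 fully correct conversation in k independent attempts)."""
    p = c / n  # point-estimate success rate
    return 1 - (1 - p) ** k

def pass_pow_k_point(n: int, c: int, k: int) -> float:
    """P(all k independent attempts fully correct)."""
    p = c / n
    return p ** k

# With 7 correct out of 10 interactions and k=3:
print(round(pass_at_k_point(10, 7, 3), 3))   # 0.973
print(round(pass_pow_k_point(10, 7, 3), 3))  # 0.343
```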

Installation

uv add "alquimia-fair-forge[agentic]"
uv add langchain-groq  # Or your preferred LLM provider

Basic Usage

from fair_forge.metrics.agentic import Agentic
from langchain_groq import ChatGroq
from your_retriever import AgenticRetriever

judge_model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    threshold=0.7,
)

for metric in metrics:
    print(f"{metric.session_id}:")
    print(f"  pass@{metric.k} = {metric.pass_at_k:.3f}")
    print(f"  pass^{metric.k} = {metric.pass_pow_k:.3f}")
    # metric.pass_at_k_ci_low / ci_high are None in frequentist mode

Parameters

Required Parameters

| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class — each Dataset = 1 conversation |
| model | BaseChatModel | LangChain-compatible model for LLM-as-judge evaluation |
| k | int | Number of independent attempts for pass@K/pass^K computation |

Optional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| threshold | float | 0.7 | Answer correctness threshold (0.0–1.0) |
| tool_threshold | float | 1.0 | Tool correctness threshold (0.0–1.0) |
| tool_weights | dict[str, float] | 0.25 each | Weights for tool aspects (selection, parameters, sequence, utilization) |
| use_structured_output | bool | True | Use LangChain structured output |
| bos_json_clause | str | "```json" | JSON block start marker |
| eos_json_clause | str | "```" | JSON block end marker |
| verbose | bool | False | Enable verbose logging |

Statistical Modes

Frequentist Mode (default)

Computes p = c/n as a point estimate and plugs it directly into the pass@K formulas. Simple and fast.
# With 7 correct out of 10 interactions, k=3:
# p = 7/10 = 0.70
# pass@3 = 1 - (1 - 0.70)^3 = 0.973
# pass^3 = 0.70^3 = 0.343
In frequentist mode, pass_at_k_ci_low, pass_at_k_ci_high, pass_pow_k_ci_low, and pass_pow_k_ci_high are all None.
Bayesian Mode

Samples p from a Beta-Binomial posterior and applies the pass@K formulas to every posterior sample, yielding credible intervals instead of a single point estimate.

Why Bayesian matters for agentic evaluation: A pass@3 of 0.90 sounds great — but if it comes from only 5 conversations, the 95% CI might be [0.55, 0.99]. With 100 conversations, the same rate gives [0.84, 0.95], which is much more trustworthy. Use Bayesian mode when you have few test conversations and need to communicate reliability honestly.
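As an illustration of the idea (not the library's internals), a Beta posterior can be sampled and pushed through the pass@K formula to obtain a credible interval. The uniform Beta(1,1) prior and the function name below are assumptions for the sketch:

```python
import random

def pass_at_k_credible_interval(n, c, k, samples=20_000, seed=0):
    """Approximate a 95% credible interval for pass@K via posterior sampling.

    Assumes a uniform Beta(1,1) prior, so the posterior on the success
    rate p is Beta(c+1, n-c+1). Each posterior draw of p is pushed
    through the pass@K formula 1 - (1-p)^k, then we read off the
    2.5th and 97.5th percentiles of the resulting samples.
    """
    rng = random.Random(seed)
    draws = sorted(
        1 - (1 - rng.betavariate(c + 1, n - c + 1)) ** k
        for _ in range(samples)
    )
    return draws[int(0.025 * samples)], draws[int(0.975 * samples)]

# Few interactions -> wide interval; more data -> narrower interval.
lo, hi = pass_at_k_credible_interval(n=10, c=7, k=3)
print(f"pass@3 95% CI (n=10):  [{lo:.3f}, {hi:.3f}]")
lo2, hi2 = pass_at_k_credible_interval(n=100, c=70, k=3)
print(f"pass@3 95% CI (n=100): [{lo2:.3f}, {hi2:.3f}]")
```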

Data Requirements

Each Dataset represents one complete conversation. A conversation is correct only if ALL interactions are correct:
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch

class AgenticRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="conversation_001",
                assistant_id="agent_v1",
                language="english",
                context="Math calculator conversation",
                conversation=[
                    Batch(
                        qa_id="q1_interaction1",
                        query="What is 5 + 3?",
                        assistant="The result is 8.",
                        ground_truth_assistant="8",
                        agentic={
                            "tools_used": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "result": 8,
                                "step": 1
                            }],
                            "final_answer_uses_tools": True
                        },
                        ground_truth_agentic={
                            "expected_tools": [{
                                "tool_name": "calculator",
                                "parameters": {"a": 5, "b": 3},
                                "step": 1
                            }],
                            "tool_sequence_matters": False
                        }
                    ),
                    Batch(
                        qa_id="q1_interaction2",
                        query="What is 100 / 4?",
                        assistant="100 divided by 4 is 25.",
                        ground_truth_assistant="25"
                    ),
                ],
            ),
        ]

Agentic Data Structure

agentic (actual tool usage):
{
    "tools_used": [{"tool_name": "calculator", "parameters": {"a": 5, "b": 3}, "result": 8, "step": 1}],
    "final_answer_uses_tools": True
}
ground_truth_agentic (expected tool usage):
{
    "expected_tools": [{"tool_name": "calculator", "parameters": {"a": 5, "b": 3}, "step": 1}],
    "tool_sequence_matters": False
}
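For intuition, a simplified rule-based comparison of these two structures might look like the sketch below. The real evaluation is performed by the LLM judge with partial per-aspect scoring; this only checks exact name and parameter matches:

```python
# Simplified, rule-based comparison of tools_used vs expected_tools.
# Illustrates the data shapes only; the library's judge scores each
# aspect (selection, parameters, sequence, utilization) separately.

def tools_match(agentic: dict, ground_truth: dict) -> bool:
    used = agentic.get("tools_used", [])
    expected = ground_truth.get("expected_tools", [])
    if len(used) != len(expected):
        return False
    if not ground_truth.get("tool_sequence_matters", False):
        # Order-insensitive comparison: sort both sides by (name, params)
        key = lambda t: (t["tool_name"], sorted(t["parameters"].items()))
        used, expected = sorted(used, key=key), sorted(expected, key=key)
    return all(
        u["tool_name"] == e["tool_name"] and u["parameters"] == e["parameters"]
        for u, e in zip(used, expected)
    )

agentic = {
    "tools_used": [{"tool_name": "calculator", "parameters": {"a": 5, "b": 3},
                    "result": 8, "step": 1}],
    "final_answer_uses_tools": True,
}
gt = {
    "expected_tools": [{"tool_name": "calculator",
                        "parameters": {"a": 5, "b": 3}, "step": 1}],
    "tool_sequence_matters": False,
}
print(tools_match(agentic, gt))  # True
```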

Output Schema

AgenticMetric

class AgenticMetric(BaseMetric):
    session_id: str                                             # Unique conversation ID
    total_interactions: int                                     # Interactions in conversation
    correct_interactions: int                                   # Correct interactions
    is_fully_correct: bool                                      # True if ALL interactions correct
    threshold: float                                            # Answer correctness threshold
    correctness_scores: list[float]                             # Score per interaction
    correct_indices: list[int]                                  # Indices of correct interactions
    tool_correctness_scores: list[ToolCorrectnessScore | None]  # Tool scores per interaction
    k: int                                                      # Attempts used for pass@K
    pass_at_k: float                                            # P(≥1 fully correct in k attempts)
    pass_at_k_ci_low: float | None                              # Lower CI — Bayesian only
    pass_at_k_ci_high: float | None                             # Upper CI — Bayesian only
    pass_pow_k: float                                           # P(all k attempts fully correct)
    pass_pow_k_ci_low: float | None                             # Lower CI — Bayesian only
    pass_pow_k_ci_high: float | None                            # Upper CI — Bayesian only

ToolCorrectnessScore

class ToolCorrectnessScore(BaseModel):
    tool_selection_correct: float   # 0-1: Correct tools chosen
    parameter_accuracy: float       # 0-1: Correct parameters passed
    sequence_correct: float         # 0-1: Correct order (if required)
    result_utilization: float       # 0-1: Tool results used in answer
    overall_correctness: float      # Weighted average
    is_correct: bool                # overall >= tool_threshold
    reasoning: str | None           # Explanation

Interpretation

pass@K vs pass^K

| Metric | Formula | Meaning |
|---|---|---|
| pass_at_k | 1 - (1-p)^k | Probability of ≥1 correct conversation in k attempts |
| pass_pow_k | p^k | Probability of ALL k attempts being correct |
Examples with k=3:
  • pass@3 = 0.92 → 92% chance of getting ≥1 fully correct conversation in 3 attempts
  • pass^3 = 0.15 → 15% chance all 3 conversations are fully correct

Agent Quality Assessment

| pass@K | pass^K | Assessment |
|---|---|---|
| >0.95 | >0.70 | Reliable — High success and consistency |
| >0.95 | <0.50 | ⚠️ Inconsistent — Can succeed but unreliable |
| <0.70 | any | Needs Improvement — Low success rate |
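The thresholds above can be turned into a small helper. `assess_agent` is a hypothetical name, and the "Mixed" label covers combinations the table leaves implicit:

```python
# Hypothetical helper mapping the assessment thresholds to labels.
# "Mixed" is an added catch-all for combinations outside the table.

def assess_agent(pass_at_k: float, pass_pow_k: float) -> str:
    if pass_at_k > 0.95 and pass_pow_k > 0.70:
        return "Reliable"
    if pass_at_k > 0.95 and pass_pow_k < 0.50:
        return "Inconsistent"
    if pass_at_k < 0.70:
        return "Needs Improvement"
    return "Mixed"

print(assess_agent(0.97, 0.80))  # Reliable
print(assess_agent(0.60, 0.10))  # Needs Improvement
```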

Tool Correctness Scores

| Score Range | Interpretation |
|---|---|
| 1.0 | Perfect — all aspects correct |
| 0.75–0.99 | Good — minor issues |
| 0.50–0.74 | Moderate — several issues |
| < 0.50 | Poor — significant problems |

Aggregation Functions

pass_at_k and pass_pow_k are embedded in each AgenticMetric. The standalone functions are available for computing additional K values after the fact:
from fair_forge.metrics.agentic import pass_at_k, pass_pow_k

for metric in metrics:
    for k in [1, 3, 5, 10]:
        pak = pass_at_k(metric.total_interactions, metric.correct_interactions, k)
        ppk = pass_pow_k(metric.total_interactions, metric.correct_interactions, k)
        print(f"  K={k}: pass@K={pak:.3f}  pass^K={ppk:.3f}")

LLM Provider Options

from langchain_groq import ChatGroq

model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

Custom Tool Weights

metrics = Agentic.run(
    AgenticRetriever,
    model=judge_model,
    k=3,
    tool_weights={
        "selection": 0.4,
        "parameters": 0.2,
        "sequence": 0.1,
        "utilization": 0.3,
    },
)
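Assuming overall_correctness is a simple weighted average of the four aspect scores (as the Parameters table suggests), the effect of custom weights can be sketched as follows. The aspect scores here are made-up illustration values:

```python
# Illustration of combining per-aspect tool scores under custom weights.
# Assumes overall_correctness is a plain weighted average (per the
# Parameters table); aspect_scores are invented example values.

tool_weights = {"selection": 0.4, "parameters": 0.2, "sequence": 0.1, "utilization": 0.3}
aspect_scores = {"selection": 1.0, "parameters": 0.5, "sequence": 1.0, "utilization": 1.0}

# Weights should cover all four aspects and sum to 1.0.
assert abs(sum(tool_weights.values()) - 1.0) < 1e-9

overall = sum(tool_weights[a] * aspect_scores[a] for a in tool_weights)
print(round(overall, 3))  # 0.9
# With tool_threshold=1.0 this interaction fails; with tool_threshold=0.75 it passes.
```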

Best Practices

If you have fewer than 30 conversations, frequentist pass@K estimates can be misleading. Bayesian mode shows you the credible interval, making it clear when more data is needed before drawing conclusions.
Choosing K:
  • K=1: Evaluate single conversation success rate
  • K=3–5: Balance between reliability and cost (recommended)
  • K=10+: High-stakes scenarios requiring high confidence
Choosing threshold:
  • Strict (0.8–0.9): Factual accuracy matters (medical, legal)
  • Moderate (0.7): General purpose — recommended default
  • Lenient (0.6): Creative or subjective tasks
Provide complete ground_truth_agentic per interaction with expected tool names, required parameters, whether sequence matters, and whether tool results should influence the final answer.
Larger models (GPT-4, Claude-3, Llama-3-70B+) provide more reliable correctness evaluations.

Troubleshooting

Answers scored incorrect too often: Lower the threshold parameter (try 0.6–0.65), use a more capable judge model, or ensure ground truth is clear and unambiguous. Check verbose logs to see judge reasoning.

Tool correctness always failing: The default tool_threshold=1.0 requires perfect tool correctness. Lower it with tool_threshold=0.75 to allow minor deviations. Verify tool names match exactly (case-sensitive) and check parameter structure.

None entries in tool_correctness_scores: This is expected — tool_correctness_scores[i] is None when an interaction did not use tools. Filter them out before aggregating:

valid = [tc for tc in metric.tool_correctness_scores if tc is not None]
avg = sum(tc.overall_correctness for tc in valid) / len(valid) if valid else 0.0

Wide credible intervals in Bayesian mode: A wide CI means there is not enough data to estimate the true success rate precisely. This is intentional — collect more test conversations to narrow the interval.

Next Steps

Statistical Modes

Deep dive into Frequentist vs Bayesian approaches

BestOf Metric

Compare multiple agents in tournament-style evaluation

AWS Lambda

Deploy Agentic as a serverless function