Context Metric

The Context metric evaluates how well an AI assistant’s responses align with the provided system context. It accumulates context_awareness scores across all interactions in a session and emits one session-level result, with optional uncertainty quantification via Bayesian mode. The interactions list preserves per-QA scores for debugging.

Overview

  • Context Awareness: How closely the response follows the given context (0.0–1.0)
  • Session aggregate: Weighted mean across all interactions
  • Per-interaction detail: Each QA pair’s score accessible via interactions
  • Bayesian mode: Bootstrapped credible interval around the session mean
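The session aggregate above is just a weighted mean of per-interaction scores. A minimal sketch of that computation (illustrative names only, not the library's internals):

```python
def session_context_awareness(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of per-interaction context_awareness scores (each 0.0-1.0)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total

# Three interactions, the first weighted most heavily.
print(round(session_context_awareness([0.9, 0.8, 0.2], [0.5, 0.3, 0.2]), 2))  # 0.73
```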

Installation

uv add "alquimia-fair-forge[context]"
uv add langchain-groq  # Or your preferred LLM provider

Basic Usage

from fair_forge.metrics.context import Context
from langchain_groq import ChatGroq
from your_retriever import MyRetriever

judge_model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

metrics = Context.run(MyRetriever, model=judge_model)

for metric in metrics:
    print(f"Session: {metric.session_id}  ({metric.n_interactions} interactions)")
    print(f"  Context awareness: {metric.context_awareness:.2f}")

    for interaction in metric.interactions:
        status = "✅" if interaction.context_awareness >= 0.8 else "❌"
        print(f"  {status} [{interaction.qa_id}] {interaction.context_awareness:.2f}")

Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `retriever` | `Type[Retriever]` | Data source class |
| `model` | `BaseChatModel` | LangChain-compatible judge model |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `statistical_mode` | `StatisticalMode` | `FrequentistMode()` | Statistical computation mode |
| `use_structured_output` | `bool` | `False` | Use LangChain structured output |
| `bos_json_clause` | `str` | ``"```json"`` | JSON block start marker |
| `eos_json_clause` | `str` | ``"```"`` | JSON block end marker |
| `verbose` | `bool` | `False` | Enable verbose logging |

Statistical Modes

Frequentist mode (the default) returns the weighted mean of per-interaction scores; the CI fields are None. Bayesian mode instead bootstraps a credible interval around the session mean, populating both CI fields.
metric.context_awareness          # 0.78
metric.context_awareness_ci_low   # None
metric.context_awareness_ci_high  # None
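The Bayesian mode's credible interval is bootstrapped from the per-interaction scores. A pure-Python sketch of that idea (a percentile bootstrap; illustrative only, not the library's implementation):

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=5000, ci_level=0.95, seed=0):
    """Percentile bootstrap interval around the mean of per-interaction scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - ci_level) / 2 * n_resamples)
    hi_idx = int((1 + ci_level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

scores = [0.9, 0.8, 0.2]
lo, hi = bootstrap_ci(scores)
print(f"mean={statistics.fmean(scores):.2f}  CI=[{lo:.2f}, {hi:.2f}]")
```

With only three interactions the interval is very wide, which is exactly the uncertainty Bayesian mode is meant to surface.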

Interaction Weights

Each Batch can carry an optional weight to control its contribution to the session aggregate:
# Weight critical interactions more heavily
Batch(qa_id="q1", ..., weight=0.5),   # Most important
Batch(qa_id="q2", ..., weight=0.3),
Batch(qa_id="q3", ..., weight=0.2),   # Least important

| Case | Behavior |
| --- | --- |
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted |
| No weights provided | Equal weights (1/n each) |

Output Schema

ContextMetric

class ContextMetric(BaseMetric):
    session_id: str
    assistant_id: str
    n_interactions: int                    # Number of interactions evaluated
    context_awareness: float               # Weighted mean (0.0-1.0)
    context_awareness_ci_low: float | None # Lower credible bound — Bayesian only
    context_awareness_ci_high: float | None # Upper credible bound — Bayesian only
    interactions: list[ContextInteraction]  # Per-QA scores

ContextInteraction

class ContextInteraction(BaseModel):
    qa_id: str
    context_awareness: float   # Per-interaction score (0.0-1.0)

Interpretation

Context Awareness Score

| Score Range | Interpretation |
| --- | --- |
| 0.8–1.0 | Excellent — response fully follows context |
| 0.6–0.8 | Good — mostly follows context with minor deviations |
| 0.4–0.6 | Moderate — partially follows context |
| 0.2–0.4 | Poor — significant deviations |
| 0.0–0.2 | Very poor — ignores or contradicts context |
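A hypothetical helper (not part of the library) that maps a score to the bands above, e.g. for labeling results in a report:

```python
def interpret_score(score: float) -> str:
    """Map a context_awareness score (0.0-1.0) to its interpretation band."""
    bands = [
        (0.8, "Excellent"),
        (0.6, "Good"),
        (0.4, "Moderate"),
        (0.2, "Poor"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "Very poor"

print(interpret_score(0.85))  # Excellent
print(interpret_score(0.30))  # Poor
```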

Complete Example

import os
from fair_forge.metrics.context import Context
from fair_forge.statistical import BayesianMode
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch
from langchain_groq import ChatGroq

class ContextTestRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        context = """You are a helpful customer service assistant for TechStore.
        Key policies:
        - Returns accepted within 30 days with receipt
        - Free shipping on orders over $50
        - Support hours: Mon-Fri 9am-5pm EST
        Always be polite and offer to help further."""

        return [
            Dataset(
                session_id="context-eval-001",
                assistant_id="techstore-bot",
                language="english",
                context=context,
                conversation=[
                    Batch(
                        qa_id="q1",
                        query="What's your return policy?",
                        assistant="Returns within 30 days with a receipt. Anything else I can help with?",
                        ground_truth_assistant="Returns within 30 days with receipt.",
                    ),
                    Batch(
                        qa_id="q2",
                        query="Do you offer free shipping?",
                        assistant="Yes, free shipping on orders over $50.",
                        ground_truth_assistant="Free shipping on orders over $50.",
                    ),
                    Batch(
                        qa_id="q3",
                        query="What are your hours?",
                        assistant="We're open 24/7!",  # Wrong — context says Mon-Fri 9-5
                        ground_truth_assistant="Mon-Fri 9am-5pm EST.",
                    ),
                ]
            )
        ]

judge = ChatGroq(model="llama-3.3-70b-versatile", api_key=os.getenv("GROQ_API_KEY"), temperature=0.0)

metrics = Context.run(
    ContextTestRetriever,
    model=judge,
    statistical_mode=BayesianMode(mc_samples=5000, ci_level=0.95),
    use_structured_output=True,
)

for metric in metrics:
    print(f"Session: {metric.session_id}  ({metric.n_interactions} interactions)")
    ci = f"  [{metric.context_awareness_ci_low:.2f}, {metric.context_awareness_ci_high:.2f}]" \
         if metric.context_awareness_ci_low is not None else ""
    print(f"  Context awareness: {metric.context_awareness:.2f}{ci}")
    print()
    print("  Per-interaction:")
    for interaction in metric.interactions:
        status = "✅ PASS" if interaction.context_awareness >= 0.8 else "❌ FAIL"
        print(f"    [{interaction.qa_id}] {status}  score={interaction.context_awareness:.2f}")

LLM Provider Options

from langchain_groq import ChatGroq
judge = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)
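Any LangChain-compatible chat model can serve as the judge. For example, assuming the langchain-openai package is installed (model name is illustrative):

```python
from langchain_openai import ChatOpenAI
judge = ChatOpenAI(model="gpt-4o", api_key="your-api-key", temperature=0.0)
```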

Best Practices

Use Bayesian mode for small sessions. A session with 3 interactions gives a very uncertain mean; Bayesian mode expresses this with a wide CI, preventing overconfident conclusions.

Write clear context. Include specific, actionable instructions:
context = """You are a support assistant for Acme Corp.
Rules:
1. Always greet customers by name if available
2. Never discuss competitors
3. Escalate billing issues to human support"""

Provide ground truth. Supplying ground_truth_assistant improves evaluation:
Batch(
    qa_id="q1",
    query="What's your refund policy?",
    assistant="Refunds take 5-7 business days...",
    ground_truth_assistant="Refunds processed within 5-7 business days to original payment method.",
)

Weight by importance. If some QA pairs test more important context rules, give them higher weights:
Batch(qa_id="policy_question",   ..., weight=0.6),  # High-stakes
Batch(qa_id="greeting",          ..., weight=0.2),
Batch(qa_id="general_question",  ..., weight=0.2),

Next Steps

Statistical Modes

Frequentist vs Bayesian approaches

Conversational Metric

Evaluate dialogue quality with Grice’s maxims

Regulatory Metric

Compliance against a regulatory corpus