Context Metric

The Context metric evaluates how well an AI assistant’s responses align with the provided system context. It accumulates context_awareness scores across all interactions in a session and emits one session-level result, with optional uncertainty quantification via Bayesian mode. The interactions list preserves per-QA scores for debugging.

Overview

  • Context Awareness: How closely the response follows the given context (0.0–1.0)
  • Session aggregate: Weighted mean across all interactions
  • Per-interaction detail: Each QA pair’s score accessible via interactions
  • Bayesian mode: Bootstrapped credible interval around the session mean

Installation

uv add "alquimia-fair-forge[context]"
uv add langchain-groq  # Or your preferred LLM provider

Basic Usage

from fair_forge.metrics.context import Context
from langchain_groq import ChatGroq
from your_retriever import MyRetriever

judge_model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

metrics = Context.run(MyRetriever, model=judge_model)

for metric in metrics:
    print(f"Session: {metric.session_id}  ({metric.n_interactions} interactions)")
    print(f"  Context awareness: {metric.context_awareness:.2f}")

    for interaction in metric.interactions:
        status = "✅" if interaction.context_awareness >= 0.8 else "❌"
        print(f"  {status} [{interaction.qa_id}] {interaction.context_awareness:.2f}")

Parameters

Required Parameters

| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class |
| model | BaseChatModel | LangChain-compatible judge model |

Optional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| use_structured_output | bool | False | Use LangChain structured output |
| bos_json_clause | str | "```json" | JSON block start marker |
| eos_json_clause | str | "```" | JSON block end marker |
| verbose | bool | False | Enable verbose logging |
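
The bos_json_clause and eos_json_clause markers suggest that, when structured output is not used, the judge's JSON verdict is located inside a free-text reply. The sketch below shows one way such markers could be used to pull out the payload. The helper extract_json_block, the sample reply, and the "score" field are hypothetical and not part of fair_forge; the library's actual parsing may differ.

```python
import json

# Three backticks, built programmatically so the example stays easy to embed.
FENCE = "`" * 3


def extract_json_block(text: str, bos: str = FENCE + "json", eos: str = FENCE) -> dict:
    """Parse the JSON found between the first `bos` marker and the next `eos` marker."""
    start = text.find(bos)
    if start == -1:
        raise ValueError("start marker not found")
    start += len(bos)
    end = text.find(eos, start)
    if end == -1:
        raise ValueError("end marker not found")
    return json.loads(text[start:end])


# Hypothetical judge reply; the field name "score" is made up for illustration.
reply = f'Verdict follows.\n{FENCE}json\n{{"score": 0.9}}\n{FENCE}\nDone.'
parsed = extract_json_block(reply)
print(parsed["score"])
```

If a judge model wraps its JSON in different delimiters, overriding the two markers would be the natural adjustment.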

Statistical Modes

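Frequentist mode (the default, FrequentistMode()) returns the weighted mean of per-interaction scores, and the CI fields are None. Bayesian mode (BayesianMode) additionally fills in the credible-interval fields.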
metric.context_awareness          # 0.78
metric.context_awareness_ci_low   # None
metric.context_awareness_ci_high  # None
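
As a rough illustration of the frequentist aggregate, the weighted mean can be computed as below. This is a minimal sketch with made-up scores; weighted_mean is a hypothetical helper, not a fair_forge function, and it assumes the weights already sum to 1.0.

```python
def weighted_mean(scores: list[float], weights: list[float]) -> float:
    """Weighted mean of per-interaction scores; weights are assumed to sum to 1.0."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights))


scores = [0.9, 0.8, 0.5]   # made-up per-interaction context_awareness values
weights = [0.5, 0.3, 0.2]  # one weight per interaction
session_score = weighted_mean(scores, weights)
print(f"{session_score:.2f}")  # prints 0.79
```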

Interaction Weights

Each Batch can carry an optional weight to control its contribution to the session aggregate:
# Weight critical interactions more heavily
Batch(qa_id="q1", ..., weight=0.5),   # Most important
Batch(qa_id="q2", ..., weight=0.3),
Batch(qa_id="q3", ..., weight=0.2),   # Least important

| Case | Behavior |
|---|---|
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted |
| No weights provided | Equal weights (1/n each) |
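
The four cases above can be sketched as a single resolution function. This is an illustration of the documented behavior, not the library's code: resolve_weights and its tolerance parameter are hypothetical, and the source does not say what happens when partially provided weights already exceed 1.0, so that case is left unhandled here.

```python
import warnings
from typing import Optional


def resolve_weights(raw: list[Optional[float]], tol: float = 1e-6) -> list[float]:
    """Turn optional per-interaction weights into one concrete weight per interaction."""
    n = len(raw)
    if n == 0:
        return []
    given = [w for w in raw if w is not None]
    if not given:
        # No weights provided: equal weights (1/n each).
        return [1.0 / n] * n
    if len(given) == n:
        if abs(sum(given) - 1.0) <= tol:
            # All weights provided and they sum to 1.0: use as-is.
            return list(given)
        # All weights provided but the sum is off: warn and fall back to equal weights.
        warnings.warn("weights do not sum to 1.0; applying equal weights")
        return [1.0 / n] * n
    # Some weights provided: split the remaining weight equally among the rest.
    share = (1.0 - sum(given)) / (n - len(given))
    return [share if w is None else w for w in raw]


print(resolve_weights([0.6, None, None]))
```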

Output Schema

ContextMetric

class ContextMetric(BaseMetric):
    session_id: str
    assistant_id: str
    n_interactions: int                    # Number of interactions evaluated
    context_awareness: float               # Weighted mean (0.0-1.0)
    context_awareness_ci_low: float | None # Lower credible bound — Bayesian only
    context_awareness_ci_high: float | None # Upper credible bound — Bayesian only
    interactions: list[ContextInteraction]  # Per-QA scores

ContextInteraction

class ContextInteraction(BaseModel):
    qa_id: str
    context_awareness: float   # Per-interaction score (0.0-1.0)

Interpretation

Context Awareness Score

| Score Range | Interpretation |
|---|---|
| 0.8–1.0 | Excellent: response fully follows context |
| 0.6–0.8 | Good: mostly follows context with minor deviations |
| 0.4–0.6 | Moderate: partially follows context |
| 0.2–0.4 | Poor: significant deviations |
| 0.0–0.2 | Very poor: ignores or contradicts context |
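
For reporting, the bands above can be mapped to labels in code. The ranges in the table share their boundary values, so this sketch assumes each band includes its lower bound (consistent with the >= 0.8 check used in the usage examples). The interpret function is a hypothetical helper, not part of fair_forge.

```python
def interpret(score: float) -> str:
    """Map a context_awareness score to its interpretation band (lower bound inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Moderate"
    if score >= 0.2:
        return "Poor"
    return "Very poor"


for value in (0.95, 0.8, 0.55, 0.1):
    print(value, interpret(value))
```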

Complete Example

import os
from fair_forge.metrics.context import Context
from fair_forge.statistical import BayesianMode
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch
from langchain_groq import ChatGroq

class ContextTestRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        context = """You are a helpful customer service assistant for TechStore.
        Key policies:
        - Returns accepted within 30 days with receipt
        - Free shipping on orders over $50
        - Support hours: Mon-Fri 9am-5pm EST
        Always be polite and offer to help further."""

        return [
            Dataset(
                session_id="context-eval-001",
                assistant_id="techstore-bot",
                language="english",
                context=context,
                conversation=[
                    Batch(
                        qa_id="q1",
                        query="What's your return policy?",
                        assistant="Returns within 30 days with a receipt. Anything else I can help with?",
                        ground_truth_assistant="Returns within 30 days with receipt.",
                    ),
                    Batch(
                        qa_id="q2",
                        query="Do you offer free shipping?",
                        assistant="Yes, free shipping on orders over $50.",
                        ground_truth_assistant="Free shipping on orders over $50.",
                    ),
                    Batch(
                        qa_id="q3",
                        query="What are your hours?",
                        assistant="We're open 24/7!",  # Wrong — context says Mon-Fri 9-5
                        ground_truth_assistant="Mon-Fri 9am-5pm EST.",
                    ),
                ]
            )
        ]

judge = ChatGroq(model="llama-3.3-70b-versatile", api_key=os.getenv("GROQ_API_KEY"), temperature=0.0)

metrics = Context.run(
    ContextTestRetriever,
    model=judge,
    statistical_mode=BayesianMode(mc_samples=5000, ci_level=0.95),
    use_structured_output=True,
)

for metric in metrics:
    print(f"Session: {metric.session_id}  ({metric.n_interactions} interactions)")
    ci = f"  [{metric.context_awareness_ci_low:.2f}, {metric.context_awareness_ci_high:.2f}]" \
         if metric.context_awareness_ci_low is not None else ""
    print(f"  Context awareness: {metric.context_awareness:.2f}{ci}")
    print()
    print("  Per-interaction:")
    for interaction in metric.interactions:
        status = "✅ PASS" if interaction.context_awareness >= 0.8 else "❌ FAIL"
        print(f"    [{interaction.qa_id}] {status}  score={interaction.context_awareness:.2f}")

LLM Provider Options

Any LangChain-compatible chat model can act as the judge. For example, with Groq:
from langchain_groq import ChatGroq
judge = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

Best Practices

Prefer Bayesian mode for short sessions. A session with only 3 interactions gives a very uncertain mean; Bayesian mode expresses this with a wide credible interval, which guards against overconfident conclusions.
Include specific, actionable instructions in the context:
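
To see why few interactions lead to a wide interval, consider the generic Bayesian bootstrap sketched below. It is not necessarily how BayesianMode is implemented: bootstrap_interval is a hypothetical helper that only borrows the mc_samples and ci_level names, and the scores are made up.

```python
import random


def bootstrap_interval(scores, mc_samples=5000, ci_level=0.95, seed=0):
    """Bayesian bootstrap: credible interval for the mean under random Dirichlet weights."""
    rng = random.Random(seed)
    draws = []
    for _ in range(mc_samples):
        # Normalized Gamma(1, 1) draws are a sample from a flat Dirichlet distribution.
        gammas = [rng.gammavariate(1.0, 1.0) for _ in scores]
        total = sum(gammas)
        draws.append(sum(s * g / total for s, g in zip(scores, gammas)))
    draws.sort()
    lo_idx = int((1.0 - ci_level) / 2.0 * mc_samples)
    hi_idx = min(mc_samples - 1, int((1.0 + ci_level) / 2.0 * mc_samples))
    return draws[lo_idx], draws[hi_idx]


short_session = [0.9, 0.85, 0.2]   # 3 made-up interaction scores
long_session = short_session * 10  # same score pattern, 30 interactions

short_low, short_high = bootstrap_interval(short_session)
long_low, long_high = bootstrap_interval(long_session)
print(f"3 interactions:  width {short_high - short_low:.2f}")
print(f"30 interactions: width {long_high - long_low:.2f}")
```

Both sessions have the same mean, but the interval for the 3-interaction session comes out much wider, which is the uncertainty that a bare point estimate hides.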
context = """You are a support assistant for Acme Corp.
Rules:
1. Always greet customers by name if available
2. Never discuss competitors
3. Escalate billing issues to human support"""
Provide ground_truth_assistant for better evaluation:
Batch(
    qa_id="q1",
    query="What's your refund policy?",
    assistant="Refunds take 5-7 business days...",
    ground_truth_assistant="Refunds processed within 5-7 business days to original payment method.",
)
If some QA pairs test more important context rules, give them higher weights:
Batch(qa_id="policy_question",   ..., weight=0.6),  # High-stakes
Batch(qa_id="greeting",          ..., weight=0.2),
Batch(qa_id="general_question",  ..., weight=0.2),

Next Steps

  • Statistical Modes: Frequentist vs Bayesian approaches
  • Conversational Metric: Evaluate dialogue quality with Grice’s maxims
  • Regulatory Metric: Compliance against a regulatory corpus