The Context metric evaluates how well an AI assistant’s responses align with the provided system context. It accumulates context_awareness scores across all interactions in a session and emits one session-level result, with optional uncertainty quantification via Bayesian mode. The interactions list preserves per-QA scores for debugging.
Each Batch can carry an optional weight to control its contribution to the session aggregate:
# Weight critical interactions more heavily
Batch(qa_id="q1", ..., weight=0.5),  # Most important
Batch(qa_id="q2", ..., weight=0.3),
Batch(qa_id="q3", ..., weight=0.2),  # Least important
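To make the effect of weights concrete, here is a minimal sketch that assumes the session aggregate behaves like a normalized weighted mean. That aggregation rule is an assumption for illustration and may not match what the library does internally. The helper `weighted_session_score` and the per-interaction scores are hypothetical.

```python
def weighted_session_score(scores, weights=None):
    """Normalized weighted mean of per-interaction scores.

    Assumption: this only illustrates how weights could shift an
    aggregate. It is not taken from the library's implementation.
    """
    if not scores:
        raise ValueError("at least one score is required")
    if weights is None:
        # No weights given: every interaction counts equally.
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


# Hypothetical context_awareness scores for q1, q2, q3.
scores = [0.9, 0.8, 0.1]
weights = [0.5, 0.3, 0.2]

weighted = weighted_session_score(scores, weights)
unweighted = weighted_session_score(scores)
print(f"weighted={weighted:.2f} unweighted={unweighted:.2f}")
```

With these made-up numbers the weighted score is 0.71 against 0.60 unweighted, because the failing interaction (q3) carries the smallest weight.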
import os

from fair_forge.metrics.context import Context
from fair_forge.statistical import BayesianMode
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch
from langchain_groq import ChatGroq


class ContextTestRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        # System context the assistant is expected to follow
        context = """You are a helpful customer service assistant for TechStore.
        Key policies:
        - Returns accepted within 30 days with receipt
        - Free shipping on orders over $50
        - Support hours: Mon-Fri 9am-5pm EST
        Always be polite and offer to help further."""
        return [
            Dataset(
                session_id="context-eval-001",
                assistant_id="techstore-bot",
                language="english",
                context=context,
                conversation=[
                    Batch(
                        qa_id="q1",
                        query="What's your return policy?",
                        assistant="Returns within 30 days with a receipt. Anything else I can help with?",
                        ground_truth_assistant="Returns within 30 days with receipt.",
                    ),
                    Batch(
                        qa_id="q2",
                        query="Do you offer free shipping?",
                        assistant="Yes, free shipping on orders over $50.",
                        ground_truth_assistant="Free shipping on orders over $50.",
                    ),
                    Batch(
                        qa_id="q3",
                        query="What are your hours?",
                        assistant="We're open 24/7!",  # Wrong: context says Mon-Fri 9-5
                        ground_truth_assistant="Mon-Fri 9am-5pm EST.",
                    ),
                ],
            )
        ]


# Judge model that scores each interaction
judge = ChatGroq(
    model="llama-3.3-70b-versatile",
    api_key=os.getenv("GROQ_API_KEY"),
    temperature=0.0,
)

metrics = Context.run(
    ContextTestRetriever,
    model=judge,
    statistical_mode=BayesianMode(mc_samples=5000, ci_level=0.95),
    use_structured_output=True,
)

for metric in metrics:
    print(f"Session: {metric.session_id} ({metric.n_interactions} interactions)")
    # The interval bounds are only present when uncertainty was quantified
    ci = (
        f" [{metric.context_awareness_ci_low:.2f}, {metric.context_awareness_ci_high:.2f}]"
        if metric.context_awareness_ci_low is not None
        else ""
    )
    print(f"  Context awareness: {metric.context_awareness:.2f}{ci}")
    print()
    print("  Per-interaction:")
    for interaction in metric.interactions:
        status = "✅ PASS" if interaction.context_awareness >= 0.8 else "❌ FAIL"
        print(f"    [{interaction.qa_id}] {status}  score={interaction.context_awareness:.2f}")
A session with only 3 interactions gives a very uncertain mean. Bayesian mode reflects this with a wide interval between context_awareness_ci_low and context_awareness_ci_high, which guards against overconfident conclusions.
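The link between sample size and interval width can be sketched with a Bayesian bootstrap over per-interaction scores. This is a swapped-in illustration using only the standard library, not the library's BayesianMode implementation. The function `bayesian_bootstrap_interval` and the scores are hypothetical, and the parameter names `mc_samples` and `ci_level` are borrowed from the example above purely for readability.

```python
import random


def bayesian_bootstrap_interval(scores, mc_samples=5000, ci_level=0.95, seed=0):
    """Interval for the mean score via a Bayesian bootstrap.

    Assumption: illustrative stand-in only. Each Monte Carlo draw
    reweights the observed scores with Dirichlet(1, ..., 1) weights.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(mc_samples):
        # Normalized Exp(1) draws are distributed as Dirichlet(1, ..., 1).
        raw = [rng.expovariate(1.0) for _ in scores]
        total = sum(raw)
        draws.append(sum(s * w for s, w in zip(scores, raw)) / total)
    draws.sort()
    alpha = (1.0 - ci_level) / 2.0
    low = draws[int(alpha * (mc_samples - 1))]
    high = draws[int((1.0 - alpha) * (mc_samples - 1))]
    return low, high


# Hypothetical scores: the same pattern observed 3 times vs. 30 times.
small_session = [0.9, 0.8, 0.1]
large_session = small_session * 10

low_s, high_s = bayesian_bootstrap_interval(small_session)
low_l, high_l = bayesian_bootstrap_interval(large_session)
width_small = high_s - low_s
width_large = high_l - low_l
print(f"3 interactions:  [{low_s:.2f}, {high_s:.2f}]")
print(f"30 interactions: [{low_l:.2f}, {high_l:.2f}]")
```

Under these assumptions the 3-interaction interval is much wider than the 30-interaction one, even though both sessions have the same mean score.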
Provide Clear Context
Include specific, actionable instructions:
context = """You are a support assistant for Acme Corp.
Rules:
1. Always greet customers by name if available
2. Never discuss competitors
3. Escalate billing issues to human support"""
Include Ground Truth
Provide ground_truth_assistant for better evaluation:
Batch(
    qa_id="q1",
    query="What's your refund policy?",
    assistant="Refunds take 5-7 business days...",
    ground_truth_assistant="Refunds processed within 5-7 business days to original payment method.",
)
Weight Critical Interactions
If some QA pairs test more important context rules, give them higher weights: