Context Metric
The Context metric evaluates how well an AI assistant’s responses align with the provided system context. It accumulates `context_awareness` scores across all interactions in a session and emits one session-level result, with optional uncertainty quantification via Bayesian mode. The `interactions` list preserves per-QA scores for debugging.
Overview
- Context Awareness: How closely the response follows the given context (0.0–1.0)
- Session aggregate: Weighted mean across all interactions
- Per-interaction detail: Each QA pair’s score, accessible via `interactions`
- Bayesian mode: Bootstrapped credible interval around the session mean
Installation
Basic Usage
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class |
| model | BaseChatModel | LangChain-compatible judge model |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| use_structured_output | bool | False | Use LangChain structured output |
| bos_json_clause | str | "```json" | JSON block start marker |
| eos_json_clause | str | "```" | JSON block end marker |
| verbose | bool | False | Enable verbose logging |
Statistical Modes
- Frequentist: Returns the weighted mean of per-interaction scores. CI fields are `None`.
- Bayesian: Adds a bootstrapped credible interval around the session mean.
Interaction Weights
Each `Batch` can carry an optional weight to control its contribution to the session aggregate:
| Case | Behavior |
|---|---|
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted |
| No weights provided | Equal weights (1/n each) |
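The rules in the table above can be sketched in plain Python. This is an illustration only, not the library's implementation: `resolve_weights` and `weighted_mean` are hypothetical helper names, and the handling of partial weights that already exceed 1.0 is an assumption, since the table does not cover that case.

```python
import math
import warnings
from typing import List, Optional, Sequence


def resolve_weights(weights: Sequence[Optional[float]]) -> List[float]:
    """Hypothetical helper mirroring the weight rules in the table above."""
    n = len(weights)
    if n == 0:
        return []
    equal = [1.0 / n] * n
    provided = [w for w in weights if w is not None]

    # No weights provided: equal weights (1/n each).
    if not provided:
        return equal

    # All weights provided: use as-is only if they sum to 1.0.
    if len(provided) == n:
        if math.isclose(sum(provided), 1.0, abs_tol=1e-9):
            return list(provided)
        warnings.warn("Weights do not sum to 1.0; applying equal weights.")
        return equal

    # Some weights provided: split the remaining weight among the unweighted.
    remaining = 1.0 - sum(provided)
    if remaining < 0:
        # Assumption: the table does not say what happens here,
        # so this sketch falls back to equal weights.
        warnings.warn("Provided weights exceed 1.0; applying equal weights.")
        return equal
    share = remaining / (n - len(provided))
    return [share if w is None else w for w in weights]


def weighted_mean(scores: Sequence[float],
                  weights: Sequence[Optional[float]]) -> float:
    """Session aggregate: weighted mean of per-interaction scores."""
    resolved = resolve_weights(weights)
    return sum(w * s for w, s in zip(resolved, scores))
```

For example, passing weights `[0.5, None, None]` leaves 0.5 to be shared, so the two unweighted interactions each receive 0.25.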
Output Schema
ContextMetric
ContextInteraction
Interpretation
Context Awareness Score
| Score Range | Interpretation |
|---|---|
| 0.8–1.0 | Excellent — response fully follows context |
| 0.6–0.8 | Good — mostly follows context with minor deviations |
| 0.4–0.6 | Moderate — partially follows context |
| 0.2–0.4 | Poor — significant deviations |
| 0.0–0.2 | Very poor — ignores or contradicts context |
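A small lookup can turn a score into the label from the table above. This is a sketch under one assumption: the table's ranges share their boundary values (0.8 appears in two rows), so the helper below treats each lower bound as inclusive. `interpret_context_awareness` is a hypothetical name, not part of the library.

```python
def interpret_context_awareness(score: float) -> str:
    """Map a context awareness score to the label used in the table above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    # Assumption: each band includes its lower bound (0.8 counts as Excellent).
    bands = [
        (0.8, "Excellent"),
        (0.6, "Good"),
        (0.4, "Moderate"),
        (0.2, "Poor"),
    ]
    for lower, label in bands:
        if score >= lower:
            return label
    return "Very poor"
```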
Complete Example
LLM Provider Options
Best Practices
Use Bayesian Mode for Small Sessions
A session with 3 interactions gives a very uncertain mean. Bayesian mode expresses this with a wide CI, preventing overconfident conclusions.
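To see why a small session produces a wide interval, the following sketch runs a plain percentile bootstrap over per-interaction scores using only the standard library. It illustrates the general idea and may differ from how the library's Bayesian mode computes its interval; `bootstrap_mean_interval` and its parameters are hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def bootstrap_mean_interval(scores: Sequence[float],
                            n_resamples: int = 2000,
                            level: float = 0.95,
                            seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of the given scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = min(n_resamples - 1, int((1.0 + level) / 2.0 * n_resamples))
    return means[lo_idx], means[hi_idx]


# Three interactions versus thirty interactions with the same spread.
small_session = [0.2, 0.9, 0.6]
large_session = small_session * 10

print(bootstrap_mean_interval(small_session))
print(bootstrap_mean_interval(large_session))
```

The interval for the three-interaction session is much wider than the one for the thirty-interaction session, which is the uncertainty a point estimate alone would hide.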
Provide Clear Context
Include specific, actionable instructions.
Include Ground Truth
Provide `ground_truth_assistant` for better evaluation.
Weight Critical Interactions
If some QA pairs test more important context rules, give them higher weights.
Next Steps
- Statistical Modes: Frequentist vs Bayesian approaches
- Conversational Metric: Evaluate dialogue quality with Grice’s maxims
- Regulatory Metric: Compliance against a regulatory corpus