Conversational Metric
The Conversational metric evaluates dialogue quality using Grice’s Maxims, the principles of cooperative conversation that define effective communication. It accumulates scores across all interactions in a session and emits one session-level result, with optional uncertainty quantification via Bayesian mode.
Overview
The metric assesses seven dimensions:
| Dimension | Description | Scale |
|---|---|---|
| Quality Maxim | Truthfulness and evidence-based responses | 0-10 |
| Quantity Maxim | Appropriate amount of information | 0-10 |
| Relation Maxim | Relevance to the conversation | 0-10 |
| Manner Maxim | Clarity and organization | 0-10 |
| Memory | Ability to recall previous context | 0-10 |
| Language | Appropriateness of language style | 0-10 |
| Sensibleness | Overall coherence and logic | 0-10 |
Each dimension is reported as a ConversationalScore with a mean and, in Bayesian mode, an optional credible interval (ci_low, ci_high). The interactions list preserves per-QA scores for debugging.
Installation
Basic Usage
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class |
| model | BaseChatModel | LangChain-compatible judge model |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| use_structured_output | bool | False | Use LangChain structured output |
| bos_json_clause | str | "```json" | JSON block start marker |
| eos_json_clause | str | "```" | JSON block end marker |
| verbose | bool | False | Enable verbose logging |
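The start and end markers suggest that, when structured output is off, the judge's JSON is pulled out of a fenced block in its reply. The library's actual parsing code is not shown here, so the following is only a minimal sketch of that idea; `extract_json` and `FENCE` are hypothetical names, not part of the library.

```python
import json

# Three backticks, built programmatically so this example's own fence stays intact.
FENCE = "`" * 3


def extract_json(text: str, bos: str = FENCE + "json", eos: str = FENCE) -> dict:
    """Parse the JSON found between the first `bos` marker and the next `eos` marker.

    Assumption: this mirrors how bos_json_clause / eos_json_clause are used;
    the real implementation may differ.
    """
    start = text.find(bos)
    if start == -1:
        raise ValueError("start marker not found in judge output")
    start += len(bos)
    end = text.find(eos, start)
    if end == -1:
        raise ValueError("end marker not found in judge output")
    return json.loads(text[start:end])


# A made-up judge reply, for illustration only.
reply = "Here is my assessment:\n" + FENCE + 'json\n{"quality": 8, "memory": 6}\n' + FENCE
scores = extract_json(reply)
```

If a judge model wraps its JSON in different delimiters, changing the two markers would be the natural adjustment under this reading.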
Statistical Modes
- Frequentist: Returns a weighted mean per dimension. ci_low and ci_high are None.
- Bayesian: Returns the mean together with a credible interval (ci_low, ci_high).
Interaction Weights
Each Batch can carry an optional weight to control its contribution to the session aggregate:
| Case | Behavior |
|---|---|
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted interactions |
| No weights provided | Equal weights (1/n each) |
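The four cases in the table can be sketched as a small weight-resolution function followed by a weighted mean. This is an illustration of the documented behavior, not the library's code: `resolve_weights` and `weighted_mean` are hypothetical names, `None` stands in for a missing weight, and the handling of partial weights that already exceed 1.0 is an assumption because the table does not cover it.

```python
import math
import warnings
from typing import List, Optional, Sequence


def resolve_weights(weights: Sequence[Optional[float]]) -> List[float]:
    """Turn optional per-interaction weights into one weight per interaction."""
    n = len(weights)
    if n == 0:
        return []
    equal = [1.0 / n] * n
    given = [w for w in weights if w is not None]

    if not given:
        # No weights provided: equal weights (1/n each).
        return equal

    total = sum(given)
    if len(given) == n:
        if math.isclose(total, 1.0, abs_tol=1e-9):
            # All weights provided and they sum to 1.0: used as-is.
            return given
        # All weights provided but the sum is off: warn and fall back.
        warnings.warn("weights do not sum to 1.0; applying equal weights")
        return equal

    remaining = 1.0 - total
    if remaining < 0:
        # Assumption: not specified in the table; treated like an invalid sum.
        warnings.warn("partial weights exceed 1.0; applying equal weights")
        return equal
    # Some weights provided: split what is left among unweighted interactions.
    share = remaining / (n - len(given))
    return [share if w is None else w for w in weights]


def weighted_mean(scores: Sequence[float], weights: Sequence[Optional[float]]) -> float:
    """Session-level mean for one dimension using the resolved weights."""
    resolved = resolve_weights(weights)
    return sum(s * w for s, w in zip(scores, resolved))


# Three interactions, only the first one weighted explicitly.
session_mean = weighted_mean([8, 6, 4], [0.5, None, None])
```

Here the first interaction keeps its weight of 0.5 and the other two share the remaining 0.5 equally.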
Output Schema
ConversationalMetric
ConversationalScore
ConversationalInteraction
Understanding Grice’s Maxims
Quality Maxim
Be truthful: Don’t say what you believe to be false or lack evidence for.
Quantity Maxim
Be informative: Provide enough information, but not more than required.
Relation Maxim
Be relevant: Make your contribution relevant to the conversation.
Manner Maxim
Be clear: Avoid obscurity and ambiguity.
Complete Example
Score Interpretation
| Score Range | Interpretation |
|---|---|
| 8-10 | Excellent — high-quality dialogue |
| 6-8 | Good — meets expectations with minor issues |
| 4-6 | Moderate — noticeable quality issues |
| 2-4 | Poor — significant problems |
| 0-2 | Very poor — fails basic criteria |
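For reporting, the bands above can be mapped to labels with a small helper. This is a sketch with a hypothetical function name, `interpret`. Because adjacent ranges share their endpoints (for example, 8 appears in both 6-8 and 8-10), it assumes a boundary value belongs to the higher band; the source does not say which way ties go.

```python
def interpret(score: float) -> str:
    """Map a 0-10 score to the interpretation label from the table.

    Assumption: a score on a shared boundary (2, 4, 6, 8) goes to the higher band.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score >= 8:
        return "Excellent"
    if score >= 6:
        return "Good"
    if score >= 4:
        return "Moderate"
    if score >= 2:
        return "Poor"
    return "Very poor"


# Label a few example session-level means.
labels = [interpret(s) for s in (9.1, 6.5, 3.0)]
```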
Best Practices
Use Bayesian Mode for Small Sessions
If a session has fewer than about 5 to 10 interactions, the frequentist mean can be misleading. Bayesian mode reports a credible interval, which makes it clear when more data is needed.
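To see why a small session calls for an interval, the sketch below compares interval widths for a 3-interaction session and a 30-interaction session with the same score pattern. It uses a Bayesian bootstrap as a stand-in, since the method behind the library's Bayesian mode is not described in this section; `bayesian_bootstrap_interval` is a hypothetical helper and the scores are made up.

```python
import random
from typing import Sequence, Tuple


def bayesian_bootstrap_interval(
    scores: Sequence[float], draws: int = 4000, level: float = 0.95, seed: int = 0
) -> Tuple[float, float]:
    """Approximate interval for the mean via the Bayesian bootstrap.

    Stand-in technique, not the library's method: each draw reweights the
    observed scores with Dirichlet(1, ..., 1) weights and records the mean.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        # Normalized Exp(1) draws are a Dirichlet(1, ..., 1) sample.
        raw = [rng.expovariate(1.0) for _ in scores]
        total = sum(raw)
        means.append(sum(s * w for s, w in zip(scores, raw)) / total)
    means.sort()
    low_index = int((1 - level) / 2 * draws)
    high_index = int((1 + level) / 2 * draws) - 1
    return means[low_index], means[high_index]


small_session = [9, 4, 7]        # 3 interactions
large_session = [9, 4, 7] * 10   # 30 interactions, same pattern

small_low, small_high = bayesian_bootstrap_interval(small_session)
large_low, large_high = bayesian_bootstrap_interval(large_session)

# The small session's interval is wider, signalling that more data is needed.
small_width = small_high - small_low
large_width = large_high - large_low
```

Both sessions have the same plain mean, so a frequentist summary alone would not distinguish them; the interval width does.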
Include Observations
Add an observation to guide the judge on what to evaluate:
Test Multi-Turn Conversations
Include sequences that test memory:
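One way to draft such a sequence before wiring it into the metric is to write the turns down as plain data and mark which ones depend on earlier context. The `Turn` dataclass, its `recalls` field, and `count_memory_probes` are hypothetical helpers for organizing test data, not the library's interaction type.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    question: str
    answer: str
    # Index of the earlier turn this one depends on, if any (hypothetical field).
    recalls: Optional[int] = None


session: List[Turn] = [
    Turn("My name is Ana and I'm allergic to peanuts.",
         "Nice to meet you, Ana. I'll keep the peanut allergy in mind."),
    Turn("Can you suggest a snack?",
         "How about apple slices with sunflower seed butter? It is peanut-free.",
         recalls=0),
    Turn("What was my name again?",
         "Your name is Ana.",
         recalls=0),
]


def count_memory_probes(turns: List[Turn]) -> int:
    """Count turns that rely on earlier context, checking each reference points backwards."""
    probes = 0
    for index, turn in enumerate(turns):
        if turn.recalls is None:
            continue
        if not 0 <= turn.recalls < index:
            raise ValueError(f"turn {index} must recall an earlier turn")
        probes += 1
    return probes


probe_count = count_memory_probes(session)
```

A session with no memory probes would leave the Memory dimension with little to score, so a check like this can be useful when assembling test conversations.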
Next Steps
- Statistical Modes: Frequentist vs Bayesian, and when each matters
- Context Metric: Evaluate context alignment
- Humanity Metric: Emotional analysis of responses