Conversational Metric

The Conversational metric evaluates dialogue quality using Grice’s Maxims — principles of cooperative conversation that define effective communication. It accumulates scores across all interactions in a session and emits one session-level result, with optional uncertainty quantification via Bayesian mode.

Overview

The metric assesses seven dimensions:
| Dimension | Description | Scale |
| --- | --- | --- |
| Quality Maxim | Truthfulness and evidence-based responses | 0-10 |
| Quantity Maxim | Appropriate amount of information | 0-10 |
| Relation Maxim | Relevance to the conversation | 0-10 |
| Manner Maxim | Clarity and organization | 0-10 |
| Memory | Ability to recall previous context | 0-10 |
| Language | Appropriateness of language style | 0-10 |
| Sensibleness | Overall coherence and logic | 0-10 |
Each dimension produces a session-level ConversationalScore with a mean and optional credible interval (ci_low, ci_high) in Bayesian mode. The interactions list preserves per-QA scores for debugging.

Installation

uv add "alquimia-fair-forge[conversational]"
uv add langchain-groq  # Or your preferred LLM provider

Basic Usage

from fair_forge.metrics.conversational import Conversational
from langchain_groq import ChatGroq
from your_retriever import MyRetriever

judge_model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.0)

metrics = Conversational.run(MyRetriever, model=judge_model)

for metric in metrics:
    print(f"Session: {metric.session_id}  ({metric.n_interactions} interactions)")
    print(f"  Quality:      {metric.conversational_quality_maxim.mean:.1f}/10")
    print(f"  Memory:       {metric.conversational_memory.mean:.1f}/10")
    print(f"  Sensibleness: {metric.conversational_sensibleness.mean:.1f}/10")

    # Per-interaction breakdown
    for interaction in metric.interactions:
        print(f"  [{interaction.qa_id}] quality={interaction.quality_maxim:.1f} memory={interaction.memory:.1f}")

Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| `retriever` | `Type[Retriever]` | Data source class |
| `model` | `BaseChatModel` | LangChain-compatible judge model |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `statistical_mode` | `StatisticalMode` | `FrequentistMode()` | Statistical computation mode |
| `use_structured_output` | `bool` | `False` | Use LangChain structured output |
| `bos_json_clause` | `str` | `` ```json `` | JSON block start marker |
| `eos_json_clause` | `str` | `` ``` `` | JSON block end marker |
| `verbose` | `bool` | `False` | Enable verbose logging |
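When structured output is disabled, `bos_json_clause` and `eos_json_clause` delimit the JSON block the judge is expected to emit. As a purely illustrative sketch of that kind of extraction (the `extract_json_block` helper below is not part of the library):

```python
import json

def extract_json_block(text: str, bos: str, eos: str) -> dict:
    """Illustrative helper: parse the first JSON block delimited by bos/eos."""
    start = text.index(bos) + len(bos)
    return json.loads(text[start:text.index(eos, start)])

FENCE = "`" * 3  # the default markers are a ```json ... ``` fence
reply = f'Scores below:\n{FENCE}json\n{{"quality_maxim": 8.0, "memory": 7.5}}\n{FENCE}'
scores = extract_json_block(reply, bos=FENCE + "json", eos=FENCE)
# scores == {"quality_maxim": 8.0, "memory": 7.5}
```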

Statistical Modes

Frequentist mode (the default) returns a weighted mean per dimension; ci_low and ci_high are None. Bayesian mode additionally reports a credible interval for each dimension.
metric.conversational_quality_maxim.mean   # 7.8
metric.conversational_quality_maxim.ci_low  # None
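In Bayesian mode each score also carries `ci_low` and `ci_high`. The library's actual posterior model is not documented here; as a rough, purely illustrative stand-in, a bootstrap-style Monte Carlo interval over per-interaction scores can be computed like this (the `mc_samples` and `ci_level` names mirror the `BayesianMode` parameters):

```python
import random
import statistics

def mc_credible_interval(scores, mc_samples=5000, ci_level=0.95, seed=0):
    """Sketch only: resample interaction scores and take percentiles of the
    resampled means as an interval. Not the library's actual Bayesian model."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(mc_samples)
    )
    lo = means[int((1 - ci_level) / 2 * mc_samples)]
    hi = means[int((1 + ci_level) / 2 * mc_samples) - 1]
    return statistics.fmean(scores), lo, hi

mean, ci_low, ci_high = mc_credible_interval([7.0, 8.5, 8.0])
# ci_low <= mean <= ci_high, with the interval widening as data gets scarcer
```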

Interaction Weights

Each Batch can carry an optional weight to control its contribution to the session aggregate:
Batch(qa_id="q1", query="...", assistant="...", ground_truth_assistant="...", weight=0.5),
Batch(qa_id="q2", query="...", assistant="...", ground_truth_assistant="...", weight=0.3),
Batch(qa_id="q3", query="...", assistant="...", ground_truth_assistant="...", weight=0.2),
| Case | Behavior |
| --- | --- |
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted interactions |
| No weights provided | Equal weights (1/n each) |

Output Schema

ConversationalMetric

class ConversationalMetric(BaseMetric):
    session_id: str
    assistant_id: str
    n_interactions: int                                    # Number of interactions evaluated
    conversational_memory: ConversationalScore
    conversational_language: ConversationalScore
    conversational_quality_maxim: ConversationalScore
    conversational_quantity_maxim: ConversationalScore
    conversational_relation_maxim: ConversationalScore
    conversational_manner_maxim: ConversationalScore
    conversational_sensibleness: ConversationalScore
    interactions: list[ConversationalInteraction]          # Per-QA scores

ConversationalScore

class ConversationalScore(BaseModel):
    mean: float           # Session-level weighted mean
    ci_low: float | None  # Lower credible bound — Bayesian mode only
    ci_high: float | None # Upper credible bound — Bayesian mode only

ConversationalInteraction

class ConversationalInteraction(BaseModel):
    qa_id: str
    memory: float
    language: float
    quality_maxim: float
    quantity_maxim: float
    relation_maxim: float
    manner_maxim: float
    sensibleness: float

Understanding Grice’s Maxims

Quality Maxim

Be truthful: Don’t say what you believe to be false or lack evidence for.
High (8-10): "The capital of France is Paris." (Verifiable fact)
Low  (0-4):  "France doesn't have a capital." (False)

Quantity Maxim

Be informative: Provide enough information, but not more than required.
High (8-10): "Paris is the capital of France."
Low  (0-4):  "Paris." (Too brief) or a 3-paragraph essay (Too much)

Relation Maxim

Be relevant: Make your contribution relevant to the conversation.
High (8-10): Q: "What's your return policy?" A: "Returns accepted within 30 days."
Low  (0-4):  Q: "What's your return policy?" A: "Our company was founded in 1998."

Manner Maxim

Be clear: Avoid obscurity and ambiguity.
High (8-10): "You can return items within 30 days at any store location."
Low  (0-4):  "So basically, if you want to, you could possibly maybe return the thing..."

Complete Example

import os
from fair_forge.metrics.conversational import Conversational
from fair_forge.statistical import BayesianMode
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch
from langchain_groq import ChatGroq

class ConversationalRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="conv-eval-001",
                assistant_id="support-bot",
                language="english",
                context="You are a helpful, professional customer service assistant.",
                conversation=[
                    Batch(
                        qa_id="q1",
                        query="Hi, I need help with my order.",
                        assistant="Hello! I'd be happy to help. Could you share your order number?",
                        ground_truth_assistant="Greet and ask for order number.",
                        observation="Opening interaction - should be professional and helpful",
                    ),
                    Batch(
                        qa_id="q2",
                        query="It's ORDER-12345. I haven't received it yet.",
                        assistant="Thank you! ORDER-12345 was shipped Monday, expected Friday.",
                        ground_truth_assistant="Find order, provide shipping status and ETA.",
                    ),
                    Batch(
                        qa_id="q3",
                        query="Can you change the delivery address?",
                        assistant="For security, please confirm your email address first.",
                        ground_truth_assistant="Offer to help, verify identity first.",
                    ),
                ]
            )
        ]

judge = ChatGroq(model="llama-3.3-70b-versatile", api_key=os.getenv("GROQ_API_KEY"), temperature=0.0)

metrics = Conversational.run(
    ConversationalRetriever,
    model=judge,
    statistical_mode=BayesianMode(mc_samples=5000, ci_level=0.95),
    use_structured_output=True,
)

for metric in metrics:
    print(f"Session: {metric.session_id}  ({metric.n_interactions} interactions)")
    print()

    dimensions = [
        ("Quality",      metric.conversational_quality_maxim),
        ("Quantity",     metric.conversational_quantity_maxim),
        ("Relation",     metric.conversational_relation_maxim),
        ("Manner",       metric.conversational_manner_maxim),
        ("Memory",       metric.conversational_memory),
        ("Language",     metric.conversational_language),
        ("Sensibleness", metric.conversational_sensibleness),
    ]

    for name, score in dimensions:
        ci = f"  [{score.ci_low:.1f}, {score.ci_high:.1f}]" if score.ci_low is not None else ""
        print(f"  {name:<14} {score.mean:.1f}/10{ci}")

    print()
    print("  Per-interaction breakdown:")
    for interaction in metric.interactions:
        print(f"    [{interaction.qa_id}] quality={interaction.quality_maxim:.1f}  memory={interaction.memory:.1f}  manner={interaction.manner_maxim:.1f}")

Score Interpretation

| Score Range | Interpretation |
| --- | --- |
| 8-10 | Excellent — high-quality dialogue |
| 6-8 | Good — meets expectations with minor issues |
| 4-6 | Moderate — noticeable quality issues |
| 2-4 | Poor — significant problems |
| 0-2 | Very poor — fails basic criteria |
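When reporting results programmatically, the bands above can be mapped with a small helper (illustrative; the ranges overlap at their boundaries, so the half-open banding here is an assumption):

```python
def interpret(score: float) -> str:
    """Map a 0-10 dimension score to its interpretation band (sketch).
    Boundary values are assigned to the higher band by assumption."""
    bands = [(8, "Excellent"), (6, "Good"), (4, "Moderate"), (2, "Poor")]
    for floor, label in bands:
        if score >= floor:
            return label
    return "Very poor"

interpret(7.8)  # → "Good"
```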

Best Practices

If a session has fewer than 5-10 interactions, the frequentist mean can be misleading. Bayesian mode reports a credible interval, making it clear when more data is needed.
Add an observation field to guide the judge on what to evaluate:
Batch(
    qa_id="q2",
    observation="Follow-up — assistant should remember the order number from q1",
)
Include sequences that test memory:
Batch(qa_id="q1", query="My name is John..."),
Batch(qa_id="q2", query="What's my name?"),  # Should remember

Next Steps

- Statistical Modes: Frequentist vs Bayesian, and when each matters
- Context Metric: evaluate context alignment
- Humanity Metric: emotional analysis of responses