Metrics Overview

Fair Forge provides six specialized metrics for comprehensive AI evaluation. Each metric focuses on a different aspect of AI behavior and quality.

Available Metrics

Toxicity

Measures toxic language with clustering and demographic group profiling using the DIDT framework.

Bias

Detects bias across protected attributes (gender, race, religion, nationality, sexual orientation).

Context

Evaluates how well responses align with provided context and instructions.

Conversational

Evaluates dialogue quality using Grice’s Maxims (Quality, Quantity, Relation, Manner).

Humanity

Analyzes emotional depth and human-likeness using the NRC Emotion Lexicon.

BestOf

Tournament-style evaluation to compare multiple assistants head-to-head.

Comparison Table

| Metric | Purpose | Output Type | LLM Required |
| --- | --- | --- | --- |
| Toxicity | Detect toxic language patterns | Per-session metrics | No |
| Bias | Identify biased responses | Per-session metrics | Yes (Guardian) |
| Context | Measure context alignment | Per-interaction scores | Yes (Judge) |
| Conversational | Evaluate dialogue quality | Per-interaction scores | Yes (Judge) |
| Humanity | Analyze emotional expression | Per-interaction scores | No |
| BestOf | Compare multiple assistants | Tournament results | Yes (Judge) |

Common Usage Pattern

All metrics follow the same usage pattern:
from fair_forge.metrics.<metric> import <Metric>
from fair_forge.core.retriever import Retriever

# 1. Define your retriever
class MyRetriever(Retriever):
    def load_dataset(self):
        # Return list[Dataset]
        pass

# 2. Run the metric
results = <Metric>.run(
    MyRetriever,
    **metric_specific_parameters,
    verbose=True,
)

# 3. Analyze results
for result in results:
    # Process metric-specific output
    pass

Metric Categories

Lexicon-Based Metrics

These metrics use predefined lexicons and don’t require external LLMs:
  • Toxicity: Uses Hurtlex toxicity lexicon + HDBSCAN clustering
  • Humanity: Uses NRC Emotion Lexicon for emotion detection
# No LLM required
from fair_forge.metrics.toxicity import Toxicity

results = Toxicity.run(
    MyRetriever,
    group_prototypes={...},
)
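
Humanity follows the same call pattern; the sketch below assumes it needs no metric-specific parameters beyond the retriever:
# No LLM required; scores emotions with the NRC Emotion Lexicon
from fair_forge.metrics.humanity import Humanity

results = Humanity.run(
    MyRetriever,
    verbose=True,
)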

LLM-Judge Metrics

These metrics use an LLM as a judge to evaluate responses:
  • Context: Evaluates context alignment
  • Conversational: Evaluates dialogue quality
  • BestOf: Compares assistants in tournaments
# Requires LangChain-compatible model
from fair_forge.metrics.context import Context
from langchain_groq import ChatGroq

judge = ChatGroq(model="llama-3.3-70b-versatile", api_key="...")

results = Context.run(
    MyRetriever,
    model=judge,
    use_structured_output=True,
)
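
Conversational (and BestOf) are invoked the same way; here is a minimal sketch for Conversational, assuming it accepts the same judge parameters as Context:
from fair_forge.metrics.conversational import Conversational

results = Conversational.run(
    MyRetriever,
    model=judge,  # reuse the ChatGroq judge defined above
    use_structured_output=True,
)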

Guardian-Based Metrics

These metrics use specialized guardian models for detection:
  • Bias: Uses LlamaGuard or IBMGranite for bias detection
from fair_forge.metrics.bias import Bias
from fair_forge.guardians import LLamaGuard

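# guardian_config is assumed to hold the guardian model's configuration
# (endpoint, credentials, etc.); define it for your deployment before this call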
results = Bias.run(
    MyRetriever,
    guardian=LLamaGuard,
    config=guardian_config,
)

Output Schemas

Each metric returns a list of result objects; the schema depends on the metric. For example, the Toxicity metric returns ToxicityMetric objects:
ToxicityMetric:
  session_id: str
  assistant_id: str
  cluster_profiling: dict[float, float]  # cluster_id -> toxicity_score
  group_profiling: GroupProfiling | None
  assistant_space: AssistantSpace
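
A sketch of consuming these objects, using only the fields listed above (the exact access pattern may differ in your version):
# `results` holds ToxicityMetric objects returned by Toxicity.run(...)
for metric in results:
    print(metric.session_id, metric.assistant_id)
    for cluster_id, toxicity_score in metric.cluster_profiling.items():
        print(f"  cluster {cluster_id}: toxicity {toxicity_score:.3f}")
    if metric.group_profiling is not None:
        print("  demographic group profiling is available for this session")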

Installation Requirements

Each metric has specific dependencies:
# Toxicity (clustering + embeddings)
uv add "alquimia-fair-forge[toxicity]"

# Bias (guardian models)
uv add "alquimia-fair-forge[bias]"

# Context (LLM judge)
uv add "alquimia-fair-forge[context]"

# Conversational (LLM judge)
uv add "alquimia-fair-forge[conversational]"

# Humanity (included in core)
uv add "alquimia-fair-forge[humanity]"

# BestOf (LLM judge)
uv add "alquimia-fair-forge[bestof]"

# All metrics
uv add "alquimia-fair-forge[all]"

Choosing a Metric

Use Toxicity for detecting toxic language patterns and demographic targeting.
Use Bias for detecting discrimination across protected attributes.
Use Context for measuring alignment with system context.
Use Conversational for assessing dialogue using Grice’s Maxims.
Use Humanity for analyzing emotional depth and human-likeness.
Use BestOf for tournament-style head-to-head comparisons.

Next Steps

Toxicity

Learn about toxicity detection

Bias

Learn about bias detection

Context

Learn about context evaluation