
Metrics Overview

Fair Forge provides six specialized metrics for comprehensive AI evaluation. Each metric focuses on a different aspect of AI behavior and quality.

Available Metrics

Comparison Table

| Metric | Purpose | Output Type | LLM Required |
| --- | --- | --- | --- |
| Toxicity | Detect toxic language patterns | Per-session metrics | No |
| Bias | Identify biased responses | Per-session metrics | Yes (Guardian) |
| Context | Measure context alignment | Per-interaction scores | Yes (Judge) |
| Conversational | Evaluate dialogue quality | Per-interaction scores | Yes (Judge) |
| Humanity | Analyze emotional expression | Per-interaction scores | No |
| BestOf | Compare multiple assistants | Tournament results | Yes (Judge) |

Common Usage Pattern

All metrics follow the same usage pattern:
from fair_forge.metrics.<metric> import <Metric>
from fair_forge.core.retriever import Retriever

# 1. Define your retriever
class MyRetriever(Retriever):
    def load_dataset(self):
        # Return list[Dataset]
        pass

# 2. Run the metric
results = <Metric>.run(
    MyRetriever,
    **metric_specific_parameters,
    verbose=True,
)

# 3. Analyze results
for result in results:
    # Process metric-specific output
    pass
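
As a rough illustration, a concrete retriever might load sessions from a local JSON file. This is only a minimal sketch: the Dataset import path and constructor fields are assumptions, so adapt them to the actual Fair Forge schemas.
import json

from fair_forge.core.retriever import Retriever
from fair_forge.schemas import Dataset  # hypothetical import path; check the library

class JSONFileRetriever(Retriever):
    """Illustrative retriever that loads evaluation sessions from a JSON file."""

    def load_dataset(self) -> list[Dataset]:
        with open("sessions.json") as f:
            raw_sessions = json.load(f)
        # The keys inside each session dict are assumptions; map them to the real Dataset fields.
        return [Dataset(**session) for session in raw_sessions]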

Metric Categories

Lexicon-Based Metrics

These metrics use predefined lexicons and don’t require external LLMs:
  • Toxicity: Uses Hurtlex toxicity lexicon + HDBSCAN clustering
  • Humanity: Uses NRC Emotion Lexicon for emotion detection
# No LLM required
from fair_forge.metrics.toxicity import Toxicity

results = Toxicity.run(
    MyRetriever,
    group_prototypes={...},
)
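
Humanity follows the same pattern. Because it is lexicon-based, the call below assumes no metric-specific parameters are needed beyond the retriever; treat it as a sketch rather than the definitive signature.
from fair_forge.metrics.humanity import Humanity

results = Humanity.run(
    MyRetriever,
    verbose=True,
)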

LLM-Judge Metrics

These metrics use an LLM as a judge to evaluate responses:
  • Context: Evaluates context alignment
  • Conversational: Evaluates dialogue quality
  • BestOf: Compares assistants in tournaments
# Requires LangChain-compatible model
from fair_forge.metrics.context import Context
from langchain_groq import ChatGroq

judge = ChatGroq(model="llama-3.3-70b-versatile", api_key="...")

results = Context.run(
    MyRetriever,
    model=judge,
    use_structured_output=True,
)
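
The same judge can be reused for the other LLM-judge metrics. The Conversational call below assumes its parameters mirror Context; check the metric's own page for the exact signature.
from fair_forge.metrics.conversational import Conversational

results = Conversational.run(
    MyRetriever,
    model=judge,
    use_structured_output=True,
)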

Guardian-Based Metrics

These metrics use specialized guardian models for detection:
  • Bias: Uses LlamaGuard or IBMGranite for bias detection
from fair_forge.metrics.bias import Bias
from fair_forge.guardians import LLamaGuard

results = Bias.run(
    MyRetriever,
    guardian=LLamaGuard,
    config=guardian_config,
)
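
Here guardian_config stands in for the guardian's configuration object (for example, the model endpoint and credentials); see the Bias metric documentation for the exact fields it expects.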

Output Schemas

Each metric returns a list of result objects whose schema depends on the metric. For example, the Toxicity metric returns ToxicityMetric objects:
ToxicityMetric:
  session_id: str
  assistant_id: str
  cluster_profiling: dict[float, float]  # cluster_id -> toxicity_score
  group_profiling: GroupProfiling | None
  assistant_space: AssistantSpace
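
Based on this schema, per-cluster toxicity scores can be inspected directly (a sketch; results is the list returned by Toxicity.run):
for result in results:
    print(f"session {result.session_id} / assistant {result.assistant_id}")
    # cluster_profiling maps cluster_id -> toxicity_score
    for cluster_id, toxicity_score in result.cluster_profiling.items():
        print(f"  cluster {cluster_id}: toxicity {toxicity_score:.3f}")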

Installation Requirements

Each metric has specific dependencies:
# Toxicity (clustering + embeddings)
uv pip install "alquimia-fair-forge[toxicity]"

# Bias (guardian models)
uv pip install "alquimia-fair-forge[bias]"

# Context (LLM judge)
uv pip install "alquimia-fair-forge[context]"

# Conversational (LLM judge)
uv pip install "alquimia-fair-forge[conversational]"

# Humanity (included in core)
uv pip install "alquimia-fair-forge[humanity]"

# BestOf (LLM judge)
uv pip install "alquimia-fair-forge[bestof]"

# All metrics
uv pip install "alquimia-fair-forge[all]"
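
Extras can also be combined in a single command when you only need a subset of metrics:
# Example: toxicity and bias only
uv pip install "alquimia-fair-forge[toxicity,bias]"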

Choosing a Metric

  • Use Toxicity for detecting toxic language patterns and demographic targeting.
  • Use Bias for detecting discrimination across protected attributes.
  • Use Context for measuring alignment with system context.
  • Use Conversational for assessing dialogue using Grice’s Maxims.
  • Use Humanity for analyzing emotional depth and human-likeness.
  • Use BestOf for tournament-style head-to-head comparisons.

Next Steps