# Metrics Overview
Fair Forge provides six specialized metrics for comprehensive AI evaluation. Each metric focuses on a different aspect of AI behavior and quality.

## Available Metrics
- **Toxicity**: Measures toxic language with clustering and demographic group profiling using the DIDT framework.
- **Bias**: Detects bias across protected attributes (gender, race, religion, nationality, sexual orientation).
- **Context**: Evaluates how well responses align with provided context and instructions.
- **Conversational**: Evaluates dialogue quality using Grice’s Maxims (Quality, Quantity, Relation, Manner).
- **Humanity**: Analyzes emotional depth and human-likeness using the NRC Emotion Lexicon.
- **BestOf**: Compares multiple assistants head-to-head with tournament-style evaluation.
## Comparison Table
| Metric | Purpose | Output Type | LLM Required |
|---|---|---|---|
| Toxicity | Detect toxic language patterns | Per-session metrics | No |
| Bias | Identify biased responses | Per-session metrics | Yes (Guardian) |
| Context | Measure context alignment | Per-interaction scores | Yes (Judge) |
| Conversational | Evaluate dialogue quality | Per-interaction scores | Yes (Judge) |
| Humanity | Analyze emotional expression | Per-interaction scores | No |
| BestOf | Compare multiple assistants | Tournament results | Yes (Judge) |
## Common Usage Pattern
All metrics follow the same usage pattern, sketched below.
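The import path, the `load_sessions` helper, and the `compute` method in this sketch are assumptions for illustration rather than the confirmed Fair Forge API; check each metric's page for the exact names.

```python
# Hypothetical sketch of the shared pattern; the import path,
# the load_sessions helper, and the compute() method are
# assumptions, not the confirmed Fair Forge API.
from fair_forge.metrics import Toxicity

# Load or build the conversation sessions to be scored.
sessions = load_sessions("conversations.json")  # hypothetical helper

metric = Toxicity()                 # 1. instantiate the metric
results = metric.compute(sessions)  # 2. run it over the sessions

for result in results:              # 3. inspect the result objects
    print(result)
```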
## Metric Categories

### Lexicon-Based Metrics
These metrics use predefined lexicons and don’t require an external LLM (a toy sketch of lexicon scoring follows the list):

- **Toxicity**: Uses the Hurtlex toxicity lexicon plus HDBSCAN clustering
- **Humanity**: Uses the NRC Emotion Lexicon for emotion detection
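As a toy illustration of how lexicon-based scoring works in general (this is not Fair Forge's implementation, and the lexicon here is invented), a metric of this kind tokenizes a response and counts hits against a category word list:

```python
# Toy illustration of lexicon-based scoring; not Fair Forge's code.
# Real lexicons (Hurtlex, NRC Emotion Lexicon) map words to
# categories such as toxicity types or emotions.
toy_lexicon = {
    "hate": {"anger", "disgust"},
    "happy": {"joy"},
    "afraid": {"fear"},
}

def emotion_counts(text: str) -> dict[str, int]:
    """Count lexicon hits per emotion category in a response."""
    counts: dict[str, int] = {}
    for token in text.lower().split():
        for emotion in toy_lexicon.get(token, ()):
            counts[emotion] = counts.get(emotion, 0) + 1
    return counts

print(emotion_counts("I am happy but a little afraid"))
# {'joy': 1, 'fear': 1}
```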
### LLM-Judge Metrics
These metrics use an LLM as a judge to evaluate responses (a generic judge sketch follows the list):

- **Context**: Evaluates context alignment
- **Conversational**: Evaluates dialogue quality
- **BestOf**: Compares assistants in tournaments
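In general terms, the judge pattern sends each interaction to an LLM together with a rubric and parses a score from the reply. A minimal sketch follows; the prompt wording, the 0-to-1 scale, and the `ask_llm` callable are assumptions, not Fair Forge's internals:

```python
# Generic LLM-as-judge sketch; not Fair Forge's internal prompt.
JUDGE_PROMPT = """You are an impartial judge. Rate how well the
assistant's answer follows the provided context, from 0 (ignores
the context) to 1 (fully grounded in it). Reply with the number only.

Context: {context}
Question: {question}
Answer: {answer}
"""

def judge_context_alignment(ask_llm, context, question, answer) -> float:
    """Score one interaction with an LLM judge.

    `ask_llm` is any callable that sends a prompt to an LLM and
    returns its text reply (an assumption for this sketch).
    """
    reply = ask_llm(JUDGE_PROMPT.format(
        context=context, question=question, answer=answer,
    ))
    return float(reply.strip())
```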
### Guardian-Based Metrics
These metrics use specialized guardian models for detection (a rough sketch follows the list):

- **Bias**: Uses LlamaGuard or IBM Granite for bias detection
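Conceptually, the metric forwards each assistant response to the guardian model and records its verdict per protected attribute. A rough sketch, where the `guardian` callable and the category labels are assumptions rather than Fair Forge's actual interface:

```python
# Generic guardian-model sketch; the callable and category labels
# are assumptions, not Fair Forge's actual interface to LlamaGuard
# or IBM Granite.
PROTECTED_ATTRIBUTES = [
    "gender", "race", "religion", "nationality", "sexual orientation",
]

def flag_bias(guardian, response: str) -> dict[str, bool]:
    """Ask a guardian classifier whether a response is biased
    with respect to each protected attribute."""
    return {
        attribute: guardian(response, category=f"bias/{attribute}")
        for attribute in PROTECTED_ATTRIBUTES
    }
```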
## Output Schemas
Each metric returns a list of result objects. The schema depends on the metric (a consumption sketch follows the list):

- Toxicity
- Bias
- Context
- Conversational
- Humanity
- BestOf
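Result objects are consumed by iterating the returned list. Here is a sketch under the assumption that each object exposes score-like fields; the attribute names are hypothetical, so consult each metric's schema page:

```python
# Hypothetical result consumption; `session_id` and `score` are
# placeholder attribute names, not the documented schema fields.
results = metric.compute(sessions)  # as in the common usage pattern

# Per-session metrics (Toxicity, Bias) yield one object per
# session; per-interaction metrics (Context, Conversational,
# Humanity) yield one per interaction.
for result in results:
    print(result.session_id, result.score)
```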
## Installation Requirements
Each metric has its own installation dependencies.
## Choosing a Metric

**I want to detect harmful language**
Use Toxicity for detecting toxic language patterns and demographic targeting.

**I want to check for bias**
Use Bias for detecting discrimination across protected attributes.

**I want to ensure responses follow instructions**
Use Context for measuring alignment with system context.

**I want to evaluate conversation quality**
Use Conversational for assessing dialogue using Grice’s Maxims.

**I want to measure emotional expression**
Use Humanity for analyzing emotional depth and human-likeness.

**I want to compare multiple assistants**
Use BestOf for tournament-style head-to-head comparisons.