Welcome to Fair Forge
Fair Forge is a performance-measurement library developed by Alquimia AI for evaluating AI models and assistants. It provides comprehensive metrics for fairness, toxicity, bias, conversational quality, and more.

Why Fair Forge?
As AI systems become increasingly integrated into our daily lives, ensuring they behave fairly, safely, and effectively is paramount. Fair Forge provides:

- Fairness Evaluation: Detect and measure bias across protected attributes
- Toxicity Analysis: Identify toxic language patterns with demographic profiling
- Conversational Quality: Evaluate dialogue using Grice’s Maxims
- Context Awareness: Measure how well responses align with provided context
- Emotional Intelligence: Analyze emotional depth and human-likeness
- Model Comparison: Run tournament-style evaluations between multiple assistants
Key Features
- Multiple Metrics: Six specialized metrics for comprehensive AI evaluation
- Statistical Modes: Choose between Frequentist and Bayesian statistical approaches
- Test Generation: Generate synthetic test datasets from your documentation
- Flexible Runners: Execute tests against any LLM or custom AI system
Quick Example
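A minimal end-to-end sketch, assuming the package is importable as `fair_forge`. The `Toxicity` class, its constructor arguments, and `run()` are illustrative placeholders rather than confirmed API; `Retriever.load_dataset()` and the `metrics` collection are taken from the architecture notes below.

```python
# Quick-start sketch. The import paths, the Toxicity class, its
# constructor, and run() are illustrative assumptions, not confirmed
# Fair Forge API; Retriever.load_dataset() and the metrics collection
# are the only names taken from this page.
from fair_forge import Retriever          # assumed import path
from fair_forge.metrics import Toxicity   # hypothetical metric class


class MyRetriever(Retriever):
    """Supplies the conversations under evaluation."""

    def load_dataset(self):
        # Must return list[Dataset]; replace the empty list with real
        # data. The Dataset schema is defined by Fair Forge.
        return []


# Evaluate every conversation the retriever yields, then read the scores.
metric = Toxicity(retriever=MyRetriever())  # hypothetical constructor
metric.run()                                # hypothetical entry point
print(metric.metrics)                       # per-conversation scores
```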
Architecture Overview
Fair Forge follows a simple yet powerful architecture:

1. Load Data: `Retriever.load_dataset()` returns `list[Dataset]`
2. Process: `FairForge._process()` iterates over the datasets
3. Evaluate: `Metric.batch()` processes each conversation
4. Results: scores are collected in `self.metrics`

Custom metrics subclass the `FairForge` base class and implement the `batch()` method to process conversation batches. Users provide data through custom `Retriever` implementations.
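The sketch below shows that extension point under stated assumptions: only the `FairForge` base class, the `batch()` method, and `self.metrics` are named on this page, so the `batch()` signature and the conversation fields used here (`conv.turns`, `turn.assistant`, `conv.id`) are illustrative, not the library's confirmed schema.

```python
# Extension sketch: subclass FairForge and implement batch(). The
# batch() signature and the conversation fields used below are
# illustrative assumptions; only FairForge, batch(), and self.metrics
# are named in the documentation above.
from fair_forge import FairForge  # assumed import path


class ResponseLength(FairForge):
    """Toy metric: mean assistant-reply length per conversation."""

    def batch(self, conversations):
        for conv in conversations:
            # Average the length of the assistant's replies in this
            # conversation (turn structure assumed for illustration).
            lengths = [len(turn.assistant) for turn in conv.turns]
            score = sum(lengths) / len(lengths) if lengths else 0.0
            # Per the flow above, _process() accumulates results here.
            self.metrics.append({"conversation": conv.id, "score": score})
```

Pairing a metric like this with a custom `Retriever` is, per the architecture above, all the wiring the base class needs: `_process()` feeds each dataset's conversations to `batch()` and gathers the results in `metrics`.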