Welcome to Fair Forge

Fair Forge is a performance-measurement library developed by Alquimia AI for evaluating AI models and assistants. It provides comprehensive metrics for fairness, toxicity, bias, conversational quality, and more.

Why Fair Forge?

As AI systems become increasingly integrated into our daily lives, ensuring they behave fairly, safely, and effectively is paramount. Fair Forge provides:
  • Fairness Evaluation: Detect and measure bias across protected attributes
  • Toxicity Analysis: Identify toxic language patterns with demographic profiling
  • Conversational Quality: Evaluate dialogue using Grice’s Maxims
  • Context Awareness: Measure how well responses align with provided context
  • Emotional Intelligence: Analyze emotional depth and human-likeness
  • Model Comparison: Run tournament-style evaluations between multiple assistants

Key Features

Multiple Metrics

Six specialized metrics for comprehensive AI evaluation

Statistical Modes

Choose between Frequentist and Bayesian statistical approaches

Test Generation

Generate synthetic test datasets from your documentation

Flexible Runners

Execute tests against any LLM or custom AI system

Quick Example

from fair_forge.metrics.toxicity import Toxicity
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch

# Define a custom retriever to load your data
class MyRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="session-1",
                assistant_id="my-assistant",
                language="english",
                context="",
                conversation=[
                    Batch(
                        qa_id="q1",
                        query="Tell me about AI safety",
                        assistant="AI safety is important...",
                    )
                ]
            )
        ]

# Run the toxicity metric
results = Toxicity.run(
    MyRetriever,
    group_prototypes={
        "gender": ["women", "men", "female", "male"],
        "race": ["Asian", "African", "European"],
    },
    verbose=True,
)

# Analyze results
for metric in results:
    print(f"DIDT Score: {metric.group_profiling.frequentist.DIDT}")

Architecture Overview

Fair Forge follows a simple yet powerful architecture:
  1. Load Data: Retriever.load_dataset() returns list[Dataset]
  2. Process: FairForge._process() iterates datasets
  3. Evaluate: Metric.batch() processes each conversation
  4. Results: Collected in self.metrics

All metrics inherit from the FairForge base class and implement the batch() method to process conversation batches. Users provide data through custom Retriever implementations.
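To make the pattern concrete, here is a minimal, self-contained sketch of that flow. It does not use the real FairForge base class; the class names, the batch() signature, and the toy metric below are simplified assumptions for illustration only.

```python
# Illustrative sketch of the metric pattern described above.
# NOTE: this is NOT the real FairForge API; the base class,
# batch() signature, and conversation shape are simplified here.

class MetricBase:
    """Simplified stand-in for a metric base class."""

    def __init__(self):
        self.metrics = []  # results are collected here, per the docs

    def run(self, conversations):
        # Mirrors the Load Data -> Process -> Evaluate -> Results flow:
        # iterate each conversation and evaluate it with batch().
        for conversation in conversations:
            self.metrics.append(self.batch(conversation))
        return self.metrics

    def batch(self, conversation):
        # Subclasses implement the actual per-conversation evaluation.
        raise NotImplementedError


class AverageResponseLength(MetricBase):
    """Toy metric: mean length of assistant responses in a conversation."""

    def batch(self, conversation):
        lengths = [len(turn["assistant"]) for turn in conversation]
        return sum(lengths) / len(lengths)


# Usage: one conversation with a single assistant turn.
metric = AverageResponseLength()
results = metric.run([[{"assistant": "AI safety is important..."}]])
print(results)
```

The real library adds data loading through Retriever subclasses and richer result schemas, but the core contract is the same: subclass the base, implement batch(), and collect per-conversation results.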

Next Steps