
Architecture

Fair Forge follows a simple yet powerful architecture designed for extensibility and ease of use.

Overview

Data Flow

The core data flow in Fair Forge is:
  1. Load data: Retriever.load_dataset() returns list[Dataset]
  2. Process datasets: FairForge._process() iterates through the datasets
  3. Compute metrics: each metric's batch() implementation processes every conversation
  4. Collect results: results are stored in self.metrics
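The four steps above can be sketched end to end. Note that everything in this snippet is a simplified stand-in: the names mirror Fair Forge's classes, but the real implementations live in fair_forge.core and fair_forge.schemas.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field

@dataclass
class Batch:                         # stand-in: one Q&A interaction
    qa_id: str
    query: str
    assistant: str

@dataclass
class Dataset:                       # stand-in: one conversation session
    session_id: str
    assistant_id: str
    context: str
    conversation: list[Batch] = field(default_factory=list)

class Retriever(ABC):                # stand-in for fair_forge.core.retriever.Retriever
    @abstractmethod
    def load_dataset(self) -> list[Dataset]: ...

class InMemoryRetriever(Retriever):  # step 1: load data
    def load_dataset(self) -> list[Dataset]:
        return [Dataset("s1", "bot", "Be helpful.",
                        [Batch("q1", "Hi", "Hello!")])]

class TurnCounter:                   # stand-in for a FairForge metric
    def __init__(self, retriever):
        self.retriever = retriever()
        self.metrics: list[dict] = []

    def _process(self) -> None:      # step 2: iterate through datasets
        for ds in self.retriever.load_dataset():
            self.batch(ds.session_id, ds.context,
                       ds.assistant_id, ds.conversation)

    def batch(self, session_id, context, assistant_id, batch) -> None:
        # step 3: compute a metric for each conversation
        self.metrics.append({"session_id": session_id, "turns": len(batch)})

counter = TurnCounter(InMemoryRetriever)
counter._process()
print(counter.metrics)               # step 4: results collected in self.metrics
```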

Core Components

FairForge Base Class

All metrics inherit from FairForge (fair_forge/core/base.py):
from abc import ABC, abstractmethod
from typing import Type
from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Batch

class FairForge(ABC):
    def __init__(self, retriever: Type[Retriever], verbose: bool = False, **kwargs):
        self.retriever = retriever(**kwargs)
        self.metrics = []
        self.verbose = verbose

    @abstractmethod
    def batch(self, session_id: str, context: str, assistant_id: str,
              batch: list[Batch], language: str | None) -> None:
        """Process a batch of conversations. Implemented by each metric."""
        pass

    @classmethod
    def run(cls, retriever: Type[Retriever], **kwargs) -> list:
        """One-shot execution: instantiate and process."""
        instance = cls(retriever, **kwargs)
        instance._process()
        return instance.metrics

Retriever

Abstract base class for data loading:
from abc import ABC, abstractmethod
from fair_forge.schemas.common import Dataset

class Retriever(ABC):
    def __init__(self, **kwargs):
        pass

    @abstractmethod
    def load_dataset(self) -> list[Dataset]:
        """Load and return datasets for evaluation."""
        pass
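As a concrete illustration, a custom retriever might load sessions from JSON. The sketch below is hypothetical, not part of Fair Forge: it returns plain dicts to stay self-contained, whereas a real implementation would subclass the library's Retriever and build Dataset models.

```python
import json
from abc import ABC, abstractmethod

# Illustrative payload; field names mirror the Dataset/Batch schemas.
RAW = """
[{"session_id": "s1", "assistant_id": "bot", "language": "english",
  "context": "Be concise.",
  "conversation": [{"qa_id": "q1", "query": "Hi", "assistant": "Hello!"}]}]
"""

class Retriever(ABC):  # stand-in for fair_forge.core.retriever.Retriever
    @abstractmethod
    def load_dataset(self) -> list: ...

class JsonRetriever(Retriever):
    """Hypothetical retriever that parses sessions from a JSON string."""

    def __init__(self, raw: str = RAW, **kwargs):
        self.raw = raw

    def load_dataset(self) -> list:
        # A real implementation would validate each session into a Dataset.
        return json.loads(self.raw)

sessions = JsonRetriever().load_dataset()
print(sessions[0]["session_id"], len(sessions[0]["conversation"]))
```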

Data Structures

Dataset: A complete conversation session
from pydantic import BaseModel

class Dataset(BaseModel):
    session_id: str          # Unique session identifier
    assistant_id: str        # ID of the assistant being evaluated
    language: str | None     # Language code (e.g., "english")
    context: str             # System context/instructions
    conversation: list[Batch] # List of Q&A interactions
Batch: A single Q&A interaction
class Batch(BaseModel):
    qa_id: str                          # Unique interaction ID
    query: str                          # User question
    assistant: str                      # Assistant response
    ground_truth_assistant: str | None  # Expected response
    observation: str | None             # Additional notes
    agentic: dict | None                # Metadata
    ground_truth_agentic: dict | None   # Expected metadata
    logprobs: dict | None               # Log probabilities

Metric Architecture

Each metric follows this pattern:
from fair_forge.core.base import FairForge

class MyMetric(FairForge):
    def __init__(self, retriever, verbose=False, **kwargs):
        super().__init__(retriever, verbose, **kwargs)
        # Initialize metric-specific components

    def batch(self, session_id, context, assistant_id, batch, language):
        # Process the batch and compute metrics
        result = self._compute(batch)
        self.metrics.append(result)
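Putting the pieces together, the one-shot run() flow can be simulated with stand-ins. Base mimics the FairForge base class shown above, and MeanResponseLength is a hypothetical metric; only the structure matches the real library.

```python
class Base:  # stand-in for fair_forge.core.base.FairForge
    def __init__(self, retriever, verbose=False, **kwargs):
        self.retriever = retriever(**kwargs)
        self.metrics = []
        self.verbose = verbose

    def _process(self):
        # Iterate datasets and delegate each conversation to batch().
        for ds in self.retriever.load_dataset():
            self.batch(ds["session_id"], ds["context"],
                       ds["assistant_id"], ds["conversation"],
                       ds.get("language"))

    @classmethod
    def run(cls, retriever, **kwargs):
        instance = cls(retriever, **kwargs)
        instance._process()
        return instance.metrics

class DictRetriever:  # toy retriever returning plain dicts
    def __init__(self, **kwargs):
        pass

    def load_dataset(self):
        return [{"session_id": "s1", "assistant_id": "bot",
                 "context": "", "language": "english",
                 "conversation": [{"query": "Hi", "assistant": "Hello there"}]}]

class MeanResponseLength(Base):
    """Hypothetical metric: mean word count of assistant responses."""

    def batch(self, session_id, context, assistant_id, batch, language):
        lengths = [len(b["assistant"].split()) for b in batch]
        self.metrics.append({"session_id": session_id,
                             "mean_words": sum(lengths) / len(lengths)})

results = MeanResponseLength.run(DictRetriever)
print(results)
```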

Statistical Modes

Fair Forge supports two statistical approaches, frequentist and Bayesian, both implementing the StatisticalMode interface in fair_forge/statistical/.

FrequentistMode returns point estimates (floats):
from fair_forge.statistical import FrequentistMode

metrics = Toxicity.run(
    MyRetriever,
    statistical_mode=FrequentistMode(),
)
# Returns: metric.group_profiling.frequentist.DIDT = 0.33
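To illustrate the distinction between the two modes (these are not Fair Forge's actual estimators): a frequentist mode reduces a per-interaction signal to a single float, while a Bayesian mode characterizes it as a distribution, sketched here as a Beta posterior over a binary flag.

```python
# Illustrative only: frequentist point estimate vs. Bayesian summary
# for a binary per-interaction outcome (e.g. "flagged as biased").
flags = [1, 0, 0, 1, 0, 0]

# Frequentist: a single point estimate (a float).
point_estimate = sum(flags) / len(flags)

# Bayesian: with a Beta(1, 1) prior, the posterior over the rate is
# Beta(1 + successes, 1 + failures), summarized here by its mean.
a = 1 + sum(flags)
b = 1 + (len(flags) - sum(flags))
posterior_mean = a / (a + b)

print(round(point_estimate, 3), round(posterior_mean, 3))
```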

Module Structure

fair_forge/
├── core/
│   ├── base.py           # FairForge base class
│   ├── retriever.py      # Retriever abstract class
│   ├── guardian.py       # Guardian interface (bias detection)
│   ├── sentiment.py      # Sentiment analyzer interface
│   ├── loader.py         # Toxicity loader interface
│   └── extractor.py      # Group extractor interface
├── metrics/
│   ├── toxicity.py       # Toxicity metric
│   ├── bias.py           # Bias metric
│   ├── context.py        # Context metric
│   ├── conversational.py # Conversational metric
│   ├── humanity.py       # Humanity metric
│   └── best_of.py        # BestOf metric
├── schemas/
│   ├── common.py         # Dataset, Batch schemas
│   ├── toxicity.py       # Toxicity result schemas
│   ├── bias.py           # Bias result schemas
│   └── ...               # Other metric schemas
├── statistical/
│   ├── base.py           # StatisticalMode interface
│   ├── frequentist.py    # Frequentist implementation
│   └── bayesian.py       # Bayesian implementation
├── generators/           # Test dataset generation
├── runners/              # Test execution
├── storage/              # Storage backends
├── llm/                  # LLM integration (Judge)
├── guardians/            # Guardian implementations
├── extractors/           # Group extractor implementations
└── loaders/              # Toxicity lexicon loaders

Extension Points

Fair Forge is designed for extensibility:
| Component          | Interface       | Purpose                     |
| ------------------ | --------------- | --------------------------- |
| Retriever          | load_dataset()  | Load custom data sources    |
| Guardian           | is_biased()     | Custom bias detection       |
| SentimentAnalyzer  | infer()         | Custom sentiment analysis   |
| ToxicityLoader     | load()          | Custom toxicity lexicons    |
| BaseGroupExtractor | detect_one()    | Custom group detection      |
| StatisticalMode    | various methods | Custom statistical analysis |
| BaseRunner         | run_batch()     | Custom test execution       |
| BaseStorage        | load_datasets() | Custom storage backends     |

Next Steps