Generators Overview

Fair Forge generators create synthetic test datasets from your documentation, enabling automated testing of AI assistants without manual dataset creation.

Why Use Generators?

Save Time: Automatically create test cases from existing documentation

Better Coverage: Generate diverse questions across all your content

Consistent Quality: Structured question generation with difficulty levels

Easy Updates: Regenerate tests when documentation changes

Installation

uv pip install "alquimia-fair-forge[generators]"
uv pip install langchain-groq  # Or your preferred LLM provider

Quick Start

from fair_forge.generators import BaseGenerator, create_markdown_loader
from langchain_groq import ChatGroq

# Create a context loader
loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Create generator with an LLM
model = ChatGroq(model="llama-3.1-8b-instant", temperature=0.4)
generator = BaseGenerator(model=model, use_structured_output=True)

# Generate test dataset
datasets = await generator.generate_dataset(
    context_loader=loader,
    source="./documentation.md",
    assistant_id="my-assistant",
    num_queries_per_chunk=3,
    language="english",
)

# Use with metrics
for dataset in datasets:
    print(f"Generated {len(dataset.conversation)} test queries")

Key Components

BaseGenerator

The main class for generating test datasets:
from fair_forge.generators import BaseGenerator

generator = BaseGenerator(
    model=your_langchain_model,
    use_structured_output=True,
)

Context Loaders

Load and chunk your documentation:
from fair_forge.generators import create_markdown_loader

loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

Selection Strategies

Control how chunks are selected:
from fair_forge.generators import SequentialStrategy, RandomSamplingStrategy

# Process all chunks sequentially (default)
strategy = SequentialStrategy()

# Sample random chunks multiple times
strategy = RandomSamplingStrategy(
    num_samples=3,
    chunks_per_sample=5,
)
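The two strategies differ in how they walk the chunk list: sequential processing covers every chunk once in order, while random sampling draws several independent subsets. A self-contained sketch of the idea (hypothetical helper names, not the library's internals):

```python
import random

def sequential(chunks):
    """Yield every chunk once, in document order."""
    return list(chunks)

def random_sampling(chunks, num_samples=3, chunks_per_sample=5, seed=None):
    """Draw several independent random subsets of the chunks."""
    rng = random.Random(seed)
    k = min(chunks_per_sample, len(chunks))
    return [rng.sample(chunks, k) for _ in range(num_samples)]

chunks = [f"chunk-{i}" for i in range(10)]
assert sequential(chunks) == chunks
samples = random_sampling(chunks, num_samples=3, chunks_per_sample=5, seed=0)
assert len(samples) == 3 and all(len(s) == 5 for s in samples)
```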

Generation Modes

Independent Queries

Generate standalone questions:
datasets = await generator.generate_dataset(
    context_loader=loader,
    source="./docs",
    num_queries_per_chunk=3,
    conversation_mode=False,  # Default
)

Conversation Mode

Generate coherent multi-turn conversations:
datasets = await generator.generate_dataset(
    context_loader=loader,
    source="./docs",
    num_queries_per_chunk=3,
    conversation_mode=True,  # Each turn builds on previous
)

Output Format

Generated datasets follow the standard Fair Forge schema:
Dataset(
    session_id="generated-uuid",
    assistant_id="my-assistant",
    language="english",
    context="Combined chunk content...",
    conversation=[
        Batch(
            qa_id="chunk-1_q1",
            query="Generated question?",
            assistant="",  # Empty - to be filled by runner
            agentic={
                "difficulty": "medium",
                "query_type": "factual",
                "chunk_id": "doc_section_1",
            },
        ),
        ...
    ]
)
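Since `assistant` is left empty, a runner step must fill in each query's response before the dataset can be scored. A minimal sketch of that step, using plain stand-ins for the schema (the real `Dataset` and `Batch` classes come from Fair Forge):

```python
from dataclasses import dataclass, field

@dataclass
class Batch:  # stand-in for Fair Forge's Batch
    qa_id: str
    query: str
    assistant: str = ""
    agentic: dict = field(default_factory=dict)

def run_assistant(query: str) -> str:
    """Placeholder for a call to the assistant under test."""
    return f"Answer to: {query}"

conversation = [
    Batch(qa_id="chunk-1_q1", query="Generated question?",
          agentic={"difficulty": "medium", "query_type": "factual"}),
]

# Runner step: fill the empty assistant fields before evaluation.
for batch in conversation:
    batch.assistant = run_assistant(batch.query)

assert conversation[0].assistant == "Answer to: Generated question?"
```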

Workflow

(Diagram: documentation is loaded and chunked by a context loader, the generator produces queries for each chunk, a runner fills in the assistant responses, and the resulting datasets feed Fair Forge metrics.)

Supported LLM Providers

| Provider  | Import                                          | Notes                     |
|-----------|-------------------------------------------------|---------------------------|
| Groq      | `langchain_groq.ChatGroq`                       | Fast, free tier available |
| OpenAI    | `langchain_openai.ChatOpenAI`                   | GPT-4, GPT-3.5            |
| Google    | `langchain_google_genai.ChatGoogleGenerativeAI` | Gemini models             |
| Anthropic | `langchain_anthropic.ChatAnthropic`             | Claude models             |
| Ollama    | `langchain_ollama.ChatOllama`                   | Local models              |

Next Steps