Context Loaders

Context loaders prepare your documentation for test generation by parsing and chunking content.

Markdown Loader

The primary loader for markdown documentation:
from fair_forge.generators import create_markdown_loader

loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Load and chunk content
chunks = loader.load("./documentation.md")

for chunk in chunks:
    print(f"Chunk: {chunk.chunk_id}")
    print(f"Content: {chunk.content[:100]}...")

Parameters

create_markdown_loader

Parameter      | Type      | Default   | Description
max_chunk_size | int       | 2000      | Maximum characters per chunk
header_levels  | list[int] | [1, 2, 3] | Header levels to split on
min_chunk_size | int       | 100       | Minimum characters per chunk
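
All three parameters can be passed together; calling the factory with its defaults spelled out looks like this:
loader = create_markdown_loader(
    max_chunk_size=2000,
    min_chunk_size=100,
    header_levels=[1, 2, 3],
)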

Chunk Structure

Each chunk contains:
class ContentChunk:
    chunk_id: str      # Unique identifier
    content: str       # Text content
    metadata: dict     # Additional metadata
The chunk_id is derived from the file name and headers:
my_docs_getting_started_installation
|_____| |_____________| |__________|
  file      header 1      header 2
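
Because the ID encodes the file and header path, you can filter chunks for a specific section before generating. A small sketch (the exact ID values depend on your file and header names):
chunks = loader.load("./my_docs.md")

# Keep only chunks under the "Getting Started" header
getting_started = [c for c in chunks if "getting_started" in c.chunk_id]
print(f"{len(getting_started)} chunks in Getting Started")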

Examples

Basic Loading

from fair_forge.generators import create_markdown_loader

loader = create_markdown_loader(max_chunk_size=2000)

# Load single file
chunks = loader.load("./README.md")
print(f"Created {len(chunks)} chunks")

# Load directory
chunks = loader.load("./docs/")
print(f"Created {len(chunks)} chunks from all .md files")

Custom Chunk Sizes

# Small chunks for focused content
loader = create_markdown_loader(
    max_chunk_size=500,
    min_chunk_size=50,
)

# Large chunks for comprehensive sections
loader = create_markdown_loader(
    max_chunk_size=4000,
    min_chunk_size=500,
)
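
A quick way to see how this setting affects chunking is to compare chunk counts across sizes (a rough sketch; the exact counts depend on your documents):
# Compare how max_chunk_size affects the number of chunks produced
for size in (500, 2000, 4000):
    loader = create_markdown_loader(max_chunk_size=size)
    print(f"max_chunk_size={size}: {len(loader.load('./docs/'))} chunks")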

Header-Based Splitting

# Split only on H1 and H2
loader = create_markdown_loader(
    header_levels=[1, 2],
)

# Split on all header levels
loader = create_markdown_loader(
    header_levels=[1, 2, 3, 4, 5, 6],
)

With Generator

from fair_forge.generators import BaseGenerator, create_markdown_loader
from langchain_groq import ChatGroq

# Create loader
loader = create_markdown_loader(
    max_chunk_size=2000,
    header_levels=[1, 2, 3],
)

# Preview chunks
chunks = loader.load("./docs/api.md")
print(f"Will generate from {len(chunks)} chunks:")
for chunk in chunks:
    print(f"  - {chunk.chunk_id}: {len(chunk.content)} chars")

# Use with generator
model = ChatGroq(model="llama-3.1-8b-instant")
generator = BaseGenerator(model=model, use_structured_output=True)

datasets = await generator.generate_dataset(
    context_loader=loader,
    source="./docs/api.md",
    assistant_id="api-assistant",
    num_queries_per_chunk=3,
)
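
Since each chunk yields num_queries_per_chunk queries, the chunk preview above also gives a rough estimate of output size:
# Rough estimate of how many queries the call above will generate
print(f"Expecting roughly {len(chunks) * 3} queries ({len(chunks)} chunks x 3 per chunk)")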

Document Structure Best Practices

Good Structure

# Product Documentation

Overview content here...

## Getting Started

Introduction to getting started...

### Installation

Step-by-step installation guide...

### Configuration

Configuration options...

## API Reference

API documentation...

### Authentication

Auth details...
This creates logical, well-sized chunks.
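
Loading a document with this structure should produce chunk IDs that follow the file-and-header pattern shown earlier (a sketch; the exact IDs depend on how the loader normalizes file and header names):
chunks = loader.load("./product_docs.md")
for chunk in chunks:
    print(chunk.chunk_id)
# e.g. product_docs_getting_started_installation,
#      product_docs_api_reference_authentication, ...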

Avoid

# Everything in One Section

Very long content without any headers...
thousands of lines...
no structure...
This results in one huge chunk or arbitrary splitting.
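
One heuristic for spotting this problem (not part of the API, just a check you can run yourself) is to look for chunks that were cut at the size limit:
chunks = loader.load("./docs/")

# Chunks that hit max_chunk_size usually mean headers are missing
oversized = [c for c in chunks if len(c.content) >= 2000]
if oversized:
    print(f"{len(oversized)} chunks were cut at the size limit; consider adding headers")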

Supported Formats

Format   | Extension | Support
Markdown | .md       | Full support
MDX      | .mdx      | Parsed as markdown

Custom Loaders

Create custom loaders for other formats:
from fair_forge.generators.context_loaders.base import BaseContextLoader
from fair_forge.generators.schemas import ContentChunk

class CustomLoader(BaseContextLoader):
    def load(self, source: str) -> list[ContentChunk]:
        # Your loading logic goes here
        chunks = []

        # Parse your content (here, _read_content and _split_into_sections
        # are placeholder helpers you would implement yourself)
        content = self._read_content(source)
        sections = self._split_into_sections(content)

        for i, section in enumerate(sections):
            chunks.append(ContentChunk(
                chunk_id=f"section_{i}",
                content=section,
                metadata={"source": source},
            ))

        return chunks

# Use with generator
datasets = await generator.generate_dataset(
    context_loader=CustomLoader(),
    source="./data.custom",
    assistant_id="my-assistant",
    num_queries_per_chunk=3,
)
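
As a concrete illustration of the same pattern, here is a minimal plain-text loader; splitting on blank lines is an arbitrary choice for this sketch, not a requirement of the base class:
from fair_forge.generators.context_loaders.base import BaseContextLoader
from fair_forge.generators.schemas import ContentChunk

class PlainTextLoader(BaseContextLoader):
    def load(self, source: str) -> list[ContentChunk]:
        # Read the file and treat each blank-line-separated paragraph as a chunk
        with open(source, encoding="utf-8") as f:
            text = f.read()
        sections = [s.strip() for s in text.split("\n\n") if s.strip()]

        return [
            ContentChunk(
                chunk_id=f"paragraph_{i}",
                content=section,
                metadata={"source": source},
            )
            for i, section in enumerate(sections)
        ]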

Next Steps