Prompt Optimizer

The Prompt Optimizer module closes the evaluation loop: once Fair Forge measures where your agent fails, these tools automatically find the system prompt that fixes those failures.

How it fits into Fair Forge

Fair Forge measures metrics (Context, Conversational, etc.)
    ↓ failing examples become your dataset
GEPAOptimizer or MIPROv2Optimizer finds the best prompt
    ↓ optimized prompt deployed to your agent
Fair Forge measures again to confirm improvement
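The loop above can be sketched in plain Python. All names here are illustrative stand-ins, not Fair Forge APIs: the "metric" flags an example as failing when the prompt lacks its keyword, and the "optimizer" folds failing keywords back into the prompt.

```python
def measure(prompt: str, dataset: list[dict]) -> list[dict]:
    """Stand-in metric: an example 'fails' if its keyword is absent from the prompt."""
    return [ex for ex in dataset if ex["keyword"] not in prompt]

def optimize(prompt: str, failures: list[dict]) -> str:
    """Stand-in optimizer: fold each failure's keyword into the prompt."""
    for ex in failures:
        prompt += f" Always mention {ex['keyword']}."
    return prompt

dataset = [{"keyword": "context"}, {"keyword": "concise"}]
prompt = "You are a helpful assistant."

failures = measure(prompt, dataset)   # 1. measure: both examples fail
prompt = optimize(prompt, failures)   # 2. optimize on the failing examples
assert measure(prompt, dataset) == [] # 3. measure again to confirm improvement
```

The real optimizers are far more sophisticated, but the control flow is the same: failures in, improved prompt out, re-measure to verify.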

Available Optimizers

GEPA

Iteratively reads failures and generates improved prompt candidates. Best when the prompt itself is clearly wrong.

MIPROv2

Optimizes instruction AND few-shot examples simultaneously using Bayesian search. Best when format and tone matter as much as content.

When to use each

                  GEPA                        MIPROv2
Optimizes         Prompt instruction          Instruction + few-shot examples
Search strategy   Iterative, reads failures   Bayesian (Optuna/TPE)
Best for          Clearly bad prompts         Format-sensitive tasks
Speed             Fast (few iterations)       Slower (20+ trials)
Examples needed   No                          Yes — key differentiator
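The table reduces to a simple rule of thumb, sketched below as a helper function (purely illustrative, not part of the library):

```python
def pick_optimizer(have_examples: bool, format_sensitive: bool) -> str:
    """Rule of thumb from the comparison table (illustrative only).

    MIPROv2 requires few-shot examples and pays off on format-sensitive
    tasks; otherwise GEPA's fast, failure-driven iteration is the default.
    """
    if have_examples and format_sensitive:
        return "MIPROv2"
    return "GEPA"

pick_optimizer(have_examples=True, format_sensitive=True)   # "MIPROv2"
pick_optimizer(have_examples=False, format_sensitive=True)  # "GEPA"
```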

Installation

uv add "alquimia-fair-forge"
uv add langchain-groq  # or your preferred LLM provider

Common Pattern

Both optimizers follow the same Fair Forge pattern:
from fair_forge import Retriever
from fair_forge.schemas import Dataset

class MyRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        # Return your evaluation dataset
        ...

result = AnyOptimizer.run(  # stands in for GEPAOptimizer or MIPROv2Optimizer
    retriever=MyRetriever,
    model=model,
    seed_prompt="Your current (bad) system prompt.",
    objective="Plain language description of what a good response looks like.",
)

print(result.optimized_prompt)
The objective is the most important parameter. Describe what a good response looks like — the optimizer uses it to evaluate candidates and guide generation.

Custom Evaluator

By default both optimizers use an LLM judge based on the objective. For structured or deterministic tasks, pass a custom evaluator for sharper signal:
from fair_forge.prompt_optimizer import LLMEvaluator

# Default — describe criteria in natural language
evaluator = LLMEvaluator(
    model=model,
    criteria="Response must use only the provided context and be concise.",
)

# Custom — deterministic logic
def my_evaluator(actual: str, expected: str, query: str, context: str) -> float:
    # Return a float between 0.0 and 1.0, e.g. exact match:
    return 1.0 if actual.strip() == expected.strip() else 0.0
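Exact matching is all-or-nothing; for free-text answers a partial-credit evaluator often gives a smoother signal. A toy sketch (the four-argument signature mirrors the one above; the scoring logic is purely illustrative):

```python
def keyword_coverage(actual: str, expected: str, query: str, context: str) -> float:
    """Toy graded evaluator: fraction of expected tokens present in the response."""
    expected_tokens = set(expected.lower().split())
    if not expected_tokens:
        return 1.0  # nothing required, trivially satisfied
    actual_tokens = set(actual.lower().split())
    return len(expected_tokens & actual_tokens) / len(expected_tokens)

keyword_coverage("paris is the capital of france", "the capital is paris", "", "")  # 1.0
keyword_coverage("rome", "the capital is paris", "", "")                            # 0.0
```

Any callable with this signature returning a score in [0.0, 1.0] can serve as an evaluator; deterministic scoring like this avoids the cost and variance of an LLM judge on structured tasks.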

Output

print(f"Score: {result.initial_score:.2f} → {result.final_score:.2f}  ({result.n_examples} examples)")
print(result.optimized_prompt)