GEPA
GEPA (Generative Evolutionary Prompt Adaptation) evaluates a seed prompt against your dataset, identifies the examples that fail, and generates improved candidates that address those failures. It repeats until the score stops improving or the iteration budget is exhausted.
How it works
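The loop can be sketched in a few lines of Python. Everything below is illustrative: `evaluate` and `generate_candidates` stand in for the library's internals, which this sketch does not reproduce.

```python
def gepa_loop(seed_prompt, dataset, evaluate, generate_candidates,
              iterations=5, candidates_per_iteration=3, failure_threshold=0.6):
    """Evaluate, collect failures, propose candidates, keep the best prompt."""
    best_prompt = seed_prompt
    best_score = evaluate(best_prompt, dataset)
    for _ in range(iterations):
        # Examples scoring below the threshold drive the next round of candidates.
        failures = [ex for ex in dataset
                    if evaluate(best_prompt, [ex]) < failure_threshold]
        if not failures:
            break  # nothing left to fix
        improved = False
        for candidate in generate_candidates(best_prompt, failures,
                                             candidates_per_iteration):
            score = evaluate(candidate, dataset)
            if score > best_score:
                best_prompt, best_score = candidate, score
                improved = True
        if not improved:
            break  # score stopped improving
    return best_prompt, best_score
```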
Installation
Basic Usage
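A minimal usage sketch, assuming the optimizer is exposed as a GEPA class with an optimize() method — GEPA, MyRetriever, optimize(), and the result fields are illustrative names here; check the actual import path and entry point in your installation:

```python
from langchain_openai import ChatOpenAI  # any LangChain-compatible model works

optimizer = GEPA(
    retriever=MyRetriever,                 # your Retriever subclass, returns list[Dataset]
    model=ChatOpenAI(model="gpt-4o"),
    seed_prompt="You are an assistant.",
    objective="Answer billing questions concisely and cite the policy section.",
    iterations=5,
    candidates_per_iteration=3,
)
result = optimizer.optimize()
print(result.best_prompt, result.best_score)
```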
Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class returning list[Dataset] |
| model | BaseChatModel | LangChain-compatible model used for candidate generation |
| seed_prompt | str | Current system prompt to improve |
| objective | str | Plain-language description of what a good response looks like |
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| executor | Callable | Default executor | Function that calls your agent: (prompt, query, context) → str |
| evaluator | Callable | LLMEvaluator | Function that scores a response: (actual, expected, query, context) → float |
| iterations | int | 5 | Maximum number of improvement iterations |
| candidates_per_iteration | int | 3 | Candidate prompts generated per iteration |
| failure_threshold | float | 0.6 | Score below which an example is considered failing |
Custom Executor
By default, GEPA calls the model directly with the candidate prompt. If your agent is more complex (has memory, tools, or an API), pass a custom executor:
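A sketch of the expected shape, using a stand-in agent object (replace StubAgent with your own agent, memory, or API client):

```python
class StubAgent:
    """Stand-in for a real agent with memory, tools, or an API behind it."""
    def run(self, messages: list[dict]) -> str:
        return "echo: " + messages[-1]["content"]

agent = StubAgent()

def agent_executor(prompt: str, query: str, context: str) -> str:
    """Matches GEPA's executor signature: (prompt, query, context) -> str.

    The candidate prompt replaces the system message on every call, so
    GEPA's improvements actually reach the agent.
    """
    messages = [{"role": "system", "content": prompt}]
    if context:
        messages.append({"role": "user", "content": context})
    messages.append({"role": "user", "content": query})
    return agent.run(messages)
```

Pass it as executor=agent_executor.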
Custom Evaluator
For structured or deterministic tasks, a custom evaluator gives a sharper signal than the LLM judge. Use a custom evaluator when the task has deterministic success criteria (valid JSON, specific fields, exact format); use the default LLMEvaluator when quality is subjective (tone, clarity, factual grounding).
Output Schema
OptimizationResult
IterationResult
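The library defines the concrete fields; the shapes below are an assumption inferred from the parameters above, not the actual schema — consult the package source for the real definitions:

```python
from dataclasses import dataclass, field

# Hypothetical shapes only -- field names are assumptions.
@dataclass
class IterationResult:
    iteration: int
    candidate_prompt: str
    score: float

@dataclass
class OptimizationResult:
    best_prompt: str
    best_score: float
    iterations: list[IterationResult] = field(default_factory=list)
```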
Score interpretation
The score is the average, across all examples, of what the evaluator returns (0.0–1.0). With the default LLMEvaluator, it represents how well the agent follows the objective criteria as judged by the LLM.
| Score | Interpretation |
|---|---|
| 0.8–1.0 | Excellent — agent consistently meets the objective |
| 0.6–0.8 | Good — minor deviations |
| 0.4–0.6 | Moderate — frequent misses |
| 0.0–0.4 | Poor — prompt is clearly inadequate |
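The aggregation is a plain average. For instance, per-example scores of 1.0, 0.5, and 0.9 average to 0.8, landing in the 0.8–1.0 band:

```python
def dataset_score(example_scores: list[float]) -> float:
    """Average of per-example evaluator scores, each in 0.0-1.0."""
    return sum(example_scores) / len(example_scores)
```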
LLM Provider Options
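Any LangChain-compatible chat model can be passed as model. Two common choices, using the standard LangChain integration packages (confirm which providers this library has been tested with):

```python
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o")

# or Anthropic
from langchain_anthropic import ChatAnthropic
model = ChatAnthropic(model="claude-3-5-sonnet-20241022")
```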
Best Practices
Write a specific objective
Vague objectives produce vague improvements. Be explicit about what the agent should and should not do:
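For instance (the wording below is illustrative, not from the library):

```python
# Vague -- gives the optimizer nothing concrete to aim at:
objective = "Be helpful."

# Specific -- names the behaviors to optimize for:
objective = (
    "Answer customer billing questions in at most two sentences, "
    "name the relevant policy section, and never promise refunds "
    "that the policy does not allow."
)
```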
Start with a deliberately bad seed prompt
A seed prompt like "You are an assistant." gives GEPA maximum room to improve and produces a more dramatic demonstration of the optimization. If your current prompt is already decent, the improvement will be smaller.
Use a deterministic evaluator for structured tasks
If the expected output follows a strict format (JSON, numbered steps, specific fields), write a deterministic evaluator. It gives GEPA a much clearer signal than an LLM judge on whether a candidate is correct.
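For example, a deterministic evaluator for JSON output, following the (actual, expected, query, context) → float signature from the parameters above (the partial-credit scoring rule here is illustrative):

```python
import json

def json_evaluator(actual: str, expected: str, query: str, context: str) -> float:
    """Return 0.0 for invalid JSON, else the fraction of expected keys matched."""
    try:
        got = json.loads(actual)
    except json.JSONDecodeError:
        return 0.0
    want = json.loads(expected)
    if not want:
        return 1.0
    matched = sum(1 for key, value in want.items() if got.get(key) == value)
    return matched / len(want)
```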
Dataset size
8–20 examples is enough for GEPA. Too few examples and the signal is noisy; too many and each iteration becomes slow. If you have many examples, consider sampling a representative subset.
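Sampling a representative subset can be as simple as a seeded random draw (stratify by category instead if your dataset mixes distinct example types):

```python
import random

def sample_subset(dataset: list, k: int = 12, seed: int = 0) -> list:
    """Draw a reproducible evaluation subset of at most k examples."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(k, len(dataset)))
```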
Next Steps
MIPROv2
Optimize instruction AND few-shot examples simultaneously
Retriever
Build a Retriever to load your evaluation dataset