MIPROv2
MIPROv2 (Multiprompt Instruction PRoposal Optimizer v2) optimizes two things simultaneously: the system prompt instruction and the few-shot examples embedded in it. It uses Bayesian optimization (Optuna's TPE sampler) to search combinations efficiently without testing all of them.

How it works
Phase 1 — Proposal
- InstructionProposer generates N instruction variants from the seed prompt, each emphasizing a different aspect (conciseness, format, grounding, tone, etc.)
- DemoBootstrapper creates M sets of few-shot examples by sampling from the dataset
Phase 2 — Search
- Each trial picks a combination (instruction_i, demo_set_j)
- Evaluates it on a minibatch of the dataset
- Optuna's TPE models which combinations are most promising and prioritizes those
The best (instruction, demo_set) pair is assembled into a ready-to-use system prompt.
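The trial loop above can be sketched in a few lines. This is a minimal stand-in, not the library's implementation: random sampling substitutes for Optuna's TPE sampler, and all names are illustrative.

```python
import random

def run_trials(instructions, demo_sets, minibatch, evaluate, num_trials=20, seed=42):
    """Search (instruction, demo_set) combinations and return the best pair.

    `evaluate(instruction, demos, example)` must return a score in [0.0, 1.0].
    Random sampling stands in for Optuna's TPE sampler here.
    """
    rng = random.Random(seed)
    best_inst, best_demos, best_score = None, None, -1.0
    for _ in range(num_trials):
        # Each trial picks one (instruction_i, demo_set_j) combination
        inst = rng.choice(instructions)
        demos = rng.choice(demo_sets)
        # Score is the average evaluator result over the minibatch
        score = sum(evaluate(inst, demos, ex) for ex in minibatch) / len(minibatch)
        if score > best_score:
            best_inst, best_demos, best_score = inst, demos, score
    return best_inst, best_demos, best_score
```

TPE improves on this by modeling which regions of the search space scored well and sampling those more often, instead of choosing uniformly.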
When to use MIPROv2 over GEPA
MIPROv2 shines when format and style matter as much as content — situations where seeing worked examples teaches the model what to do better than instructions alone. Typical cases: a technical troubleshooting bot that must follow a 4-section structure, a support bot that must respond step-by-step, or a classification agent that must match a specific output format.

Installation
Basic Usage
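A minimal usage sketch. The constructor arguments mirror the required parameters below, but the import path, the run() method, and the best_prompt field are assumptions to check against your installed version, not confirmed API:

```python
# Hypothetical import path -- substitute your package's actual module.
from your_package import MIPROv2

optimizer = MIPROv2(
    retriever=SupportTicketRetriever,  # your Retriever subclass (illustrative name)
    model=chat_model,                  # any LangChain-compatible BaseChatModel
    seed_prompt="You are a helpful support agent.",
    objective="Respond with numbered troubleshooting steps, ending with a verification step.",
)
result = optimizer.run()   # method name is an assumption
print(result.best_prompt)  # field name is an assumption
```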
Parameters
Required
| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class returning list[Dataset] |
| model | BaseChatModel | LangChain-compatible model for instruction generation and evaluation |
| seed_prompt | str | Current system prompt to improve |
| objective | str | Plain-language description of what a good response looks like |
Optional
| Parameter | Type | Default | Description |
|---|---|---|---|
| executor | Callable | Default executor | Function that calls your agent: (prompt, query, context) → str |
| evaluator | Callable | LLMEvaluator | Function that scores a response: (actual, expected, query, context) → float |
| num_candidates | int | 10 | Instruction variants to generate in Phase 1 |
| num_trials | int | 20 | Bayesian optimization trials (combinations to evaluate) |
| minibatch_size | int | 25 | Examples per trial (uses the full dataset if it is smaller) |
| max_demos_per_set | int | 3 | Maximum few-shot examples per demo set |
| num_demo_sets | int | 5 | Demo set variants to generate in Phase 1 |
| random_seed | int | 42 | Seed for reproducibility |
| tips | list[str] | Built-in list | Focus areas used to guide instruction generation (e.g. "Be concise", "Specify output format") |
| instruction_proposal_system | str | Built-in prompt | System prompt used when calling the LLM to generate instruction candidates |
| instruction_proposal_user | str | Built-in prompt | User prompt template for instruction generation (must contain {seed_prompt}, {objective}, {n}, {tips}) |
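The executor and evaluator parameters above are plain callables, so custom ones can be sketched without the library. The bodies below are illustrative stand-ins (a canned reply and exact-match scoring), not the defaults; only the signatures come from the table.

```python
def executor(prompt: str, query: str, context: str) -> str:
    """Calls your agent with the candidate system prompt and returns its reply.

    Stand-in body: a real executor would invoke your model, chain, or API here.
    """
    return f"[{prompt}] answer to: {query}"

def evaluator(actual: str, expected: str, query: str, context: str) -> float:
    """Scores a response in [0.0, 1.0].

    Stand-in body: exact match in place of the default LLM-as-judge scoring.
    """
    return 1.0 if actual.strip() == expected.strip() else 0.0
```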
Output Schema
MIPROv2Result
Inspecting results
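The exact fields of the result object aren't reproduced here; as a stand-in, the shape to expect is roughly the following. Every field name below is an assumption to check against your installed version.

```python
from dataclasses import dataclass, field

@dataclass
class MIPROv2Result:
    """Illustrative stand-in for the real result class; field names are assumptions."""
    best_prompt: str                       # winning instruction + few-shot demos, ready to use
    best_score: float                      # average evaluator score of the winning trial
    trials: list = field(default_factory=list)  # per-trial history, if exposed

result = MIPROv2Result(best_prompt="You are a helpful agent...", best_score=0.82)
print(f"best score: {result.best_score:.2f}")
```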
Score interpretation
| Score | Interpretation |
|---|---|
| 0.8–1.0 | Excellent — agent consistently meets the objective |
| 0.6–0.8 | Good — minor deviations |
| 0.4–0.6 | Moderate — frequent misses |
| 0.0–0.4 | Poor — prompt is clearly inadequate |
The score is the average, across all examples, of what the evaluator returns (0.0–1.0). With the default LLMEvaluator, it represents how well the agent follows the objective as judged by the LLM.

Custom Executor
If your agent is more than a direct model call (it has memory, tools, or an API), pass a custom executor with the (prompt, query, context) → str signature described above.

LLM Provider Options
Best Practices
Choose tasks where examples matter
MIPROv2 is most valuable when the expected output follows a specific format or style that’s hard to fully specify in instructions. A structured diagnosis template, a numbered step-by-step response, or a classification with a fixed schema are ideal cases.
Write a specific objective
The objective drives both instruction generation and LLM-based evaluation. Include the output format you expect:
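For example (the wording is illustrative; the point is to name the expected structure explicitly rather than describe quality in the abstract):

```python
# Vague -- gives the instruction proposer and the evaluator little to aim at:
vague = "Respond helpfully and accurately."

# Specific -- names the structure a good response must follow:
specific = (
    "Diagnose the user's issue in exactly four sections: "
    "Symptoms, Likely Cause, Fix Steps (numbered), and Verification. "
    "Keep each section under 50 words."
)
```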
Ground truth is the few-shot source
MIPROv2 builds demo sets from your ground_truth_assistant field. The quality of your ground truth directly determines the quality of the few-shot examples the optimizer can select, so write ideal responses in your dataset.
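A dataset example might carry its ideal response like this. Apart from ground_truth_assistant, which the section above names, the field names here are illustrative assumptions:

```python
example = {
    "user": "My build fails with a missing dependency error.",  # illustrative field
    "context": "CI logs attached.",                             # illustrative field
    # The optimizer samples few-shot demos from this field, so make it an
    # ideal response in exactly the format you want the agent to produce:
    "ground_truth_assistant": (
        "Symptoms: build fails during dependency resolution.\n"
        "Likely Cause: lockfile is out of date.\n"
        "Fix Steps: 1. Regenerate the lockfile. 2. Re-run the build.\n"
        "Verification: CI pipeline passes."
    ),
}
```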
Tune num_trials vs speed
More trials means better coverage of the search space but slower execution. With num_candidates=10 and num_demo_sets=5 there are 50 possible combinations; num_trials=20 covers 40% of that space, guided by Bayesian search, which is usually enough.

Next Steps
GEPA
Iterative prompt improvement from failures
Retriever
Build a Retriever to load your evaluation dataset