Prompt Evaluator
PromptEvaluator scores a system prompt by observing how stable its response distribution is across repeated sampling. Instead of a single LLM judge call, it runs the prompt K times per query and computes two objective, reproducible signals: CSR (Consistency/Stability Rate) and Stability (inverted Semantic Entropy). When reference responses are present in the dataset, a third signal — RSS (Reference Similarity Score) — is computed automatically. An optional LLM judge (JQ) can be added on demand.
Overview
- CSR — Fraction of K responses that fall in the dominant semantic cluster. High CSR means the prompt reliably produces the same meaning.
- Stability — 1 - SE_n, where SE_n is the normalized Semantic Entropy over the cluster distribution. High stability means the distribution of meanings is focused, not scattered.
- RSS — Average cosine similarity between K generated responses and the reference response. Enabled automatically when `ground_truth_assistant` is present in the dataset.
- ICR — Average fraction of verifiable prompt constraints satisfied across K responses. Enabled automatically when `constraints` are passed. No additional model calls — purely deterministic.
- JQ — LLM-as-judge score averaged over K responses. Enabled via `jq_enabled=True`.
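The two core signals reduce to simple arithmetic over cluster counts. A minimal, self-contained sketch of that arithmetic (integer cluster labels stand in for the semantic clusters the evaluator derives from embeddings, and normalizing entropy by log K is an assumption on our part):

```python
import math
from collections import Counter

def csr_and_stability(cluster_labels: list[int]) -> tuple[float, float]:
    """Compute CSR and Stability from per-sample semantic cluster labels."""
    k = len(cluster_labels)
    counts = Counter(cluster_labels)
    # CSR: fraction of the K samples that fall in the dominant cluster.
    csr = max(counts.values()) / k
    # Semantic Entropy over the cluster distribution, normalized by the
    # maximum possible entropy (every sample in its own cluster).
    probs = [c / k for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)
    se_n = entropy / math.log(k) if k > 1 else 0.0
    # Stability: 1 - SE_n, high when the distribution of meanings is focused.
    return csr, 1.0 - se_n

# 8 of 10 samples share one meaning, with two stragglers:
csr, stability = csr_and_stability([0] * 8 + [1, 2])
```

With this toy input, CSR is 0.8 and Stability is around 0.72: a clear dominant meaning with mild scatter.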
How it works
The model must be configured with temperature > 0 for K samples to vary. With `temperature=0`, every response is identical → CSR = 1.0 always, which gives no diagnostic signal.
Installation
Basic Usage
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| `retriever` | `Type[Retriever]` | Data source class supplying the dataset |
| `model` | `BaseChatModel` | LangChain-compatible model — used as executor and optional JQ judge |
| `seed_prompt` | `str` | The system prompt under evaluation |
| `embedder` | `Embedder` | Embedding model for semantic clustering and RSS |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | `int` | `10` | Number of samples generated per query |
| `tau` | `float` | `0.80` | Cosine similarity threshold for semantic clustering |
| `constraints` | `list[Constraint]` | `None` | Verifiable prompt constraints — activates ICR |
| `statistical_mode` | `StatisticalMode` | `FrequentistMode()` | Statistical computation mode |
| `jq_enabled` | `bool` | `False` | Enable LLM-as-judge scoring |
| `objective` | `str` | `""` | Judge criteria — required when `jq_enabled=True` |
| `executor` | `Executor` | LLM invocation | Custom callable `(prompt, query, context) → str` |
| `verbose` | `bool` | `False` | Enable verbose logging |
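For illustration, a minimal custom `executor` matching the callable signature in the table. The body here is a hypothetical stand-in (a real pipeline would call an actual LLM client or API wrapper):

```python
def rag_executor(prompt: str, query: str, context: str) -> str:
    """Hypothetical custom executor with the (prompt, query, context) -> str
    signature. A real implementation would invoke your model client here."""
    # Stand-in for a vector-store lookup in a RAG pipeline:
    retrieved = f"[retrieved for: {query}]"
    return f"{prompt}\n{retrieved}\n{context}\nAnswer to: {query}"

# Would be passed as executor=rag_executor; the evaluator calls it K times per query.
response = rag_executor("You are terse.", "What is CSR?", "")
```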
Output Schema
PromptEvaluatorMetric
QuerySampleMetrics
Interpretation
CSR — Consistency/Stability Rate
| Value | Interpretation |
|---|---|
| 0.9–1.0 | Excellent — the prompt produces the same meaning almost every run |
| 0.7–0.9 | Good — occasional variation but a clear dominant response |
| 0.5–0.7 | Moderate — noticeable semantic scatter, review prompt clarity |
| < 0.5 | Poor — responses are scattered across many different meanings |
Stability — 1 - Semantic Entropy
| Value | Interpretation |
|---|---|
| 0.9–1.0 | Excellent — responses cluster tightly, very focused distribution |
| 0.7–0.9 | Good — distribution has a clear mode |
| 0.5–0.7 | Moderate — multiple competing meaning groups |
| < 0.5 | Poor — high entropy, unpredictable output distribution |
Comparing Two Prompts
| Prompt A | Prompt B | Conclusion |
|---|---|---|
| CSR=0.9, Stability=0.85 | CSR=0.6, Stability=0.5 | A is more consistent and focused |
| CSR=0.9 (no RSS) | CSR=0.9, RSS=0.3 | A is consistent but may be consistently wrong |
| CSR=0.6, RSS=0.8 | CSR=0.9, RSS=0.4 | A’s varied responses are closer to the reference |
Clustering Threshold τ
The `tau` parameter controls how similar two responses must be to be considered semantically equivalent.
| Range | Behavior | When to use |
|---|---|---|
| < 0.70 | Very permissive — groups unrelated responses together | Not recommended |
| 0.75–0.85 | Balanced — separates paraphrases from different content | Recommended default |
| 0.85–0.92 | Strict — only nearly identical responses cluster together | High-precision evaluations |
| > 0.92 | Very strict — almost all responses are separate clusters | Not recommended |
The default is `tau=0.80`.
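The effect of τ can be seen with a toy greedy clustering pass. This is a sketch of threshold-based clustering in general, not necessarily the algorithm PromptEvaluator uses internally; the 2-D vectors stand in for real response embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def greedy_clusters(embeddings, tau):
    """Assign each embedding to the first cluster whose representative is
    within cosine similarity tau; otherwise start a new cluster."""
    reps, labels = [], []
    for e in embeddings:
        for i, r in enumerate(reps):
            if cosine(e, r) >= tau:
                labels.append(i)
                break
        else:
            reps.append(e)
            labels.append(len(reps) - 1)
    return labels

# Three near-duplicates and one outlier (toy 2-D "embeddings"):
vecs = [(1.0, 0.0), (0.98, 0.2), (0.95, 0.3), (0.0, 1.0)]
loose = greedy_clusters(vecs, tau=0.80)   # paraphrases merge into one cluster
strict = greedy_clusters(vecs, tau=0.99)  # near-duplicates start to separate
```

At τ=0.80 the three similar vectors form one cluster (two clusters total); at τ=0.99 they split apart (three clusters), which is why an overly strict τ deflates CSR.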
Choosing K
| K | Quality | Cost | Recommendation |
|---|---|---|---|
| ≤3 | Unstable — mode can flip with a single different sample | Low | Sanity checks only |
| 5–10 | Reasonable for prompts with clear behavior | Moderate | Default |
| 20–40 | Stable — reliable CSR and SE estimates | High | Comparing similar prompts |
| > 40 | Minimal marginal gain | Very high | Rarely justified |
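The "mode can flip" risk at small K can be made concrete with an exact binomial calculation. Suppose the prompt's true dominant meaning appears with probability 0.6 per sample; the chance that it fails to win a strict majority of the K samples shrinks as K grows. This is a back-of-the-envelope sketch, not part of the evaluator:

```python
from math import comb

def p_mode_flip(k: int, p: float) -> float:
    """Exact binomial probability that the true dominant meaning gets at most
    k//2 of k samples, i.e. fails to hold a strict majority."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1))

flip_k3 = p_mode_flip(3, 0.6)    # 0.352: at K=3 the mode flips often
flip_k20 = p_mode_flip(20, 0.6)  # much smaller: estimates settle down
```

At K=3 the dominant meaning loses the majority about 35% of the time, which is why tiny K is suitable for sanity checks only.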
Data Requirements
The `assistant` field in each Batch is not used — PromptEvaluator generates its own responses by running `seed_prompt` against each query. `ground_truth_assistant` activates RSS automatically when present.
Statistical Modes
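Both modes boil down to simple arithmetic over the dominant-cluster count. A sketch of the two estimates; the Bayesian variant shown here assumes a Beta-posterior update, which is a common choice but not confirmed as the library's actual implementation:

```python
def csr_frequentist(n_dominant: int, k: int) -> float:
    """Direct point estimate: n_dominant / K."""
    return n_dominant / k

def csr_bayesian_mean(n_dominant: int, k: int, a: float = 1.0, b: float = 1.0) -> float:
    """Assumed sketch: posterior mean of a Beta(a, b) prior updated with
    n_dominant successes in K trials. May differ from the library's BayesianMode."""
    return (n_dominant + a) / (k + a + b)

freq = csr_frequentist(9, 10)     # 0.9
bayes = csr_bayesian_mean(9, 10)  # (9+1)/(10+2), shrunk toward 0.5
```

The Bayesian estimate is more conservative at small K, which matters when comparing prompts on few samples.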
- Frequentist (default) — Computes CSR as `n_dominant / K`, a direct point estimate. Fast and simple.
- Bayesian
Instruction Compliance Rate (ICR)
ICR measures whether the model’s responses satisfy programmatically verifiable constraints embedded in the prompt — things like responding in JSON, staying under a word limit, or including a required keyword. It is computed deterministically: no extra model calls.
When to use it
Use ICR when your prompt tells the model how to respond — not just what to say, but in what format, length, or structure. If your prompt contains instructions like “always respond in JSON”, “keep your answer under 100 words”, or “always include the word ‘source’”, those are verifiable constraints and ICR will measure whether the model follows them consistently. If your prompt has no explicit formatting rules, skip ICR — CSR and RSS already cover what you need. Pass a list of `Constraint` objects via the `constraints` parameter. ICR activates automatically and appears in `print(result)`.
Built-in Constraints
| Constraint | What it checks |
|---|---|
| `JsonConstraint()` | Response is valid JSON |
| `WordCountConstraint(max_words)` | Response has at most N words |
| `KeywordConstraint(keyword)` | Response contains the keyword (case-insensitive by default) |
| `RegexConstraint(pattern)` | Response matches a regular expression |
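Checks like those in the table are cheap, deterministic string operations. A sketch of how such checks might be implemented and averaged into a per-response score (function names here are illustrative, not the library's real constraint classes):

```python
import json

def json_ok(resp: str) -> bool:
    """Like JsonConstraint: response parses as valid JSON."""
    try:
        json.loads(resp)
        return True
    except json.JSONDecodeError:
        return False

def word_count_ok(resp: str, max_words: int) -> bool:
    """Like WordCountConstraint: at most N whitespace-separated words."""
    return len(resp.split()) <= max_words

def keyword_ok(resp: str, keyword: str) -> bool:
    """Like KeywordConstraint: contains the keyword, case-insensitive."""
    return keyword.lower() in resp.lower()

def icr_for_response(resp: str) -> float:
    """Fraction of the configured constraints satisfied by one response."""
    checks = [json_ok(resp), word_count_ok(resp, 50), keyword_ok(resp, "source")]
    return sum(checks) / len(checks)

score = icr_for_response('{"answer": "see the source"}')  # satisfies all three
```

No model call is ever made, which is why ICR is free relative to JQ.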
ICR is averaged across K responses per query, then averaged across queries. A score of `0.8` means the model satisfies 80% of the provided constraints on average.
Custom Executor
Replace the default LLM invocation with your own function — useful for systems with custom API wrappers, tool injection, or RAG pipelines.
LLM Provider Options
Best Practices
Always use temperature > 0
CSR and Stability are distributional metrics — they need variance across K samples. With `temperature=0`, every response is identical and CSR = 1.0 regardless of prompt quality. A value of 0.7 is a safe default for most providers.
Include ground_truth_assistant to get RSS
RSS is the only signal that detects consistent incorrectness — a prompt that consistently answers wrong will have high CSR but low RSS. Always provide reference responses when correctness matters.
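RSS itself is just the mean cosine similarity between each response embedding and the reference embedding. A toy sketch of the "consistently wrong" failure mode (2-D vectors stand in for real embeddings):

```python
import math

def rss(response_embs, reference_emb):
    """Average cosine similarity between K response embeddings and the reference."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return sum(cos(e, reference_emb) for e in response_embs) / len(response_embs)

# Responses agree with each other but point away from the reference,
# so CSR would be high while RSS stays low:
wrong_but_consistent = [(0.0, 1.0), (0.1, 0.99), (0.05, 1.0)]
reference = (1.0, 0.0)
low_rss = rss(wrong_but_consistent, reference)
```

High CSR with low RSS is exactly the signature of a prompt that is stable but incorrect.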
Start with K=10, increase if prompts are similar
K=10 is sufficient for most evaluations. When comparing two prompts that differ subtly in wording, use K=20–40 so that small distributional differences are reliably detected.
Tune τ for your domain
Conversational or creative tasks benefit from a lower τ (0.75) since style variability is expected. Structured output tasks (JSON, SQL) benefit from a higher τ (0.85–0.90) since responses should be nearly identical when correct.
Use ICR for prompts with explicit formatting rules
If your prompt instructs the model to respond in JSON, limit word count, or include specific keywords, encode these as constraints and pass them to PromptEvaluator. ICR will tell you whether the model reliably follows those rules — at zero extra cost since it is purely deterministic.
Use JQ sparingly — it adds LLM calls
`jq_enabled=True` calls the judge model K times per query. With K=10 and 20 queries, that is 200 additional judge calls per evaluation. Reserve it for cases where RSS alone is insufficient (e.g., no reference responses or nuanced quality criteria).
Troubleshooting
CSR is always 1.0
The model temperature is likely 0. Set `temperature=0.7` (or higher) on the model before passing it to PromptEvaluator.
RSS is None even though I have ground truth
Check that `ground_truth_assistant` is not an empty string in your Batch objects. PromptEvaluator treats empty strings as absent references and skips RSS.
Stability is 0 for every query
All K responses are in separate clusters — every response has a completely different meaning. This indicates a severely underspecified or ambiguous prompt. Lower τ slightly (try 0.70) to confirm, or inspect the raw responses with `verbose=True`.
Evaluation is very slow
Each query makes K model calls. With K=10 and 50 queries that is 500 calls. Reduce K (try K=5) for fast iteration, then increase for final evaluation. You can also parallelize queries by subclassing the executor.
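One general pattern for that parallelization is to fan the K samples for a query (or the queries themselves) out over a thread pool. A sketch with a stand-in model call; it assumes the underlying client is thread-safe, which you should verify for your provider:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str, query: str) -> str:
    """Stand-in for one LLM call; replace with your executor's real invocation."""
    return f"response to {query!r}"

def sample_k_parallel(prompt: str, query: str, k: int = 10, workers: int = 5) -> list[str]:
    """Fire the K samples for one query concurrently instead of sequentially."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(call_model, prompt, query) for _ in range(k)]
        return [f.result() for f in futures]

responses = sample_k_parallel("You are terse.", "What is CSR?", k=5)
```

With I/O-bound LLM calls, this typically cuts wall-clock time roughly by the worker count.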
Next Steps
Statistical Modes
Deep dive into Frequentist vs Bayesian approaches
Prompt Optimizer
Automatically improve prompts using GEPA or MIPROv2
Best Of
Compare multiple prompt variants head-to-head