Prompt Evaluator

PromptEvaluator scores a system prompt by observing how stable its response distribution is across repeated sampling. Instead of a single LLM judge call, it runs the prompt K times per query and computes two objective, reproducible signals: CSR (Consistency/Stability Rate) and Stability (inverted Semantic Entropy). When reference responses are present in the dataset, a third signal — RSS (Reference Similarity Score) — is computed automatically, and when verifiable constraints are provided, ICR (Instruction Compliance Rate) is computed as well. An optional LLM judge (JQ) can be added on demand.

Overview

  • CSR — Fraction of K responses that fall in the dominant semantic cluster. High CSR means the prompt reliably produces the same meaning.
  • Stability — 1 - SE_n, where SE_n is the normalized Semantic Entropy over the cluster distribution. High stability means the distribution of meanings is focused, not scattered.
  • RSS — Average cosine similarity between K generated responses and the reference response. Enabled automatically when ground_truth_assistant is present in the dataset.
  • ICR — Average fraction of verifiable prompt constraints satisfied across K responses. Enabled automatically when constraints are passed. No additional model calls — purely deterministic.
  • JQ — LLM-as-judge score averaged over K responses. Enabled via jq_enabled=True.

How it works

For each query in the dataset:
  1. Run seed_prompt + query → K responses (with temperature > 0)
  2. Embed all K responses
  3. Cluster by cosine similarity ≥ τ (union-find)
  4. CSR    = size of dominant cluster / K
  5. SE_n   = normalized entropy over cluster distribution
  6. Stability = 1 - SE_n
  7. RSS    = avg cosine similarity to reference (if available)
  8. ICR    = avg fraction of constraints satisfied (if provided)
  9. JQ     = avg LLM judge score (if jq_enabled=True)

Final metric = mean of each signal across all queries
The model must be configured with temperature > 0 for K samples to vary. With temperature=0, every response is identical → CSR = 1.0 always, which gives no diagnostic signal.
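
Steps 3–6 above can be sketched in plain Python. This is an illustrative reimplementation, not the library's internals; in particular, normalizing entropy by the log of the cluster count is an assumption (the library may normalize by log K instead):

```python
import math

def cluster_by_similarity(embeddings, tau=0.80):
    """Union-find clustering: link any pair with cosine similarity >= tau."""
    k = len(embeddings)
    parent = list(range(k))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    for i in range(k):
        for j in range(i + 1, k):
            if cosine(embeddings[i], embeddings[j]) >= tau:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(k):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return list(sizes.values())

def csr_and_stability(cluster_sizes):
    """CSR = dominant cluster fraction; Stability = 1 - normalized entropy."""
    k = sum(cluster_sizes)
    csr = max(cluster_sizes) / k
    if len(cluster_sizes) == 1:
        return csr, 1.0  # a single cluster has zero entropy
    probs = [s / k for s in cluster_sizes]
    entropy = -sum(p * math.log(p) for p in probs)
    se_n = entropy / math.log(len(cluster_sizes))  # assumed normalization
    return csr, 1.0 - se_n
```

With toy 2-D "embeddings" such as `[[1, 0], [1, 0], [0, 1]]` at `tau=0.80`, the first two responses merge into a dominant cluster of size 2, giving CSR = 2/3 and a low stability score.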

Installation

uv add "alquimia-fair-forge[prompt-evaluator]"
uv add langchain-groq          # or your preferred LLM provider
uv add sentence-transformers   # for the default embedder

Basic Usage

from fair_forge.metrics.prompt_evaluator import PromptEvaluator
from fair_forge.embedders.sentence_transformer import SentenceTransformerEmbedder
from langchain_groq import ChatGroq
from your_retriever import MyRetriever

model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.7)
embedder = SentenceTransformerEmbedder(model_name="all-MiniLM-L6-v2")

metrics = PromptEvaluator.run(
    MyRetriever,
    model=model,
    seed_prompt="You are a helpful assistant. Answer using only the provided context.",
    embedder=embedder,
    k=10,
)

for m in metrics:
    print(f"Session: {m.session_id}")
    print(f"  CSR:       {m.csr:.3f}")      # Consistency
    print(f"  Stability: {m.stability:.3f}") # 1 - Semantic Entropy
    if m.rss is not None:
        print(f"  RSS:       {m.rss:.3f}")   # Reference similarity (auto)

Parameters

Required Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| retriever | Type[Retriever] | Data source class supplying the dataset |
| model | BaseChatModel | LangChain-compatible model — used as executor and optional JQ judge |
| seed_prompt | str | The system prompt under evaluation |
| embedder | Embedder | Embedding model for semantic clustering and RSS |

Optional Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| k | int | 10 | Number of samples generated per query |
| tau | float | 0.80 | Cosine similarity threshold for semantic clustering |
| constraints | list[Constraint] | None | Verifiable prompt constraints — activates ICR |
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| jq_enabled | bool | False | Enable LLM-as-judge scoring |
| objective | str | "" | Judge criteria — required when jq_enabled=True |
| executor | Executor | Default LLM invocation | Custom callable (prompt, query, context) → str |
| verbose | bool | False | Enable verbose logging |

Output Schema

PromptEvaluatorMetric

class PromptEvaluatorMetric(BaseMetric):
    session_id: str               # Dataset session identifier
    assistant_id: str             # Assistant identifier
    seed_prompt: str              # The evaluated prompt
    k: int                        # Samples per query
    tau: float                    # Clustering threshold used
    csr: float                    # Mean CSR over all queries
    stability: float              # Mean (1 - SE_n) over all queries
    rss: float | None             # Mean RSS — present when references available
    icr: float | None             # Mean ICR — present when constraints provided
    jq: float | None              # Mean JQ — present when jq_enabled=True
    n_queries: int                # Total queries evaluated
    interactions: list[QuerySampleMetrics]

QuerySampleMetrics

class QuerySampleMetrics(BaseModel):
    qa_id: str        # Query identifier
    k: int            # Samples generated
    csr: float        # CSR for this query
    stability: float  # Stability (1 - SE_n) for this query
    rss: float | None # RSS for this query (if reference available)
    icr: float | None # ICR for this query (if constraints provided)
    jq: float | None  # JQ for this query (if jq_enabled)
    n_clusters: int   # Number of semantic clusters found

Interpretation

CSR — Consistency/Stability Rate

| Value | Interpretation |
| --- | --- |
| 0.9–1.0 | Excellent — the prompt produces the same meaning almost every run |
| 0.7–0.9 | Good — occasional variation but a clear dominant response |
| 0.5–0.7 | Moderate — noticeable semantic scatter, review prompt clarity |
| < 0.5 | Poor — responses are scattered across many different meanings |

Stability — 1 - Semantic Entropy

| Value | Interpretation |
| --- | --- |
| 0.9–1.0 | Excellent — responses cluster tightly, very focused distribution |
| 0.7–0.9 | Good — distribution has a clear mode |
| 0.5–0.7 | Moderate — multiple competing meaning groups |
| < 0.5 | Poor — high entropy, unpredictable output distribution |

Comparing Two Prompts

| Prompt A | Prompt B | Conclusion |
| --- | --- | --- |
| CSR=0.9, Stability=0.85 | CSR=0.6, Stability=0.5 | A is more consistent and focused |
| CSR=0.9 (no RSS) | CSR=0.9, RSS=0.3 | A is consistent but may be consistently wrong |
| CSR=0.6, RSS=0.8 | CSR=0.9, RSS=0.4 | A's varied responses are closer to the reference |
High CSR + low RSS means the prompt is consistently producing the wrong answer. Always use RSS or JQ alongside CSR/Stability when correctness matters.

Clustering Threshold τ

The tau parameter controls how similar two responses must be to be considered semantically equivalent.
| Range | Behavior | When to use |
| --- | --- | --- |
| < 0.70 | Very permissive — groups unrelated responses together | Not recommended |
| 0.75–0.85 | Balanced — separates paraphrases from different content | Recommended default |
| 0.85–0.92 | Strict — only nearly identical responses cluster together | High-precision evaluations |
| > 0.92 | Very strict — almost all responses are separate clusters | Not recommended |
The recommended default is tau=0.80.
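
A toy illustration of this sensitivity, using made-up 2-D "embeddings" and a greedy single-link grouping (a simplification of the union-find clustering the evaluator performs): the two paraphrase-like vectors merge at tau=0.80 but split at tau=0.90.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def n_clusters(vectors, tau):
    """Greedy single-link grouping: join the first cluster within threshold."""
    clusters = []
    for v in vectors:
        for c in clusters:
            if any(cosine(v, m) >= tau for m in c):
                c.append(v)
                break
        else:
            clusters.append([v])
    return len(clusters)

# Two near-paraphrases (cosine ~0.86) and one unrelated response.
vectors = [[1.0, 0.6], [1.0, 0.0], [0.0, 1.0]]
print(n_clusters(vectors, tau=0.80))  # paraphrases merge: 2 clusters
print(n_clusters(vectors, tau=0.90))  # everything separate: 3 clusters
```

The same responses yield different cluster counts (and therefore different CSR and Stability values) purely as a function of τ, which is why τ should stay fixed when comparing prompts.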

Choosing K

| K | Quality | Cost | Recommendation |
| --- | --- | --- | --- |
| ≤ 3 | Unstable — mode can flip with a single different sample | Low | Sanity checks only |
| 5–10 | Reasonable for prompts with clear behavior | Moderate | Default |
| 20–40 | Stable — reliable CSR and SE estimates | High | Comparing similar prompts |
| > 40 | Minimal marginal gain | Very high | Rarely justified |
SE is more sensitive to K than CSR — use K ≥ 20 when expecting ambiguous or high-entropy prompts.
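
A quick Monte-Carlo sketch makes the sampling noise concrete (purely illustrative, not part of the library): draw K "meanings" from an assumed fixed three-cluster distribution and observe how widely the CSR estimate wanders at small K.

```python
import random

random.seed(0)
TRUE_DIST = [0.6, 0.3, 0.1]  # assumed probabilities of three meaning clusters

def estimate_csr(k):
    """Sample k cluster labels and return the dominant-cluster fraction."""
    counts = [0, 0, 0]
    for _ in range(k):
        counts[random.choices(range(3), weights=TRUE_DIST)[0]] += 1
    return max(counts) / k

for k in (3, 10, 40):
    runs = [estimate_csr(k) for _ in range(200)]
    spread = max(runs) - min(runs)
    print(f"K={k:>2}: CSR estimates span a range of {spread:.2f}")
```

At K=3 the estimate can swing between 1/3 and 1.0 for the very same underlying distribution; at K=40 the estimates concentrate near the true value of 0.6.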

Data Requirements

from fair_forge.core.retriever import Retriever
from fair_forge.schemas.common import Dataset, Batch

class MyRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="eval_session_001",
                assistant_id="support-bot-v2",
                language="english",
                context="Nexo offers three plans: Free (3 projects, 5 members), Pro ($15/month).",
                conversation=[
                    Batch(
                        qa_id="qa_001",
                        query="How much does the Pro plan cost?",
                        assistant="",                             # Not used by PromptEvaluator
                        ground_truth_assistant="$15 per user per month.",  # Enables RSS
                    ),
                    Batch(
                        qa_id="qa_002",
                        query="Is there a free plan?",
                        assistant="",
                        ground_truth_assistant="Yes, the Free plan supports up to 3 projects.",
                    ),
                ],
            )
        ]
The assistant field in each Batch is not used — PromptEvaluator generates its own responses by running seed_prompt against each query. ground_truth_assistant activates RSS automatically when present.

Statistical Modes

FrequentistMode (the default) computes CSR as n_dominant / K — a direct point estimate. Fast and simple.
# K=10, 7 responses in dominant cluster:
# CSR = 7/10 = 0.70
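
For contrast, here is a hedged sketch of how a Bayesian alternative could treat the same observation. The Beta(1, 1) prior below is an assumption for illustration; see the Statistical Modes page for the library's actual behavior.

```python
# Same observation: 7 of K=10 responses fall in the dominant cluster.
k, n_dominant = 10, 7

freq_csr = n_dominant / k                # direct point estimate: 0.70
bayes_csr = (n_dominant + 1) / (k + 2)   # Beta(1,1) posterior mean: ~0.667

print(f"frequentist={freq_csr:.3f}  bayesian={bayes_csr:.3f}")
```

The Bayesian estimate shrinks toward 0.5 at small K, which tempers confident conclusions drawn from few samples.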

Instruction Compliance Rate (ICR)

ICR measures whether the model’s responses satisfy programmatically verifiable constraints embedded in the prompt — things like responding in JSON, staying under a word limit, or including a required keyword. It is computed deterministically: no extra model calls.

When to use it

Use ICR when your prompt tells the model how to respond — not just what to say, but in what format, length, or structure. If your prompt contains instructions like "always respond in JSON", "keep your answer under 100 words", or "always include the word 'source'", those are verifiable constraints and ICR will measure whether the model follows them consistently. If your prompt has no explicit formatting rules, skip ICR — CSR and RSS already cover what you need.

Pass a list of Constraint objects via the constraints parameter. ICR activates automatically and appears in print(result).

Built-in Constraints

| Constraint | What it checks |
| --- | --- |
| JsonConstraint() | Response is valid JSON |
| WordCountConstraint(max_words) | Response has at most N words |
| KeywordConstraint(keyword) | Response contains the keyword (case-insensitive by default) |
| RegexConstraint(pattern) | Response matches a regular expression |
from fair_forge.metrics.constraints import JsonConstraint, KeywordConstraint, WordCountConstraint
from fair_forge.metrics.prompt_evaluator import PromptEvaluator
from fair_forge.embedders.sentence_transformer import SentenceTransformerEmbedder
from langchain_groq import ChatGroq
from your_retriever import MyRetriever

model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.7)
embedder = SentenceTransformerEmbedder(model_name="all-MiniLM-L6-v2")

metrics = PromptEvaluator.run(
    MyRetriever,
    model=model,
    seed_prompt=(
        "Answer in valid JSON with exactly two keys: 'answer' and 'confidence'. "
        "Keep your answer under 50 words."
    ),
    embedder=embedder,
    k=10,
    constraints=[
        JsonConstraint(),                  # must be valid JSON
        WordCountConstraint(max_words=50), # must stay concise
        KeywordConstraint("confidence"),   # must include the key
    ],
)

for m in metrics:
    print(m)  # ICR appears automatically in the output
ICR is averaged across K responses per query, then averaged across queries. A score of 0.8 means the model satisfies 80% of the provided constraints on average.
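
The averaging described above can be sketched deterministically. The checker functions below stand in for the library's Constraint objects (their real interface may differ):

```python
import json

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Three verifiable checks mirroring the constraints in the example above.
checks = [
    is_valid_json,
    lambda t: len(t.split()) <= 50,
    lambda t: "confidence" in t.lower(),
]

responses = [
    '{"answer": "The Pro plan costs $15.", "confidence": 0.9}',
    "The Pro plan costs $15 per month.",  # not JSON, no "confidence" key
]

# Per response: fraction of constraints satisfied; ICR: mean over responses.
icr = sum(
    sum(check(r) for check in checks) / len(checks) for r in responses
) / len(responses)
print(f"ICR = {icr:.2f}")
```

Here the first response passes all three checks (1.0) and the second passes only the word-count check (1/3), so ICR = (1.0 + 1/3) / 2 ≈ 0.67.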

Custom Executor

Replace the default LLM invocation with your own function — useful for systems with custom API wrappers, tool injection, or RAG pipelines:
def my_executor(prompt: str, query: str, context: str) -> str:
    # Your custom invocation logic here
    return my_system.call(system_prompt=prompt, user_input=query, context=context)

metrics = PromptEvaluator.run(
    MyRetriever,
    model=model,          # Still needed for JQ if enabled
    seed_prompt="...",
    embedder=embedder,
    executor=my_executor,
)

LLM Provider Options

from langchain_groq import ChatGroq

model = ChatGroq(model="llama-3.3-70b-versatile", api_key="your-api-key", temperature=0.7)

Best Practices

CSR and Stability are distributional metrics — they need variance across K samples. With temperature=0, every response is identical and CSR = 1.0 regardless of prompt quality. A temperature of 0.7 is a safe default for most providers.
RSS is the only signal that detects consistent incorrectness — a prompt that consistently answers wrong will have high CSR but low RSS. Always provide reference responses when correctness matters.
K=10 is sufficient for most evaluations. When comparing two prompts that differ subtly in wording, use K=20–40 so that small distributional differences are reliably detected.
Conversational or creative tasks benefit from a lower τ (0.75) since style variability is expected. Structured output tasks (JSON, SQL) benefit from a higher τ (0.85–0.90) since responses should be nearly identical when correct.
If your prompt instructs the model to respond in JSON, limit word count, or include specific keywords, encode these as constraints and pass them to PromptEvaluator. ICR will tell you whether the model reliably follows those rules — at zero extra cost since it is purely deterministic.
jq_enabled=True calls the judge model K times per query. With K=10 and 20 queries, that is 200 additional judge calls per evaluation. Reserve it for cases where RSS alone is insufficient (e.g., no reference responses or nuanced quality criteria).

Troubleshooting

CSR is always 1.0: the model temperature is likely 0. Set temperature=0.7 (or higher) on the model before passing it to PromptEvaluator.
RSS is missing from the results: check that ground_truth_assistant is not an empty string in your Batch objects. PromptEvaluator treats empty strings as absent references and skips RSS.
n_clusters equals K: all K responses are in separate clusters — every response has a completely different meaning. This indicates a severely underspecified or ambiguous prompt. Lower τ slightly (try 0.70) to confirm, or inspect the raw responses with verbose=True.
Evaluation is slow or expensive: each query makes K model calls. With K=10 and 50 queries that is 500 calls. Reduce K (try K=5) for fast iteration, then increase it for the final evaluation. You can also parallelize queries by supplying a custom executor.

Next Steps

Statistical Modes

Deep dive into Frequentist vs Bayesian approaches

Prompt Optimizer

Automatically improve prompts using GEPA or MIPROv2

Best Of

Compare multiple prompt variants head-to-head