# Vision Metrics
The Vision metrics evaluate how accurately a Vision Language Model (VLM) describes scenes compared to human-annotated ground truth. They use a pluggable SimilarityScorer to compare free-text descriptions — no structured output required. The default scorer uses cosine similarity between sentence embeddings (all-mpnet-base-v2).
Two metrics are available:
| Metric | What it measures |
|---|---|
| VisionSimilarity | Average semantic closeness between VLM descriptions and ground truth across all frames |
| VisionHallucination | How often the VLM describes something significantly different from what actually happened |
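The default scorer's comparison boils down to cosine similarity between embedding vectors. A minimal pure-Python sketch of that computation (toy 3-dimensional vectors stand in for real `all-mpnet-base-v2` sentence embeddings, which are 768-dimensional):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors over the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings standing in for sentence-transformer output.
vlm_vec = [0.9, 0.1, 0.0]    # embedding of the VLM description
truth_vec = [1.0, 0.0, 0.0]  # embedding of the ground-truth description
score = cosine_similarity(vlm_vec, truth_vec)  # close to 1.0: near-identical meaning
```

In the real scorer the two vectors come from embedding the descriptions with `all-mpnet-base-v2`; a score near 1.0 indicates semantically near-identical text.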
## Installation
## Basic Usage
## Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `retriever` | `Type[Retriever]` | — | Data source class |
| `threshold` | `float` | `0.75` | Similarity cutoff for hallucination detection |
| `scorer` | `SimilarityScorer` | `CosineSimilarity(SentenceTransformerEmbedder("all-mpnet-base-v2"))` | Strategy used to compare descriptions |
| `verbose` | `bool` | `False` | Enable verbose logging |
## Threshold Guide
The threshold drives only the VisionHallucination metric's flag logic. In VisionSimilarity, the `threshold` parameter is stored but does not affect the similarity scores: every frame is scored regardless.

## Data Requirements
Each `Batch` only needs two text fields:

- `assistant`: free-text description produced by the VLM
- `ground_truth_assistant`: human-annotated description of what actually happened
Other `Batch` fields (`agentic`, `ground_truth_agentic`, etc.) are ignored.
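As a concrete stand-in, a frame record with just those two text fields might look like this (a plain dataclass sketch for illustration; the library's actual `Batch` type carries more fields):

```python
from dataclasses import dataclass

@dataclass
class FrameBatch:
    """Illustrative stand-in holding only the fields the Vision metrics read."""
    qa_id: str                   # frame identifier, e.g. a timestamp or frame ID
    assistant: str               # free-text description produced by the VLM
    ground_truth_assistant: str  # human-annotated description

batch = FrameBatch(
    qa_id="cam1_frame_0042",
    assistant="A person walks through the lobby carrying a box.",
    ground_truth_assistant="A man crosses the lobby holding a package.",
)
```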
Each `Dataset` represents one camera session. Multiple sessions (e.g. different cameras) are returned as separate items in the list, and each produces its own metric result.

## Output Schema
### VisionSimilarityMetric

### VisionHallucinationMetric
### display()

Both metric types expose a `display()` method for quick inspection.
## Interpretation
### VisionSimilarity Scores
| Score | Interpretation |
|---|---|
| 0.90–1.00 | Near-identical descriptions — VLM is highly accurate |
| 0.75–0.89 | Semantically close — minor wording differences |
| 0.50–0.74 | Partial match — key details missing or altered |
| < 0.50 | Low similarity — VLM description diverges significantly |
### VisionHallucination Rate
| Rate | Interpretation |
|---|---|
| < 10% | ✅ Reliable — VLM describes scenes accurately |
| 10–30% | ⚠️ Review needed — some frames are missed or fabricated |
| > 30% | ❌ Unreliable — VLM frequently hallucinates scene content |
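The rate itself is just the fraction of frames scoring below the threshold. A sketch of that computation (function and variable names here are illustrative, not the library's API):

```python
def hallucination_rate(scores: list[float], threshold: float = 0.75) -> float:
    """Fraction of frames whose similarity to ground truth falls below the threshold."""
    flagged = [s for s in scores if s < threshold]
    return len(flagged) / len(scores)

frame_scores = [0.92, 0.88, 0.41, 0.79, 0.63]
rate = hallucination_rate(frame_scores)  # 2 of 5 frames below 0.75 -> 0.4
```

A rate of 0.4 (40%) would land in the "unreliable" band of the table above.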
## Best Practices
### Choose `qa_id` as a timestamp or frame ID

Use a meaningful `qa_id` such as `"2026-03-17T14:00:00Z"` or `"cam1_frame_0042"` so results are traceable back to the original footage.
### Tune the threshold per domain

Security surveillance descriptions tend to be terse and factual, so a threshold of `0.75` works well. For rich narrative descriptions (e.g. accessibility assistance), consider lowering it to `0.65` to allow more paraphrasing.
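The effect of the cutoff is easy to see on a fixed set of scores: the same frames flag very differently under 0.75 versus 0.65 (the scores below are made up for illustration):

```python
scores = [0.68, 0.72, 0.81, 0.90, 0.66]

strict = sum(s < 0.75 for s in scores) / len(scores)   # terse-domain cutoff
relaxed = sum(s < 0.65 for s in scores) / len(scores)  # narrative-domain cutoff
# strict flags 3 of 5 frames (0.6); relaxed flags none (0.0)
```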
### Run both metrics together

VisionSimilarity tells you the average quality across all frames; VisionHallucination tells you which specific frames are problematic. Run both to get a complete picture.
### Segment by session for per-camera analysis

Each `Dataset` maps to one session result. Use `session_id` to identify individual cameras, shifts, or recording periods and compare performance across them.

## Troubleshooting
### Similarity scores are lower than expected

The default model (`all-mpnet-base-v2`) is sensitive to domain-specific vocabulary. If your VLM uses technical terms that are not well represented in the model, pass a custom scorer with a domain-adapted embedder.
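The exact plug-in interface is defined by the library; as an illustration of the pattern only, here is a toy embedder and scorer pair in pure Python (the `encode`/`score` method names and the keyword-count embedding are assumptions for this sketch, not the real `SentenceTransformerEmbedder`):

```python
import math

class KeywordEmbedder:
    """Toy domain-adapted embedder: vector of domain-keyword counts (illustrative only)."""
    def __init__(self, vocabulary: list[str]):
        self.vocabulary = vocabulary

    def encode(self, text: str) -> list[float]:
        words = text.lower().split()
        return [float(words.count(term)) for term in self.vocabulary]

class CosineScorer:
    """Scores two descriptions by cosine similarity of their embeddings."""
    def __init__(self, embedder):
        self.embedder = embedder

    def score(self, a: str, b: str) -> float:
        va, vb = self.embedder.encode(a), self.embedder.encode(b)
        dot = sum(x * y for x, y in zip(va, vb))
        na = math.sqrt(sum(x * x for x in va)) or 1.0  # guard against zero vectors
        nb = math.sqrt(sum(x * x for x in vb)) or 1.0
        return dot / (na * nb)

scorer = CosineScorer(KeywordEmbedder(["forklift", "pallet", "operator"]))
s = scorer.score("operator moves a pallet", "an operator lifts the pallet")
```

A real domain-adapted embedder would wrap a fine-tuned sentence-transformer model rather than keyword counts; the point is only that the scorer is swappable.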
### All frames flagged as hallucinations

Your threshold may be too strict for the description style. Print the raw similarity scores from VisionSimilarity first to calibrate before running VisionHallucination.
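A quick way to calibrate is to summarize the raw score distribution before picking a cutoff (a generic sketch; in practice the scores would come from a VisionSimilarity run):

```python
def summarize(scores: list[float]) -> dict[str, float]:
    """Min / median / max of a score list, enough to sanity-check a threshold."""
    ordered = sorted(scores)
    return {
        "min": ordered[0],
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }

stats = summarize([0.62, 0.71, 0.88, 0.93, 0.79])
# If the median sits below your threshold, the threshold is flagging typical
# frames rather than outliers -- lower it before trusting the hallucination rate.
```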
### Model download is slow on first run

`all-mpnet-base-v2` (~420 MB) is downloaded from HuggingFace on the first run and cached locally; subsequent runs use the cache.

## Next Steps
- **Agentic Metric**: Evaluate tool use and multi-step reasoning in AI agents
- **Context Metric**: Measure how well responses align with a given context
- **AWS Lambda**: Deploy Vision metrics as a serverless function