Regulatory Metric
The Regulatory metric evaluates whether AI assistant responses comply with a regulatory corpus (e.g., company policies, legal frameworks, compliance documents). It accumulates per-interaction compliance scores and emits one session-level result. The interactions list preserves per-QA verdicts for auditing.
Overview
- Compliance Score: Weighted mean of per-interaction scores across the session (0.0–1.0)
- Verdict: Session-level COMPLIANT, NON_COMPLIANT, or IRRELEVANT, derived directly from the aggregated score for consistency
- Per-interaction detail: Each QA pair's verdict, chunks, and insight, accessible via interactions
- Bayesian mode: Bootstrapped credible interval around the session compliance score
How It Works
How the Verdict Is Determined
| Condition | Verdict |
|---|---|
| All interactions IRRELEVANT | IRRELEVANT |
| compliance_score >= compliance_threshold | COMPLIANT |
| compliance_score < compliance_threshold | NON_COMPLIANT |
The verdict is always derived from the session compliance_score, so the two always agree.
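The decision table above can be sketched as a small function. This is an illustration only, not the library's implementation: the function name session_verdict and its signature are hypothetical, and rejecting an empty session is an assumption, since the table does not cover that case.

```python
from typing import Sequence


def session_verdict(
    interaction_verdicts: Sequence[str],
    compliance_score: float,
    compliance_threshold: float = 0.5,
) -> str:
    """Hypothetical helper mirroring the verdict table (not the library API)."""
    # Assumption: an empty session is rejected; the table does not define it.
    if not interaction_verdicts:
        raise ValueError("at least one interaction is required")

    # Row 1: every interaction was IRRELEVANT.
    if all(v == "IRRELEVANT" for v in interaction_verdicts):
        return "IRRELEVANT"

    # Rows 2 and 3: compare the session score with the threshold.
    if compliance_score >= compliance_threshold:
        return "COMPLIANT"
    return "NON_COMPLIANT"
```

A score exactly equal to compliance_threshold yields COMPLIANT, because the table uses >= for that row.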
By default, the metric uses Qwen3-Embedding for semantic retrieval and Qwen3-Reranker for contradiction detection via the QwenEmbedder and QwenReranker implementations. You can swap these for any custom Embedder or Reranker implementation.
Installation
Basic Usage
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class returning conversations to evaluate |
| corpus_connector | CorpusConnector | Connector for loading regulatory documents |
| embedder | Embedder | Embedder instance for encoding documents and queries (e.g., QwenEmbedder) |
| reranker | Reranker | Reranker instance for scoring document-response alignment (e.g., QwenReranker) |
Optional Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| chunk_size | int | 1000 | Characters per chunk |
| chunk_overlap | int | 100 | Character overlap between chunks |
| top_k | int | 10 | Maximum chunks to retrieve per query |
| similarity_threshold | float | 0.3 | Minimum cosine similarity for retrieval |
| contradiction_threshold | float | 0.6 | Reranker score below which a chunk is classified as contradicting |
| compliance_threshold | float | 0.5 | Minimum session compliance score to emit COMPLIANT verdict |
| verbose | bool | False | Enable verbose logging |
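To make chunk_size and chunk_overlap concrete, here is a minimal sketch of character-based chunking with overlap. It assumes a plain sliding window over the text. The function chunk_text is hypothetical, and the metric's actual splitter may behave differently (for example, by respecting sentence or heading boundaries).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> List[str]:
    """Hypothetical sliding-window chunker (an assumption, not the library's splitter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new chunk starts (chunk_size - chunk_overlap) characters after the last.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, consecutive chunks share 100 characters, so a sentence that straddles a chunk boundary is less likely to be cut off in both chunks.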
Statistical Modes
- Frequentist
- Bayesian
Frequentist mode returns the weighted mean of per-interaction compliance scores. CI fields are None.
Interaction Weights
Each Batch can carry an optional weight to control its contribution to the session aggregate:
| Case | Behavior |
|---|---|
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted |
| No weights provided | Equal weights (1/n each) |
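The four cases in the table, together with the weighted mean used for the session score, can be sketched as follows. The names resolve_weights and weighted_mean are hypothetical, and None stands in for a missing weight. The table does not say what happens when partially provided weights already sum to more than 1.0, so raising an error in that case is an assumption.

```python
import math
import warnings
from typing import List, Optional, Sequence


def resolve_weights(weights: Sequence[Optional[float]]) -> List[float]:
    """Hypothetical resolver for the four weighting cases (None = no weight given)."""
    n = len(weights)
    if n == 0:
        return []
    provided = [w for w in weights if w is not None]

    if not provided:
        # No weights provided: equal weights (1/n each).
        return [1.0 / n] * n

    if len(provided) == n:
        if math.isclose(sum(provided), 1.0, abs_tol=1e-9):
            # All weights provided and summing to 1.0: used as-is.
            return list(provided)
        # All weights provided but not summing to 1.0: warn, fall back to equal.
        warnings.warn("weights do not sum to 1.0; applying equal weights")
        return [1.0 / n] * n

    # Some weights provided: split the remaining weight among the unweighted.
    remaining = 1.0 - sum(provided)
    if remaining < 0:
        # Assumption: the source does not define this case.
        raise ValueError("provided weights exceed 1.0")
    share = remaining / (n - len(provided))
    return [share if w is None else w for w in weights]


def weighted_mean(scores: Sequence[float], weights: Sequence[Optional[float]]) -> float:
    """Session score as the weighted mean of per-interaction scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    resolved = resolve_weights(weights)
    return sum(s * w for s, w in zip(scores, resolved))
```

For example, three interactions with weights 0.5, None, None resolve to 0.5, 0.25, 0.25, so the first interaction contributes half of the session score.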
Output Schema
RegulatoryMetric
RegulatoryInteraction
RegulatoryChunk
Corpus Connectors
LocalCorpusConnector
LakeFSCorpusConnector
Complete Example
Regulatory Corpus Format
Create markdown files in your corpus directory:
Interpretation
Compliance Scores
| Score Range | Interpretation |
|---|---|
| 0.9–1.0 | Excellent — strong regulatory support |
| 0.7–0.9 | Good — mostly compliant |
| 0.5–0.7 | Moderate — mixed signals, review recommended |
| 0.3–0.5 | Poor — potential compliance issues |
| 0.0–0.3 | Critical — clear regulatory violations |
Model Options
| Size | Embedding Model | Reranker Model |
|---|---|---|
| Small (0.6B) | Qwen/Qwen3-Embedding-0.6B | Qwen/Qwen3-Reranker-0.6B |
| Medium (4B) | Qwen/Qwen3-Embedding-4B | Qwen/Qwen3-Reranker-4B |
| Large (8B) | Qwen/Qwen3-Embedding-8B | Qwen/Qwen3-Reranker-8B |
Threshold Tuning
Similarity Threshold (0.3 default)
Controls which chunks are retrieved based on semantic similarity.
- Lower (0.2): Retrieves more chunks, may include less relevant ones
- Higher (0.5): Stricter, only highly relevant chunks
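The interaction between similarity_threshold and top_k can be sketched with plain cosine similarity over embedding vectors. This is a simplified illustration, not the metric's retrieval code: the retrieve function is hypothetical, the vectors are toy values rather than Qwen3 embeddings, and treating the threshold as inclusive (>=) is an assumption based on the word "minimum" in the parameter description.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    top_k: int = 10,
    similarity_threshold: float = 0.3,
) -> List[Tuple[int, float]]:
    """Hypothetical retrieval step: filter by threshold, then keep the top_k best."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    # Assumption: the threshold is inclusive.
    kept = [(i, s) for i, s in scored if s >= similarity_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Lowering the threshold lets weaker matches through, while top_k still caps how many chunks are passed on.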
Contradiction Threshold (0.6 default)
Controls how the reranker classifies chunks as SUPPORTS or CONTRADICTS.
- Lower (0.4): Stricter — more chunks classified as contradicting
- Higher (0.8): Lenient — only clear contradictions flagged
Compliance Threshold (0.5 default)
Minimum session compliance score to emit a COMPLIANT verdict.
- Lower (0.3): Lenient — sessions pass with fewer supporting chunks
- Higher (0.7): Strict — requires clear majority of supporting evidence
Troubleshooting
Getting Too Many IRRELEVANT Verdicts
Lower similarity_threshold to 0.2 or verify the corpus covers the topics being discussed.
Most Responses Marked NON_COMPLIANT
Raise contradiction_threshold to 0.7 or 0.8 and review whether the corpus is balanced (not just prohibitions).
Verdict and Compliance Score Disagree
This should not happen with the current implementation: verdict is always derived from compliance_score. If you see a discrepancy, file a bug report.
Out of Memory Errors
Use smaller models (0.6B), reduce batch_size to 16 or 8, or fall back to CPU.
Use Cases
Financial Compliance
Verify responses comply with banking regulations, KYC requirements, and financial advice rules
Healthcare HIPAA
Ensure patient data handling follows HIPAA guidelines
Call Center Policies
Check responses against company policies and consumer protection laws
Legal Compliance
Validate AI-generated legal content against jurisdiction-specific regulations
Next Steps
Statistical Modes
Frequentist vs Bayesian — when each matters
Context Metric
Evaluate response alignment with system context
AWS Lambda
Deploy Regulatory as a serverless function