
Regulatory Metric

The Regulatory metric evaluates whether AI assistant responses comply with a regulatory corpus (e.g., company policies, legal frameworks, compliance documents). It accumulates per-interaction compliance scores and emits one session-level result. The interactions list preserves per-QA verdicts for auditing.

Overview

  • Compliance Score: Weighted mean of per-interaction scores across the session (0.0–1.0)
  • Verdict: Session-level COMPLIANT, NON_COMPLIANT, or IRRELEVANT — derived directly from the aggregated score for consistency
  • Per-interaction detail: Each QA pair’s verdict, chunks, and insight accessible via interactions
  • Bayesian mode: Bootstrapped credible interval around the session compliance score

How It Works

1. Load regulatory corpus (markdown files)
2. Chunk documents with configurable size/overlap
3. For each interaction:
   a. Retrieve relevant chunks via embeddings (user query + agent response)
   b. Rerank chunks → classify as SUPPORTS or CONTRADICTS
   c. compliance_score = supporting / (supporting + contradicting)
   d. verdict = COMPLIANT / NON_COMPLIANT / IRRELEVANT
4. Session aggregate: weighted mean of per-interaction scores
5. Session verdict derived from the aggregated score
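The arithmetic in steps 3c–4 can be sketched in a few lines. This is an illustrative sketch, not the library's implementation; score_interaction and aggregate are hypothetical names, and the 0.6 default mirrors the contradiction_threshold parameter documented below:

```python
def score_interaction(reranker_scores, contradiction_threshold=0.6):
    """Per-interaction score: chunks scoring below the threshold count as contradicting."""
    supporting = sum(1 for s in reranker_scores if s >= contradiction_threshold)
    contradicting = len(reranker_scores) - supporting
    if supporting + contradicting == 0:
        return None  # no relevant chunks retrieved -> IRRELEVANT
    return supporting / (supporting + contradicting)

def aggregate(scores, weights):
    """Session aggregate: weighted mean over the relevant interactions."""
    relevant = [(s, w) for s, w in zip(scores, weights) if s is not None]
    if not relevant:
        return None
    return sum(s * w for s, w in relevant) / sum(w for _, w in relevant)
```

With two supporting and one contradicting chunk, score_interaction returns 2/3; interactions with no retrieved chunks drop out of the weighted mean entirely.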

How the Verdict Is Determined

| Condition | Verdict |
|---|---|
| All interactions IRRELEVANT | IRRELEVANT |
| compliance_score >= compliance_threshold | COMPLIANT |
| compliance_score < compliance_threshold | NON_COMPLIANT |
The session verdict is always derived from the session compliance_score — they always agree.
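Expressed as code, the verdict logic amounts to the following (a hypothetical sketch; equal weights are used for brevity, whereas the metric uses the weighted mean described above):

```python
def session_verdict(interaction_scores, compliance_threshold=0.5):
    """interaction_scores: per-QA scores, with None marking IRRELEVANT interactions."""
    relevant = [s for s in interaction_scores if s is not None]
    if not relevant:
        return "IRRELEVANT"  # every interaction was irrelevant
    score = sum(relevant) / len(relevant)
    return "COMPLIANT" if score >= compliance_threshold else "NON_COMPLIANT"
```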
By default, the metric uses Qwen3-Embedding for semantic retrieval and Qwen3-Reranker for contradiction detection via the QwenEmbedder and QwenReranker implementations. You can swap these for any custom Embedder or Reranker implementation.

Installation

uv add "alquimia-fair-forge[regulatory]"

Basic Usage

from fair_forge.connectors import LocalCorpusConnector
from fair_forge.embedders import QwenEmbedder
from fair_forge.metrics.regulatory import Regulatory
from fair_forge.rerankers import QwenReranker
from your_retriever import ConversationRetriever

corpus_connector = LocalCorpusConnector("path/to/regulations/")

metrics = Regulatory.run(
    ConversationRetriever,
    corpus_connector=corpus_connector,
    embedder=QwenEmbedder(),
    reranker=QwenReranker(),
)

for metric in metrics:
    print(f"Session: {metric.session_id}")
    print(f"  Verdict:           {metric.verdict}")
    print(f"  Compliance score:  {metric.compliance_score:.1%}")
    print(f"  Interactions:      {metric.n_interactions}")
    print(f"  Total supporting:  {metric.total_supporting_chunks}")
    print(f"  Total contradicting: {metric.total_contradicting_chunks}")

    for interaction in metric.interactions:
        icon = {"COMPLIANT": "✅", "NON_COMPLIANT": "❌", "IRRELEVANT": "⚠️"}[interaction.verdict]
        print(f"  {icon} [{interaction.qa_id}] {interaction.verdict}  score={interaction.compliance_score:.1%}")

Parameters

Required Parameters

| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class returning conversations to evaluate |
| corpus_connector | CorpusConnector | Connector for loading regulatory documents |
| embedder | Embedder | Embedder instance for encoding documents and queries (e.g., QwenEmbedder) |
| reranker | Reranker | Reranker instance for scoring document-response alignment (e.g., QwenReranker) |

Optional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| chunk_size | int | 1000 | Characters per chunk |
| chunk_overlap | int | 100 | Character overlap between chunks |
| top_k | int | 10 | Maximum chunks to retrieve per query |
| similarity_threshold | float | 0.3 | Minimum cosine similarity for retrieval |
| contradiction_threshold | float | 0.6 | Reranker score below which a chunk is classified as contradicting |
| compliance_threshold | float | 0.5 | Minimum session compliance score to emit COMPLIANT verdict |
| verbose | bool | False | Enable verbose logging |

Statistical Modes

FrequentistMode (the default) returns the weighted mean of per-interaction compliance scores; the credible-interval fields are None:
metric.compliance_score          # 0.78
metric.compliance_score_ci_low   # None
metric.compliance_score_ci_high  # None

BayesianMode instead reports a bootstrapped credible interval around the session compliance score, populating compliance_score_ci_low and compliance_score_ci_high.
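The Bayesian credible interval can be approximated with a simple weighted bootstrap like the one below. This is an illustrative sketch only; the actual resampling scheme inside BayesianMode may differ:

```python
import random

def bootstrap_ci(scores, weights, mc_samples=5000, ci_level=0.95, seed=0):
    """Resample interactions with replacement and take percentile bounds."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(mc_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        w_total = sum(weights[i] for i in idx)
        means.append(sum(scores[i] * weights[i] for i in idx) / w_total)
    means.sort()
    lo = means[int((1 - ci_level) / 2 * mc_samples)]
    hi = means[int((1 + ci_level) / 2 * mc_samples) - 1]
    return lo, hi
```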

Interaction Weights

Each Batch can carry an optional weight to control its contribution to the session aggregate:
# Weight high-risk interactions more heavily
Batch(qa_id="billing_dispute",  ..., weight=0.5),   # Critical compliance area
Batch(qa_id="general_inquiry",  ..., weight=0.3),
Batch(qa_id="small_talk",       ..., weight=0.2),
| Case | Behavior |
|---|---|
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted |
| No weights provided | Equal weights (1/n each) |
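The resolution rules in the table can be sketched as follows (resolve_weights is a hypothetical helper mirroring the table, not part of the library's API):

```python
import warnings

def resolve_weights(weights):
    """weights: list of float or None, one entry per interaction."""
    n = len(weights)
    provided = [w for w in weights if w is not None]
    if len(provided) == n:
        if abs(sum(provided) - 1.0) < 1e-9:
            return weights  # all provided and sum to 1.0: used as-is
        warnings.warn("weights do not sum to 1.0; applying equal weights")
        return [1.0 / n] * n
    # split the remaining mass equally among unweighted interactions
    fill = (1.0 - sum(provided)) / (n - len(provided))
    return [w if w is not None else fill for w in weights]
```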

Output Schema

RegulatoryMetric

class RegulatoryMetric(BaseMetric):
    session_id: str
    assistant_id: str
    n_interactions: int                    # Number of interactions evaluated
    compliance_score: float                # Weighted mean compliance score (0.0-1.0)
    compliance_score_ci_low: float | None  # Lower credible bound — Bayesian only
    compliance_score_ci_high: float | None # Upper credible bound — Bayesian only
    verdict: Literal["COMPLIANT", "NON_COMPLIANT", "IRRELEVANT"]
    total_supporting_chunks: int           # Sum across all interactions
    total_contradicting_chunks: int        # Sum across all interactions
    interactions: list[RegulatoryInteraction]  # Per-QA detail

RegulatoryInteraction

class RegulatoryInteraction(BaseModel):
    qa_id: str
    query: str
    assistant: str
    compliance_score: float                # supporting / (supporting + contradicting)
    verdict: Literal["COMPLIANT", "NON_COMPLIANT", "IRRELEVANT"]
    supporting_chunks: int
    contradicting_chunks: int
    retrieved_chunks: list[RegulatoryChunk]  # Chunk-level evidence
    insight: str                             # Human-readable explanation

RegulatoryChunk

class RegulatoryChunk(BaseModel):
    text: str              # Chunk content
    source: str            # Source document filename
    chunk_index: int       # Position in source document
    similarity: float      # Cosine similarity from retrieval (0-1)
    reranker_score: float  # Reranker score (higher = supports)
    verdict: Literal["SUPPORTS", "CONTRADICTS"]

Corpus Connectors

LocalCorpusConnector

from fair_forge.connectors import LocalCorpusConnector

connector = LocalCorpusConnector("path/to/corpus/")
documents = connector.load_documents()
print(f"Loaded {len(documents)} documents")

LakeFSCorpusConnector

from fair_forge.connectors.lakefs import LakeFSCorpusConnector

connector = LakeFSCorpusConnector(
    host="https://lakefs.example.com",
    username="your-username",
    password="your-password",
    repo_id="regulations",
    corpus_prefix="compliance/",
    branch_name="main",
)

Complete Example

from fair_forge.connectors import LocalCorpusConnector
from fair_forge.core.retriever import Retriever
from fair_forge.embedders import QwenEmbedder
from fair_forge.metrics.regulatory import Regulatory
from fair_forge.rerankers import QwenReranker
from fair_forge.statistical import BayesianMode
from fair_forge.schemas.common import Dataset, Batch

class ComplianceRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="compliance_test_001",
                assistant_id="callcenter_bot",
                language="english",
                context="Call center regulatory compliance",
                conversation=[
                    Batch(
                        qa_id="qa_001",
                        query="I don't want any more calls from you!",
                        assistant="I've immediately added your number to our do-not-call list.",
                        ground_truth_assistant="Add to do-not-call list",
                        weight=0.6,  # High-stakes interaction
                    ),
                    Batch(
                        qa_id="qa_002",
                        query="Can I get a refund? I bought this 45 days ago.",
                        assistant="Our standard refund policy is 30 days. I can offer store credit instead.",
                        ground_truth_assistant="Explain 30-day limit, offer alternatives",
                        weight=0.4,
                    ),
                ],
            ),
        ]

corpus_connector = LocalCorpusConnector("./regulations/")

metrics = Regulatory.run(
    ComplianceRetriever,
    corpus_connector=corpus_connector,
    embedder=QwenEmbedder(),
    reranker=QwenReranker(),
    statistical_mode=BayesianMode(mc_samples=5000, ci_level=0.95),
    compliance_threshold=0.5,
    verbose=True,
)

for metric in metrics:
    print(f"Session: {metric.session_id}")
    ci = f"  [{metric.compliance_score_ci_low:.1%}, {metric.compliance_score_ci_high:.1%}]" \
         if metric.compliance_score_ci_low is not None else ""
    print(f"  Compliance: {metric.compliance_score:.1%}{ci}")
    print(f"  Verdict:    {metric.verdict}")
    print(f"  Evidence:   {metric.total_supporting_chunks} supporting, {metric.total_contradicting_chunks} contradicting")
    print()

    icon = {"COMPLIANT": "✅", "NON_COMPLIANT": "❌", "IRRELEVANT": "⚠️"}
    for interaction in metric.interactions:
        print(f"  {icon[interaction.verdict]} [{interaction.qa_id}] {interaction.verdict}  score={interaction.compliance_score:.1%}")
        print(f"     {interaction.insight}")
        for chunk in interaction.retrieved_chunks[:2]:
            print(f"     [{chunk.verdict}] {chunk.source}  sim={chunk.similarity:.2f}  rerank={chunk.reranker_score:.2f}")

Regulatory Corpus Format

Create markdown files in your corpus directory:
# Call Center Policy

## Do-Not-Call Regulations

### Customer Rights
- All customers have the right to request removal from call lists at any time
- Requests must be honored within 24 hours of receipt
- No calls between 9:00 PM and 8:00 AM local time

## Refund Policy
- Full refunds available within 30 days of purchase with original receipt
- Refunds processed within 5-7 business days

Interpretation

Compliance Scores

| Score Range | Interpretation |
|---|---|
| 0.9–1.0 | Excellent — strong regulatory support |
| 0.7–0.9 | Good — mostly compliant |
| 0.5–0.7 | Moderate — mixed signals, review recommended |
| 0.3–0.5 | Poor — potential compliance issues |
| 0.0–0.3 | Critical — clear regulatory violations |
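For reporting, the bands can be mapped with a tiny helper (the labels come from the table above; the function itself is not part of the library):

```python
def interpret(score):
    """Map a compliance score to the interpretation band above."""
    bands = [(0.9, "Excellent"), (0.7, "Good"), (0.5, "Moderate"), (0.3, "Poor")]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "Critical"
```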

Model Options

| Size | Embedding Model | Reranker Model |
|---|---|---|
| Small (0.6B) | Qwen/Qwen3-Embedding-0.6B | Qwen/Qwen3-Reranker-0.6B |
| Medium (4B) | Qwen/Qwen3-Embedding-4B | Qwen/Qwen3-Reranker-4B |
| Large (8B) | Qwen/Qwen3-Embedding-8B | Qwen/Qwen3-Reranker-8B |

Threshold Tuning

similarity_threshold: controls which chunks are retrieved based on semantic similarity.
  • Lower (0.2): Retrieves more chunks, may include less relevant ones
  • Higher (0.5): Stricter, only highly relevant chunks
Lower this if you’re getting too many IRRELEVANT verdicts.

contradiction_threshold: controls how the reranker classifies chunks as SUPPORTS or CONTRADICTS; a chunk whose reranker score falls below the threshold counts as contradicting.
  • Lower (0.4): Lenient — only clearly opposing chunks are flagged as contradicting
  • Higher (0.8): Stricter — more borderline chunks are classified as contradicting
Raise this for stricter compliance checking.

compliance_threshold: minimum session compliance score to emit a COMPLIANT verdict.
  • Lower (0.3): Lenient — sessions pass with fewer supporting chunks
  • Higher (0.7): Strict — requires a clear majority of supporting evidence
Raise this for high-stakes regulatory environments.
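To see how contradiction_threshold shifts classification in practice (per the RegulatoryChunk schema, a chunk whose reranker score falls below the threshold counts as CONTRADICTS), here is a quick sketch with made-up reranker scores:

```python
def split(reranker_scores, threshold):
    """Return (supporting, contradicting) counts at a given threshold."""
    contradicting = sum(1 for s in reranker_scores if s < threshold)
    return len(reranker_scores) - contradicting, contradicting

scores = [0.9, 0.75, 0.55, 0.35]  # hypothetical reranker scores
print(split(scores, 0.4))  # lenient: (3, 1)
print(split(scores, 0.8))  # strict:  (1, 3)
```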

Troubleshooting

Too many IRRELEVANT verdicts: Lower similarity_threshold to 0.2 or verify the corpus covers the topics being discussed.
Too many chunks flagged as contradicting: Lower contradiction_threshold below its 0.6 default (chunks only count as contradicting when their reranker score falls below it) and review whether the corpus is balanced (not just prohibitions).
Verdict disagrees with the compliance score: This should not happen with the current implementation — the verdict is always derived from compliance_score. If you see a discrepancy, file a bug report.
Out-of-memory errors: Use smaller models (0.6B), reduce batch_size to 16 or 8, or fall back to CPU.

Use Cases

Financial Compliance

Verify responses comply with banking regulations, KYC requirements, and financial advice rules

Healthcare HIPAA

Ensure patient data handling follows HIPAA guidelines

Call Center Policies

Check responses against company policies and consumer protection laws

Legal Compliance

Validate AI-generated legal content against jurisdiction-specific regulations

Next Steps

Statistical Modes

Frequentist vs Bayesian — when each matters

Context Metric

Evaluate response alignment with system context

AWS Lambda

Deploy Regulatory as a serverless function