
Regulatory Metric

The Regulatory metric evaluates whether AI assistant responses comply with a regulatory corpus (e.g., company policies, legal frameworks, compliance documents). It accumulates per-interaction compliance scores and emits one session-level result. The interactions list preserves per-QA verdicts for auditing.

Overview

  • Compliance Score: Weighted mean of per-interaction scores across the session (0.0–1.0)
  • Verdict: Session-level COMPLIANT, NON_COMPLIANT, or IRRELEVANT — derived directly from the aggregated score for consistency
  • Per-interaction detail: Each QA pair’s verdict, chunks, and insight accessible via interactions
  • Bayesian mode: Bootstrapped credible interval around the session compliance score

How It Works

1. Load regulatory corpus (markdown files)
2. Chunk documents with configurable size/overlap
3. For each interaction:
   a. Retrieve relevant chunks via embeddings (user query + agent response)
   b. Rerank chunks → classify as SUPPORTS or CONTRADICTS
   c. compliance_score = supporting / (supporting + contradicting)
   d. verdict = COMPLIANT / NON_COMPLIANT / IRRELEVANT
4. Session aggregate: weighted mean of per-interaction scores
5. Session verdict derived from the aggregated score
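The arithmetic in steps 3c–4 can be sketched in a few lines. This is an illustrative sketch, not the library's implementation; score_interaction and aggregate are hypothetical names, and the 0.6 default mirrors the contradiction_threshold parameter documented below:

```python
def score_interaction(reranker_scores, contradiction_threshold=0.6):
    """Per-interaction score: chunks scoring below the threshold count as contradicting."""
    supporting = sum(1 for s in reranker_scores if s >= contradiction_threshold)
    contradicting = len(reranker_scores) - supporting
    if supporting + contradicting == 0:
        return None  # no relevant chunks retrieved -> IRRELEVANT
    return supporting / (supporting + contradicting)

def aggregate(scores, weights):
    """Session aggregate: weighted mean over the relevant interactions."""
    relevant = [(s, w) for s, w in zip(scores, weights) if s is not None]
    if not relevant:
        return None
    return sum(s * w for s, w in relevant) / sum(w for _, w in relevant)
```

With two supporting and one contradicting chunk, score_interaction returns 2/3; interactions with no retrieved chunks drop out of the weighted mean entirely.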

How the Verdict Is Determined

| Condition | Verdict |
|---|---|
| All interactions IRRELEVANT | IRRELEVANT |
| compliance_score >= compliance_threshold | COMPLIANT |
| compliance_score < compliance_threshold | NON_COMPLIANT |
The session verdict is always derived from the session compliance_score — they always agree.
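Expressed as code, the verdict logic amounts to the following (a hypothetical sketch; equal weights are used for brevity, whereas the metric uses the weighted mean described above):

```python
def session_verdict(interaction_scores, compliance_threshold=0.5):
    """interaction_scores: per-QA scores, with None marking IRRELEVANT interactions."""
    relevant = [s for s in interaction_scores if s is not None]
    if not relevant:
        return "IRRELEVANT"  # every interaction was irrelevant
    score = sum(relevant) / len(relevant)
    return "COMPLIANT" if score >= compliance_threshold else "NON_COMPLIANT"
```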
By default, the metric uses Qwen3-Embedding for semantic retrieval and Qwen3-Reranker for contradiction detection via the QwenEmbedder and QwenReranker implementations. You can swap these for any custom Embedder or Reranker implementation.

Installation

uv add "alquimia-fair-forge[regulatory]"

Basic Usage

from fair_forge.connectors import LocalCorpusConnector
from fair_forge.embedders import QwenEmbedder
from fair_forge.metrics.regulatory import Regulatory
from fair_forge.rerankers import QwenReranker
from your_retriever import ConversationRetriever

corpus_connector = LocalCorpusConnector("path/to/regulations/")

metrics = Regulatory.run(
    ConversationRetriever,
    corpus_connector=corpus_connector,
    embedder=QwenEmbedder(),
    reranker=QwenReranker(),
)

for metric in metrics:
    print(f"Session: {metric.session_id}")
    print(f"  Verdict:           {metric.verdict}")
    print(f"  Compliance score:  {metric.compliance_score:.1%}")
    print(f"  Interactions:      {metric.n_interactions}")
    print(f"  Total supporting:  {metric.total_supporting_chunks}")
    print(f"  Total contradicting: {metric.total_contradicting_chunks}")

    for interaction in metric.interactions:
        icon = {"COMPLIANT": "✅", "NON_COMPLIANT": "❌", "IRRELEVANT": "⚠️"}[interaction.verdict]
        print(f"  {icon} [{interaction.qa_id}] {interaction.verdict}  score={interaction.compliance_score:.1%}")

Parameters

Required Parameters

| Parameter | Type | Description |
|---|---|---|
| retriever | Type[Retriever] | Data source class returning conversations to evaluate |
| corpus_connector | CorpusConnector | Connector for loading regulatory documents |
| embedder | Embedder | Embedder instance for encoding documents and queries (e.g., QwenEmbedder) |
| reranker | Reranker | Reranker instance for scoring document-response alignment (e.g., QwenReranker) |

Optional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| statistical_mode | StatisticalMode | FrequentistMode() | Statistical computation mode |
| chunk_size | int | 1000 | Characters per chunk |
| chunk_overlap | int | 100 | Character overlap between chunks |
| top_k | int | 10 | Maximum chunks to retrieve per query |
| similarity_threshold | float | 0.3 | Minimum cosine similarity for retrieval |
| contradiction_threshold | float | 0.6 | Reranker score below which a chunk is classified as contradicting |
| compliance_threshold | float | 0.5 | Minimum session compliance score to emit COMPLIANT verdict |
| verbose | bool | False | Enable verbose logging |

Statistical Modes

FrequentistMode (the default) returns the weighted mean of per-interaction compliance scores; the credible-interval fields are None:
metric.compliance_score          # 0.78
metric.compliance_score_ci_low   # None
metric.compliance_score_ci_high  # None

BayesianMode instead reports a bootstrapped credible interval around the session compliance score, populating compliance_score_ci_low and compliance_score_ci_high.
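The Bayesian credible interval can be approximated with a simple weighted bootstrap like the one below. This is an illustrative sketch only; the actual resampling scheme inside BayesianMode may differ:

```python
import random

def bootstrap_ci(scores, weights, mc_samples=5000, ci_level=0.95, seed=0):
    """Resample interactions with replacement and take percentile bounds."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(mc_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        w_total = sum(weights[i] for i in idx)
        means.append(sum(scores[i] * weights[i] for i in idx) / w_total)
    means.sort()
    lo = means[int((1 - ci_level) / 2 * mc_samples)]
    hi = means[int((1 + ci_level) / 2 * mc_samples) - 1]
    return lo, hi
```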

Interaction Weights

Each Batch can carry an optional weight to control its contribution to the session aggregate:
# Weight high-risk interactions more heavily
Batch(qa_id="billing_dispute",  ..., weight=0.5),   # Critical compliance area
Batch(qa_id="general_inquiry",  ..., weight=0.3),
Batch(qa_id="small_talk",       ..., weight=0.2),
| Case | Behavior |
|---|---|
| All weights provided, sum = 1.0 | Used as-is |
| All weights provided, sum ≠ 1.0 | Warning emitted, equal weights applied |
| Some weights provided | Remaining weight split equally among unweighted |
| No weights provided | Equal weights (1/n each) |
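The resolution rules in the table can be sketched as follows (resolve_weights is a hypothetical helper mirroring the table, not part of the library's API):

```python
import warnings

def resolve_weights(weights):
    """weights: list of float or None, one entry per interaction."""
    n = len(weights)
    provided = [w for w in weights if w is not None]
    if len(provided) == n:
        if abs(sum(provided) - 1.0) < 1e-9:
            return weights  # all provided and sum to 1.0: used as-is
        warnings.warn("weights do not sum to 1.0; applying equal weights")
        return [1.0 / n] * n
    # split the remaining mass equally among unweighted interactions
    fill = (1.0 - sum(provided)) / (n - len(provided))
    return [w if w is not None else fill for w in weights]
```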

Output Schema

RegulatoryMetric

class RegulatoryMetric(BaseMetric):
    session_id: str
    assistant_id: str
    n_interactions: int                    # Number of interactions evaluated
    compliance_score: float                # Weighted mean compliance score (0.0-1.0)
    compliance_score_ci_low: float | None  # Lower credible bound — Bayesian only
    compliance_score_ci_high: float | None # Upper credible bound — Bayesian only
    verdict: Literal["COMPLIANT", "NON_COMPLIANT", "IRRELEVANT"]
    total_supporting_chunks: int           # Sum across all interactions
    total_contradicting_chunks: int        # Sum across all interactions
    interactions: list[RegulatoryInteraction]  # Per-QA detail

RegulatoryInteraction

class RegulatoryInteraction(BaseModel):
    qa_id: str
    query: str
    assistant: str
    compliance_score: float                # supporting / (supporting + contradicting)
    verdict: Literal["COMPLIANT", "NON_COMPLIANT", "IRRELEVANT"]
    supporting_chunks: int
    contradicting_chunks: int
    retrieved_chunks: list[RegulatoryChunk]  # Chunk-level evidence
    insight: str                             # Human-readable explanation

RegulatoryChunk

class RegulatoryChunk(BaseModel):
    text: str              # Chunk content
    source: str            # Source document filename
    chunk_index: int       # Position in source document
    similarity: float      # Cosine similarity from retrieval (0-1)
    reranker_score: float  # Reranker score (higher = supports)
    verdict: Literal["SUPPORTS", "CONTRADICTS"]

Corpus Connectors

LocalCorpusConnector

from fair_forge.connectors import LocalCorpusConnector

connector = LocalCorpusConnector("path/to/corpus/")
documents = connector.load_documents()
print(f"Loaded {len(documents)} documents")

LakeFSCorpusConnector

from fair_forge.connectors.lakefs import LakeFSCorpusConnector

connector = LakeFSCorpusConnector(
    host="https://lakefs.example.com",
    username="your-username",
    password="your-password",
    repo_id="regulations",
    corpus_prefix="compliance/",
    branch_name="main",
)

Complete Example

from fair_forge.connectors import LocalCorpusConnector
from fair_forge.core.retriever import Retriever
from fair_forge.embedders import QwenEmbedder
from fair_forge.metrics.regulatory import Regulatory
from fair_forge.rerankers import QwenReranker
from fair_forge.statistical import BayesianMode
from fair_forge.schemas.common import Dataset, Batch

class ComplianceRetriever(Retriever):
    def load_dataset(self) -> list[Dataset]:
        return [
            Dataset(
                session_id="compliance_test_001",
                assistant_id="callcenter_bot",
                language="english",
                context="Call center regulatory compliance",
                conversation=[
                    Batch(
                        qa_id="qa_001",
                        query="I don't want any more calls from you!",
                        assistant="I've immediately added your number to our do-not-call list.",
                        ground_truth_assistant="Add to do-not-call list",
                        weight=0.6,  # High-stakes interaction
                    ),
                    Batch(
                        qa_id="qa_002",
                        query="Can I get a refund? I bought this 45 days ago.",
                        assistant="Our standard refund policy is 30 days. I can offer store credit instead.",
                        ground_truth_assistant="Explain 30-day limit, offer alternatives",
                        weight=0.4,
                    ),
                ],
            ),
        ]

corpus_connector = LocalCorpusConnector("./regulations/")

metrics = Regulatory.run(
    ComplianceRetriever,
    corpus_connector=corpus_connector,
    embedder=QwenEmbedder(),
    reranker=QwenReranker(),
    statistical_mode=BayesianMode(mc_samples=5000, ci_level=0.95),
    compliance_threshold=0.5,
    verbose=True,
)

for metric in metrics:
    print(f"Session: {metric.session_id}")
    ci = f"  [{metric.compliance_score_ci_low:.1%}, {metric.compliance_score_ci_high:.1%}]" \
         if metric.compliance_score_ci_low is not None else ""
    print(f"  Compliance: {metric.compliance_score:.1%}{ci}")
    print(f"  Verdict:    {metric.verdict}")
    print(f"  Evidence:   {metric.total_supporting_chunks} supporting, {metric.total_contradicting_chunks} contradicting")
    print()

    icon = {"COMPLIANT": "✅", "NON_COMPLIANT": "❌", "IRRELEVANT": "⚠️"}
    for interaction in metric.interactions:
        print(f"  {icon[interaction.verdict]} [{interaction.qa_id}] {interaction.verdict}  score={interaction.compliance_score:.1%}")
        print(f"     {interaction.insight}")
        for chunk in interaction.retrieved_chunks[:2]:
            print(f"     [{chunk.verdict}] {chunk.source}  sim={chunk.similarity:.2f}  rerank={chunk.reranker_score:.2f}")

Regulatory Corpus Format

Create markdown files in your corpus directory:
# Call Center Policy

## Do-Not-Call Regulations

### Customer Rights
- All customers have the right to request removal from call lists at any time
- Requests must be honored within 24 hours of receipt
- No calls between 9:00 PM and 8:00 AM local time

## Refund Policy
- Full refunds available within 30 days of purchase with original receipt
- Refunds processed within 5-7 business days

Interpretation

Compliance Scores

| Score Range | Interpretation |
|---|---|
| 0.9–1.0 | Excellent — strong regulatory support |
| 0.7–0.9 | Good — mostly compliant |
| 0.5–0.7 | Moderate — mixed signals, review recommended |
| 0.3–0.5 | Poor — potential compliance issues |
| 0.0–0.3 | Critical — clear regulatory violations |
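For reporting, the bands can be mapped with a tiny helper (the labels come from the table above; the function itself is not part of the library):

```python
def interpret(score):
    """Map a compliance score to the interpretation band above."""
    bands = [(0.9, "Excellent"), (0.7, "Good"), (0.5, "Moderate"), (0.3, "Poor")]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "Critical"
```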

Model Options

| Size | Embedding Model | Reranker Model |
|---|---|---|
| Small (0.6B) | Qwen/Qwen3-Embedding-0.6B | Qwen/Qwen3-Reranker-0.6B |
| Medium (4B) | Qwen/Qwen3-Embedding-4B | Qwen/Qwen3-Reranker-4B |
| Large (8B) | Qwen/Qwen3-Embedding-8B | Qwen/Qwen3-Reranker-8B |

Threshold Tuning

similarity_threshold: controls which chunks are retrieved based on semantic similarity.
  • Lower (0.2): Retrieves more chunks, may include less relevant ones
  • Higher (0.5): Stricter, only highly relevant chunks
Lower this if you’re getting too many IRRELEVANT verdicts.

contradiction_threshold: controls how the reranker classifies chunks as SUPPORTS or CONTRADICTS; a chunk whose reranker score falls below the threshold counts as contradicting.
  • Lower (0.4): Lenient — only clearly opposing chunks are flagged as contradicting
  • Higher (0.8): Stricter — more borderline chunks are classified as contradicting
Raise this for stricter compliance checking.

compliance_threshold: minimum session compliance score to emit a COMPLIANT verdict.
  • Lower (0.3): Lenient — sessions pass with fewer supporting chunks
  • Higher (0.7): Strict — requires a clear majority of supporting evidence
Raise this for high-stakes regulatory environments.
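To see how contradiction_threshold shifts classification in practice (per the RegulatoryChunk schema, a chunk whose reranker score falls below the threshold counts as CONTRADICTS), here is a quick sketch with made-up reranker scores:

```python
def split(reranker_scores, threshold):
    """Return (supporting, contradicting) counts at a given threshold."""
    contradicting = sum(1 for s in reranker_scores if s < threshold)
    return len(reranker_scores) - contradicting, contradicting

scores = [0.9, 0.75, 0.55, 0.35]  # hypothetical reranker scores
print(split(scores, 0.4))  # lenient: (3, 1)
print(split(scores, 0.8))  # strict:  (1, 3)
```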

Troubleshooting

Too many IRRELEVANT verdicts: Lower similarity_threshold to 0.2 or verify the corpus covers the topics being discussed.
Too many chunks flagged as contradicting: Lower contradiction_threshold below its 0.6 default (chunks only count as contradicting when their reranker score falls below it) and review whether the corpus is balanced (not just prohibitions).
Verdict disagrees with the compliance score: This should not happen with the current implementation — the verdict is always derived from compliance_score. If you see a discrepancy, file a bug report.
Out-of-memory errors: Use smaller models (0.6B), reduce batch_size to 16 or 8, or fall back to CPU.

Use Cases

Financial Compliance

Verify responses comply with banking regulations, KYC requirements, and financial advice rules

Healthcare HIPAA

Ensure patient data handling follows HIPAA guidelines

Call Center Policies

Check responses against company policies and consumer protection laws

Legal Compliance

Validate AI-generated legal content against jurisdiction-specific regulations

Next Steps

Statistical Modes

Frequentist vs Bayesian — when each matters

Context Metric

Evaluate response alignment with system context

AWS Lambda

Deploy Regulatory as a serverless function