Explainability

The Explainability module helps you understand why your language model produces specific outputs by computing token attributions.

What is Token Attribution?

Token attribution answers the question: “Which parts of the input influenced the model’s output the most?” By analyzing how each input token contributes to the generated response, you can:
  • Debug unexpected behavior - Find which tokens cause problematic outputs
  • Validate model reasoning - Ensure the model focuses on relevant parts of the input
  • Improve prompts - Identify which instructions have the most impact
  • Build trust - Provide explanations for model decisions

Key Features

Multiple Methods

Eleven attribution methods, including LIME, Integrated Gradients, and SHAP

Flexible Granularity

Analyze at token, word, or sentence level

Extensible Design

Class-based architecture for easy customization and future migrations

HuggingFace Compatible

Works with any HuggingFace causal language model

Available Attribution Methods

Gradient-Based Methods

Fast methods that require differentiable models:
| Method | Class | Description |
| --- | --- | --- |
| Saliency | Saliency | Basic gradient magnitude (Simonyan et al., 2013) |
| Integrated Gradients | IntegratedGradients | Path-integrated gradients (Sundararajan et al., 2017) |
| GradientSHAP | GradientShap | SHAP with gradient sampling (Lundberg & Lee, 2017) |
| SmoothGrad | SmoothGrad | Noise-averaged gradients (Smilkov et al., 2017) |
| SquareGrad | SquareGrad | Squared gradient values (Hooker et al., 2019) |
| VarGrad | VarGrad | Gradient variance (Richter et al., 2020) |
| Input x Gradient | InputXGradient | Input-weighted gradients (Shrikumar et al., 2016) |
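To build intuition for what gradient-based methods measure, here is a toy, self-contained sketch (not the fair_forge implementation): it approximates the gradient of a made-up scoring function with finite differences, standing in for the gradient of the target log-probability with respect to each input embedding.

```python
# Toy illustration of gradient-based attribution. The scoring function
# below is hypothetical; real methods differentiate the model itself.

def score(embeddings):
    # Stand-in for the model's score of the target output:
    # a simple weighted sum over per-token "embedding" values.
    weights = [0.1, 0.9, 0.3]
    return sum(w * e for w, e in zip(weights, embeddings))

def saliency(embeddings, eps=1e-6):
    # Saliency ~ |gradient|: nudge each input and measure the
    # change in the score (finite-difference approximation).
    base = score(embeddings)
    grads = []
    for i in range(len(embeddings)):
        bumped = list(embeddings)
        bumped[i] += eps
        grads.append(abs((score(bumped) - base) / eps))
    return grads

print(saliency([1.0, 1.0, 1.0]))  # largest value for the most influential input
```

The input with the largest weight receives the largest attribution, which is exactly the signal these methods recover from a real model's backward pass.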

Perturbation-Based Methods

Model-agnostic methods that work with any model:
| Method | Class | Description |
| --- | --- | --- |
| LIME | Lime | Local interpretable explanations (Ribeiro et al., 2016) |
| KernelSHAP | KernelShap | Kernel-based SHAP values (Lundberg & Lee, 2017) |
| Occlusion | Occlusion | Token removal impact (Zeiler & Fergus, 2014) |
| Sobol | Sobol | Sobol sensitivity indices (Fel et al., 2021) |
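The core idea behind perturbation-based methods can be sketched with occlusion on a toy relevance function (hypothetical, not the library's code): remove each token in turn and record how much the score drops.

```python
# Toy illustration of occlusion-style attribution. The relevance
# function is made up; real occlusion re-scores the model's output.

def relevance(tokens):
    # Stand-in for the model's probability of producing the target
    # answer: count overlap with a hypothetical target vocabulary.
    target_vocab = {"gravity", "force", "attraction"}
    return sum(1 for t in tokens if t.lower() in target_vocab)

def occlusion(tokens):
    base = relevance(tokens)
    scores = {}
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]
        scores[tok] = base - relevance(reduced)  # score drop = importance
    return scores

print(occlusion(["What", "is", "gravity", "?"]))  # only "gravity" scores 1
```

Because only the model's outputs are queried, this family of methods works with any model, including ones you cannot differentiate through.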

Quick Example

from transformers import AutoModelForCausalLM, AutoTokenizer
from fair_forge.explainability import AttributionExplainer, Lime, Granularity

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Format prompt (user responsibility - varies by model)
messages = [{"role": "user", "content": "What is gravity?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Compute attributions
explainer = AttributionExplainer(model, tokenizer)
result = explainer.explain(
    prompt=prompt,
    target="Gravity is the force of attraction between objects.",
    method=Lime,
)

# Show most important words
for attr in result.get_top_k(5):
    print(f"'{attr.text}': {attr.score:.4f}")
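The Granularity option controls whether scores are reported per token, per word, or per sentence. The aggregation idea behind coarser granularities can be sketched in a few lines (a toy sketch of summing subword-token scores per word, not fair_forge internals):

```python
# Toy sketch: roll token-level attribution scores up to word level by
# summing the scores of subword tokens that belong to the same word.
# The scores and token-to-word mapping below are invented for illustration.

token_scores = [
    ("grav", 0.40), ("ity", 0.15),  # two subword tokens of one word
    ("is", 0.05),
    ("a", 0.02),
    ("force", 0.30),
]
word_map = [0, 0, 1, 2, 3]  # token index -> word index

word_scores = {}
for (tok, score), word_idx in zip(token_scores, word_map):
    word_scores[word_idx] = word_scores.get(word_idx, 0.0) + score

print(word_scores)  # word 0 ("gravity") aggregates both subword scores
```

Coarser granularities are often easier to read for subword tokenizers, where a single word can be split across several tokens.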

Design Philosophy

The explainability module follows three principles:
  • Attribution only - The module focuses solely on attribution computation. Users are responsible for formatting prompts according to their model’s requirements, which avoids coupling to any specific LLM prompt format.
  • Class-based methods - Attribution methods are implemented as classes (not string enums) for:
      • Easy extension with custom methods
      • Future migration to custom implementations
      • Clear separation between method implementations
  • Pluggable parsing - A parser interface allows supporting attribution libraries beyond interpreto.
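In this class-based design, adding a custom method amounts to subclassing a common interface. The sketch below is hypothetical: the base-class name and method signature are invented here to illustrate the shape of such an extension, not fair_forge's actual API.

```python
# Hypothetical extension-point sketch (names and signature are
# illustrative only, not fair_forge's real base class).

class AttributionMethod:
    """Stand-in base class: maps (prompt tokens, target) to scores."""
    def attribute(self, tokens, target):
        raise NotImplementedError

class UniformBaseline(AttributionMethod):
    """Trivial custom method: equal scores, useful as a sanity check
    when comparing against real attribution methods."""
    def attribute(self, tokens, target):
        return [1.0 / len(tokens)] * len(tokens)

scores = UniformBaseline().attribute(["What", "is", "gravity", "?"], "Gravity...")
print(scores)  # four equal scores summing to 1
```

Passing the class itself (as with method=Lime in the Quick Example) rather than a string keeps method selection type-checked and extensible.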

Next Steps

Attributions Guide

Detailed guide on computing and interpreting attributions

Examples

See the explainability notebook with working examples