Explainability
The Explainability module helps you understand why your language model produces specific outputs by computing token attributions.

What is Token Attribution?
Token attribution answers the question: “Which parts of the input influenced the model’s output the most?” By analyzing how each input token contributes to the generated response, you can:

- Debug unexpected behavior - Find which tokens cause problematic outputs
- Validate model reasoning - Ensure the model focuses on relevant parts of the input
- Improve prompts - Identify which instructions have the most impact
- Build trust - Provide explanations for model decisions
Key Features
Multiple Methods
11 attribution methods including LIME, Integrated Gradients, SHAP, and more
Flexible Granularity
Analyze at token, word, or sentence level
Extensible Design
Class-based architecture for easy customization and future migrations
HuggingFace Compatible
Works with any HuggingFace causal language model
Available Attribution Methods
Gradient-Based Methods
Fast methods that require differentiable models:

| Method | Class | Description |
|---|---|---|
| Saliency | Saliency | Basic gradient magnitude (Simonyan et al., 2013) |
| Integrated Gradients | IntegratedGradients | Path-integrated gradients (Sundararajan et al., 2017) |
| GradientSHAP | GradientShap | SHAP with gradient sampling (Lundberg & Lee, 2017) |
| SmoothGrad | SmoothGrad | Noise-averaged gradients (Smilkov et al., 2017) |
| SquareGrad | SquareGrad | Squared gradient values (Hooker et al., 2019) |
| VarGrad | VarGrad | Gradient variance (Richter et al., 2020) |
| Input x Gradient | InputXGradient | Input-weighted gradients (Simonyan et al., 2013) |
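To make the difference between two of these methods concrete, here is a from-scratch sketch (not the module's API): for a linear score s(x) = w·x, the gradient with respect to each input feature is just w, so Saliency keeps the gradient's magnitude while Input x Gradient weights it by the input value.

```python
# Toy comparison of Saliency vs. Input x Gradient on a linear "model".
# The weights stand in for a real LLM; for s(x) = w . x, ds/dx = w.
w = [0.5, -2.0, 0.1]   # model weights (illustrative)
x = [1.0, 1.0, 3.0]    # input embedding values (illustrative)

grad = w  # analytical gradient of the linear score

saliency = [abs(g) for g in grad]                   # |gradient| per feature
input_x_grad = [xi * g for xi, g in zip(x, grad)]   # input-weighted gradient

print(saliency)       # [0.5, 2.0, 0.1]
print(input_x_grad)
```

Note how the third feature gets a small Saliency score but a larger Input x Gradient score, because its input value (3.0) amplifies a weak gradient.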
Perturbation-Based Methods
Model-agnostic methods that work with any model:

| Method | Class | Description |
|---|---|---|
| LIME | Lime | Local interpretable explanations (Ribeiro et al., 2016) |
| KernelSHAP | KernelShap | Kernel-based SHAP values (Lundberg & Lee, 2017) |
| Occlusion | Occlusion | Token removal impact (Zeiler & Fergus, 2014) |
| Sobol | Sobol | Sobol sensitivity indices (Fel et al., 2021) |
Quick Example
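Since the perturbation-based methods are the easiest to reason about, here is a self-contained illustration of the core idea behind Occlusion: mask one token at a time and record how much the model's score drops. This is a from-scratch sketch, not the module's actual API; the `score` function stands in for a real language model.

```python
# Minimal occlusion-style attribution, written from scratch for illustration.
def score(tokens):
    # Stand-in for a model's output score: a trivial keyword heuristic.
    return sum(1.0 for t in tokens if t in {"great", "love"})

def occlusion_attribution(tokens, mask_token="[MASK]"):
    base = score(tokens)
    attributions = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        # Attribution of token i = score drop when that token is hidden.
        attributions.append(base - score(masked))
    return attributions

tokens = ["i", "love", "this", "great", "movie"]
print(occlusion_attribution(tokens))  # [0.0, 1.0, 0.0, 1.0, 0.0]
```

Tokens whose removal hurts the score the most receive the highest attribution; the module's `Occlusion` class applies the same principle to a HuggingFace causal LM instead of a toy scorer.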
Design Philosophy
The explainability module follows these principles:

Separation of Concerns
The module focuses solely on attribution computation. Users are responsible for formatting prompts according to their model’s requirements, avoiding coupling with specific LLM formats.
Class-Based Methods
Attribution methods are implemented as classes (not string enums) for:
- Easy extension with custom methods
- Future migration to custom implementations
- Clear separation between method implementations
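The class-based design can be sketched as follows. The base-class name `AttributionMethod` and the `attribute` signature are hypothetical, but the pattern matches what the module describes: each method is a class, so adding a custom one means writing a subclass rather than editing an enum.

```python
# Hypothetical sketch of the class-based extension pattern.
from abc import ABC, abstractmethod

class AttributionMethod(ABC):
    @abstractmethod
    def attribute(self, tokens, score_fn):
        """Return one attribution score per input token."""

class LeaveOneOut(AttributionMethod):
    """Custom method: score drop when each token is deleted outright."""
    def attribute(self, tokens, score_fn):
        base = score_fn(tokens)
        return [base - score_fn(tokens[:i] + tokens[i + 1:])
                for i in range(len(tokens))]

method = LeaveOneOut()
scores = method.attribute(["a", "bad", "film"],
                          lambda ts: -1.0 if "bad" in ts else 0.0)
print(scores)  # [0.0, -1.0, 0.0]
```

Because the custom class satisfies the same interface as the built-in methods, downstream code that consumes attributions does not need to change.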
Parser Interface
A pluggable parser interface allows supporting different attribution libraries beyond interpreto.
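One way such a pluggable parser interface can look is sketched below. The names `AttributionParser` and `InterpretoParser`, and the assumed raw-output shape (a dict of token to score), are illustrative, not the module's actual API; the point is that each backend library gets its own parser that normalizes results into a common `(token, score)` form.

```python
# Hypothetical sketch of a pluggable parser interface for attribution backends.
from typing import Protocol

class AttributionParser(Protocol):
    def parse(self, raw_output) -> list[tuple[str, float]]: ...

class InterpretoParser:
    """Parser for an assumed interpreto-style output: dict of token -> score."""
    def parse(self, raw_output):
        # Normalize to (token, score) pairs, highest attribution first.
        return sorted(raw_output.items(), key=lambda kv: -kv[1])

parser: AttributionParser = InterpretoParser()
print(parser.parse({"cat": 0.2, "sat": 0.7}))  # [('sat', 0.7), ('cat', 0.2)]
```

Supporting a new attribution library then amounts to writing one small parser class rather than touching the attribution methods themselves.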
Next Steps
Attributions Guide
Detailed guide on computing and interpreting attributions
Examples
See the explainability notebook with working examples