Token Attributions

This guide covers how to compute, interpret, and visualize token attributions using Fair Forge’s explainability module.

Installation

pip install "alquimia-fair-forge[explainability]"
This installs:
  • interpreto - Attribution computation library
  • torch - PyTorch for model inference
  • transformers - HuggingFace model support

Basic Usage

Step 1: Load Your Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Disable autocast for attribution compatibility
torch.set_autocast_enabled(False)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-0.6B",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

Step 2: Format Your Prompt

You must format prompts according to your model’s requirements. The explainability module deliberately leaves prompt formatting to you, which avoids coupling it to specific LLM formats.
messages = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is photosynthesis?"}
]

# Use the tokenizer's chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
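
To make the output of the chat template less opaque, the following sketch shows roughly what a ChatML-style template produces. The function format_chatml is hypothetical and only illustrates the idea; the real markup is defined by the template shipped with your tokenizer and may include extra tokens, so keep using tokenizer.apply_chat_template in practice.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render chat messages in ChatML-style markup (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers with the role on the first line
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Answer concisely."},
    {"role": "user", "content": "What is photosynthesis?"},
]
prompt_sketch = format_chatml(messages)
print(prompt_sketch)
```

Because the prompt is passed to the explainer as a plain string, any formatting function that returns a string can be used here.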

Step 3: Create Explainer and Compute

from fair_forge.explainability import AttributionExplainer, Lime, Granularity

explainer = AttributionExplainer(
    model=model,
    tokenizer=tokenizer,
    default_method=Lime,
    default_granularity=Granularity.WORD,
    verbose=True,
)

result = explainer.explain(
    prompt=prompt,
    target="Photosynthesis is the process plants use to convert sunlight into energy.",
)

Step 4: Analyze Results

# Get top contributing words
print("Top 10 Most Important Words:")
for attr in result.get_top_k(10):
    print(f"  '{attr.text}': {attr.score:+.4f}")

# Access all attributions
for attr in result.attributions:
    print(f"Position {attr.position}: '{attr.text}' = {attr.score:.4f}")

Configuration Options

AttributionExplainer Parameters

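| Parameter           | Type                        | Default               | Description                      |
|---------------------|-----------------------------|-----------------------|----------------------------------|
| model               | PreTrainedModel             | Required              | HuggingFace model                |
| tokenizer           | PreTrainedTokenizer         | Required              | HuggingFace tokenizer            |
| default_method      | type[BaseAttributionMethod] | Lime                  | Default attribution method class |
| default_granularity | Granularity                 | WORD                  | Default granularity level        |
| result_parser       | AttributionResultParser     | InterpretoResultParser | Custom result parser            |
| verbose             | bool                        | False                 | Enable verbose logging           |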

explain() Parameters

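| Parameter   | Type                        | Default          | Description                |
|-------------|-----------------------------|------------------|----------------------------|
| prompt      | str                         | Required         | Pre-formatted input prompt |
| target      | str                         | Required         | Model output to explain    |
| method      | type[BaseAttributionMethod] | Instance default | Attribution method class   |
| granularity | Granularity                 | Instance default | Granularity level          |
| max_length  | int                         | 512              | Maximum sequence length    |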

Granularity Levels

Choose the appropriate granularity for your use case:
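from fair_forge.explainability import Granularity

# Target output to explain (same text as in Basic Usage, Step 3)
target = "Photosynthesis is the process plants use to convert sunlight into energy."

result = explainer.explain(
    prompt=prompt,
    target=target,
    granularity=Granularity.TOKEN,
)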
Use when: You need fine-grained analysis of individual tokens or are debugging tokenization issues.
Example output: ['▁What', '▁is', '▁grav', 'ity', '?']
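
To relate token-level output to word-level output, here is one plausible way to merge subword tokens into words by summing their scores. This is a sketch under assumptions: merge_tokens_to_words is a hypothetical helper, summation is only one possible aggregation, and the "▁" word-start marker is tokenizer-specific. It is not a description of how Fair Forge aggregates internally.

```python
def merge_tokens_to_words(tokens, scores, marker="▁"):
    """Merge subword tokens into words, summing scores (illustrative only)."""
    words, word_scores = [], []
    for token, score in zip(tokens, scores):
        if token.startswith(marker):
            # A marker signals the start of a new word
            words.append(token[len(marker):])
            word_scores.append(score)
        elif not words:
            # First token without a marker still starts a word
            words.append(token)
            word_scores.append(score)
        else:
            # Continuation piece: append text and accumulate the score
            words[-1] += token
            word_scores[-1] += score
    return words, word_scores


tokens = ["▁What", "▁is", "▁grav", "ity", "?"]
token_scores = [0.5, 0.25, 1.0, 0.5, -0.25]  # made-up scores for illustration
words, word_scores = merge_tokens_to_words(tokens, token_scores)
print(list(zip(words, word_scores)))
```

Note that the trailing "?" is merged into the preceding word because it carries no word-start marker.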

Attribution Methods

Using Different Methods

Pass the method class directly to explain():
from fair_forge.explainability import (
    Lime, Occlusion, Saliency, IntegratedGradients
)

# LIME (perturbation-based, recommended default)
result_lime = explainer.explain(prompt=prompt, target=target, method=Lime)

# Occlusion (perturbation-based)
result_occ = explainer.explain(prompt=prompt, target=target, method=Occlusion)

# Saliency (gradient-based, faster)
result_sal = explainer.explain(prompt=prompt, target=target, method=Saliency)

# Integrated Gradients (gradient-based, more accurate)
result_ig = explainer.explain(prompt=prompt, target=target, method=IntegratedGradients)

Method Comparison

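| Method              | Speed  | Accuracy  | Best For                    |
|---------------------|--------|-----------|-----------------------------|
| Lime                | Medium | High      | General use, model-agnostic |
| Occlusion           | Slow   | High      | When you need robustness    |
| Saliency            | Fast   | Medium    | Quick debugging             |
| IntegratedGradients | Medium | High      | Detailed analysis           |
| KernelShap          | Slow   | Very High | When accuracy is critical   |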
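
To give some intuition for why perturbation-based methods are slower, the sketch below implements the basic occlusion idea on a toy scoring function: remove one input unit at a time and record how much the score drops. It needs one scoring call per unit, whereas a gradient-based method typically needs a single backward pass. The names occlusion_attributions and toy_score are hypothetical, and the toy score stands in for a model's target likelihood; this is not the library's implementation.

```python
def occlusion_attributions(words, score_fn):
    """Attribute a score to each word by removing it and measuring the drop."""
    baseline = score_fn(words)
    attributions = []
    for i, word in enumerate(words):
        occluded = words[:i] + words[i + 1:]  # input with word i removed
        attributions.append((word, baseline - score_fn(occluded)))
    return attributions


def toy_score(words):
    """Stand-in for the model's score of the target given the input words."""
    weights = {"photosynthesis": 2.0, "what": 0.5}
    return sum(weights.get(word.lower(), 0.0) for word in words)


toy_attrs = occlusion_attributions(["What", "is", "photosynthesis"], toy_score)
for word, score in toy_attrs:
    print(f"{word}: {score:+.2f}")
```

A positive value means removing the word lowered the score, so the word supported the target.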

Batch Processing

Process multiple prompt/target pairs efficiently:
# Prepare batch (prompts must be pre-formatted)
items = [
    (
        tokenizer.apply_chat_template(
            [{"role": "user", "content": "What is AI?"}],
            tokenize=False, add_generation_prompt=True
        ),
        "AI is artificial intelligence."
    ),
    (
        tokenizer.apply_chat_template(
            [{"role": "user", "content": "What is ML?"}],
            tokenize=False, add_generation_prompt=True
        ),
        "ML is machine learning, a subset of AI."
    ),
]

# Process batch
batch_results = explainer.explain_batch(items)

print(f"Processed {len(batch_results)} items")
print(f"Total time: {batch_results.total_compute_time_seconds:.2f}s")

for i, result in enumerate(batch_results):
    print(f"\nItem {i+1}: Top words = {[a.text for a in result.get_top_k(3)]}")

Output Schema

AttributionResult

class AttributionResult(BaseModel):
    prompt: str                      # Input prompt
    target: str                      # Target output
    method: AttributionMethod        # Method used
    granularity: Granularity         # Granularity level
    attributions: list[TokenAttribution]  # Token scores
    metadata: dict[str, Any]         # Compute time, etc.

TokenAttribution

class TokenAttribution(BaseModel):
    text: str              # Token/word/sentence text
    score: float           # Attribution score (can be negative)
    position: int          # Position in sequence
    normalized_score: float | None  # Score normalized to [0, 1]

Useful Methods

# Get top K most important tokens
top_tokens = result.get_top_k(10)

# Get all attributions sorted by importance
sorted_attrs = result.top_attributions

# Export for visualization
viz_data = result.to_dict_for_visualization()
# Returns: {"tokens": [...], "scores": [...], "normalized_scores": [...]}

# Export as dict (JSON-serializable)
result_dict = result.model_dump()

Visualization

In Jupyter Notebooks

# Display interactive visualization
explainer.visualize(result)

Get HTML for Custom Display

# Get HTML string
html = explainer.visualize(result, return_html=True)

# Use in web apps, save to file, etc.
with open("attribution.html", "w") as f:
    f.write(html)
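
If you want full control over the markup, you could also render your own HTML from the dictionary returned by to_dict_for_visualization(). The sketch below assumes the dictionary shape shown under Useful Methods ("tokens", "scores", "normalized_scores") and treats 0.5 as neutral. The helper render_attribution_html and its color scheme are hypothetical and unrelated to what explainer.visualize produces.

```python
import html


def render_attribution_html(viz_data):
    """Render tokens as colored spans (hypothetical helper, not part of Fair Forge)."""
    spans = []
    for token, norm in zip(viz_data["tokens"], viz_data["normalized_scores"]):
        if norm is None:
            norm = 0.5  # treat missing normalized scores as neutral
        strength = abs(norm - 0.5) * 2  # 0.0 = neutral, 1.0 = strongest
        color = "0, 128, 0" if norm >= 0.5 else "200, 0, 0"  # green / red
        spans.append(
            f'<span style="background: rgba({color}, {strength:.2f})">'
            f"{html.escape(token)}</span>"
        )
    return "<p>" + " ".join(spans) + "</p>"


# Made-up data in the assumed shape
viz_example = {
    "tokens": ["a<b", "is", "x"],
    "scores": [1.0, 0.0, -1.0],
    "normalized_scores": [1.0, 0.5, 0.0],
}
page = render_attribution_html(viz_example)
print(page)
```

Token text is escaped before insertion so that characters such as "<" in a prompt do not break the page.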

Custom Result Parsers

Implement custom parsers for different attribution libraries:
from fair_forge.explainability import (
    AttributionExplainer,
    AttributionResultParser,
)

class MyCustomParser(AttributionResultParser):
    def parse(self, raw_result) -> tuple[list[str], list[float]]:
        # Extract tokens and scores from your custom format
        tokens = raw_result["words"]
        scores = raw_result["importance"]
        return tokens, scores

# Use custom parser
explainer = AttributionExplainer(
    model=model,
    tokenizer=tokenizer,
    result_parser=MyCustomParser(),
)

Convenience Function

For one-off computations without creating an explainer:
from fair_forge.explainability import compute_attributions, Granularity, Lime

result = compute_attributions(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    target=target,
    method=Lime,
    granularity=Granularity.WORD,
)

Interpreting Results

Positive vs Negative Scores

  • Positive scores: Token increases likelihood of the target output
  • Negative scores: Token decreases likelihood of the target output
  • Near-zero scores: Token has minimal impact

Normalized Scores

Normalized scores map raw scores to the [0, 1] range for easier comparison:
  • 1.0 = Maximum positive contribution
  • 0.5 = Neutral (when all scores are equal)
  • 0.0 = Maximum negative contribution
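
The three reference points above are consistent with min-max scaling. The library's exact formula is not shown here, so the following is a sketch under that assumption; min_max_normalize is a hypothetical helper.

```python
def min_max_normalize(scores):
    """Scale scores to [0, 1]; return 0.5 for every score when all are equal."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # No spread to scale by, so every score is treated as neutral
        return [0.5] * len(scores)
    return [(score - lowest) / (highest - lowest) for score in scores]


print(min_max_normalize([2.0, -1.0, 0.5]))  # → [1.0, 0.0, 0.5]
print(min_max_normalize([3.0, 3.0]))        # → [0.5, 0.5]
```

Under this scheme a normalized score of 0.5 does not necessarily mean a raw score of zero; it only marks the midpoint between the lowest and highest raw scores in that result.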

Example Analysis

print("Analysis of attribution results:")
print("=" * 50)

# Positive contributors
positive = [a for a in result.attributions if a.score > 0]
print(f"\nPositive contributors ({len(positive)} tokens):")
for attr in sorted(positive, key=lambda x: x.score, reverse=True)[:5]:
    print(f"  '{attr.text}': +{attr.score:.4f}")

# Negative contributors
negative = [a for a in result.attributions if a.score < 0]
print(f"\nNegative contributors ({len(negative)} tokens):")
for attr in sorted(negative, key=lambda x: x.score)[:5]:
    print(f"  '{attr.text}': {attr.score:.4f}")

Best Practices

Attribution methods work better with float16 precision:
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    torch_dtype=torch.float16,
)
Some attribution methods conflict with autocast, so disable it before computing attributions:
torch.set_autocast_enabled(False)
LIME is a good default: it is model-agnostic and provides reliable results. Switch to gradient-based methods if you need speed.
Word-level attributions are most interpretable. Use token-level only for debugging tokenization issues.
Different methods may highlight different aspects. Compare results from 2-3 methods for important analyses.
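
One simple way to compare two methods is to measure how much their top-ranked positions overlap. The sketch below computes the Jaccard overlap of the top-k positions ranked by absolute score. The helper top_k_overlap is hypothetical, and ranking by absolute score is a choice made for this example rather than something the library prescribes.

```python
def top_k_overlap(attrs_a, attrs_b, k=5):
    """Jaccard overlap of the top-k positions of two (position, score) lists."""

    def top_positions(attrs):
        ranked = sorted(attrs, key=lambda pair: abs(pair[1]), reverse=True)
        return {position for position, _ in ranked[:k]}

    top_a, top_b = top_positions(attrs_a), top_positions(attrs_b)
    union = top_a | top_b
    # Two empty rankings are treated as being in full agreement
    return len(top_a & top_b) / len(union) if union else 1.0


# Made-up (position, score) pairs standing in for two methods' results
lime_like = [(0, 0.9), (1, 0.1), (2, -0.8), (3, 0.05)]
saliency_like = [(0, 0.7), (1, 0.6), (2, 0.1), (3, 0.0)]
overlap = top_k_overlap(lime_like, saliency_like, k=2)
print(f"Top-2 overlap: {overlap:.2f}")
```

With real results, the input lists could be built as [(a.position, a.score) for a in result_lime.attributions] and likewise for result_sal. A low overlap suggests the methods disagree and the analysis deserves a closer look.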

Troubleshooting

If you run out of memory, reduce max_length or use a smaller model:
result = explainer.explain(prompt=prompt, target=target, max_length=256)
Check that your prompt and target are not empty and that the model can process them:
# Verify tokenization works
tokens = tokenizer(prompt, return_tensors="pt")
print(f"Prompt has {tokens.input_ids.shape[1]} tokens")
If computation is too slow, switch to a faster gradient-based method:
from fair_forge.explainability import Saliency
result = explainer.explain(prompt=prompt, target=target, method=Saliency)

Next Steps

  • Explainability Overview: Learn about the module design and available methods
  • Example Notebook: See working examples in Jupyter