A Brief Tour of Interpretability
What does it mean for a model to be interpretable?
What is the Purpose of Interpretability?
A hefty amount of deep learning research outside of the mechanistic interpretability community nebulously invokes interpretability, but fails to elaborate on what interpretability looks like in the context of deep learning and why it might be a useful ideal. In this article, we take a tour of interpretability to build an understanding of what it might mean for deep learning.
We propose control as the concrete goal of interpretability research. We believe that control can manifest in a variety of ways, some of which include:
- Concept erasure: Can we definitively erase a concept from a model?
- Concept steering and enhancement: Can we steer a model to prefer certain concepts over others?
- Improved output quality: Can we improve the quality of outputs to better match the objective?
- Adversarial robustness: Can we make models harder to fool and less brittle to adversarial attacks?
- Continual learning: Can we efficiently update models to incorporate new patterns?
- Compression: Can we shed redundancies to miniaturize models while maintaining the desired performance?
Levels and Categories
We adopt the suggestion by He et al. (2024) to use the three levels of analysis (Marr 2010) as a way to compare the goals of work across interpretability literature. In addition, we use the ‘paradigms of interpretability’ by Bereska and Gavves (2024) to describe the approaches to interpretability, resulting in the 12-way classification in Figure 7.
We note that the two systems are unrelated and were not designed with each other in mind.
Three Levels (Marr 2010; He et al. 2024)
Marr’s framework splits the goals of neuroscience research into three levels. There is an inherent vertical separation, going from general to specific.
- Computational: Given the required inputs, system state, and current time, what does the system output?
- capabilities, patterns, and constraints of the system; are there general principles?
- Algorithmic: What are the algorithmic transformations and how is the required information represented?
- how are concepts represented and transformed?
- Implementation: Which physical components of the system execute the computation?
- can functions be localized? is there specialization/division of labor? what happens if you ablate a component?
Taxonomy (Bereska and Gavves 2024)
The “paradigms” categorize interpretability research based on the tools and assumptions a study uses.
Behavioral: Analyzes input-output relations whilst assuming a black-box model.
Attributional: Traces predictions to individual input contributions based on gradients.
Concept-based: Probes learned representations for high-level concepts and patterns.
Mechanistic: Studies the individual components of a model to discover causal relationships and precise computations.
The Tour
We pool together some research across deep learning and neuroscience to provide a snapshot of interpretability, fitting each work into the 12-way classification to get the lay of the land. The tour culminates in Figure 7.
Dictionary Learning
In compressed sensing, our core problem looks like \[ \mathbf{y} = A \mathbf{x}, \] where \(\mathbf{x} \in \mathbb{R}^n\) is (close to) \(k\)-sparse and we have access to measurements \(\mathbf{y} \in \mathbb{R}^m\). We would like to recover the vector \(\mathbf{x}\) from \(A \mathbf{x}\), with the hope that \(A\) has few rows (i.e., \(m \ll n\)).
In the context of deep learning, the \(\mathbf{x}\) we are recovering represents interpretable “features” (or concepts) and \(\mathbf{y}\) refers to the messy, entangled activations that a neural network generates.
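As a toy illustration of this setup, the sketch below generates a random measurement matrix \(A\) and a \(k\)-sparse \(\mathbf{x}\), then recovers \(\mathbf{x}\) from \(\mathbf{y} = A\mathbf{x}\) with ISTA (iterative soft-thresholding), one of many possible solvers. The dimensions and regularization weight are arbitrary choices, and in the deep learning setting the dictionary itself is typically learned (e.g., by a sparse autoencoder) rather than given.

```python
# A small numerical sketch of the compressed-sensing setup above: recover a
# k-sparse x from y = A x using ISTA. Dimensions and the regularization weight
# are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                      # signal dim, measurements, sparsity

A = rng.normal(size=(m, n)) / np.sqrt(m)  # random measurement/dictionary matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
y = A @ x_true

# ISTA for min_x 0.5*||y - Ax||^2 + lam*||x||_1
lam = 0.01
step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of the gradient
x_hat = np.zeros(n)
for _ in range(500):
    grad = A.T @ (A @ x_hat - y)
    z = x_hat - step * grad
    x_hat = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold

print("relative recovery error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
print("nonzeros recovered:", np.count_nonzero(np.abs(x_hat) > 1e-3))
```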
Gradient-based
Gradient-based methods, very popular for computer vision networks, compute the gradient of an output with respect to some aspect of the model in order to determine how “important” a model element is for the selected output.
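To make this concrete, here is a minimal vanilla-gradient saliency sketch: backpropagate a single logit to the input and treat the gradient magnitude as a per-pixel importance score. The tiny untrained model and random input are stand-ins for a trained vision network and a real image; this is just the basic quantity many attribution methods build on.

```python
# A minimal sketch of vanilla gradient "saliency": the gradient of one output
# logit with respect to the input pixels of a toy, untrained CNN.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 28 * 28, 10),
)
model.eval()

x = torch.randn(1, 1, 28, 28, requires_grad=True)
logits = model(x)
target = logits.argmax(dim=1).item()

# Backpropagate from the chosen logit to the input.
logits[0, target].backward()

# Large-magnitude gradients mark input locations the logit is most sensitive to.
saliency = x.grad.abs().squeeze()
print("saliency map shape:", tuple(saliency.shape))          # (28, 28)
print("most 'important' pixel:", divmod(saliency.argmax().item(), 28))
```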
Training Dynamics
Training dynamics literature studies how network weights evolve over the course of training.
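A minimal sketch of the flavor of measurement involved: log per-layer weight norms and drift from initialization at each epoch of a toy regression run. The model, task, and metrics here are illustrative choices, not any particular study's protocol.

```python
# A toy sketch of a training-dynamics style measurement: record per-parameter
# weight norms and distance from initialization at every epoch.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
init_weights = [p.detach().clone() for p in model.parameters()]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(256, 10), torch.randn(256, 1)

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

    # Snapshot simple statistics of the weight trajectory.
    norms = [round(p.norm().item(), 3) for p in model.parameters()]
    drift = [round((p - p0).norm().item(), 3)
             for p, p0 in zip(model.parameters(), init_weights)]
    print(f"epoch {epoch}: loss={loss.item():.3f} norms={norms} drift={drift}")
```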
Matrix Factorization
Matrix factorization, while a very common and useful tool, is not often the centerpiece of an interpretability analysis.
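Still, a sketch helps show what such an analysis might look like: collect a (samples × neurons) matrix of post-ReLU activations and factor it with non-negative matrix factorization into a few non-negative "parts," each a pattern over neurons. The toy layer and random inputs below are placeholders.

```python
# A small sketch of matrix factorization for interpretability: factor a
# (samples x neurons) activation matrix with NMF into a few non-negative parts.
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import NMF

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(20, 50), nn.ReLU())

with torch.no_grad():
    acts = layer(torch.randn(500, 20)).numpy()   # non-negative, shape (500, 50)

# Factor acts ~= W @ H, with W >= 0 (per-sample loadings) and
# H >= 0 (per-component neuron patterns).
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(acts)     # (500, 5): how much each sample uses each component
H = nmf.components_             # (5, 50):  which neurons each component recruits

print("reconstruction error:", nmf.reconstruction_err_)
print("top neurons for component 0:", np.argsort(H[0])[::-1][:5])
```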
Common Threads
We observe some common qualities that interpretability research looks for: polysemanticity, sparsity, additivity, non-negativity, and assumptions of linearity.
Sparsity: Sparse autoencoders/transcoders are trained to promote sparsity in their latent dimension, while a key step in reverse engineering the grokking algorithm came from the authors observing sparsity in the Fourier basis (a rough version of that check is sketched below).
Functional assignment: Gradient-based attribution, dictionary learning, matrix factorization, and reverse engineering (from grokking) all have an element of assigning functionality to neurons or attention heads.
Learning phases: Grokking, DSB, and the TCA study (along with much of the Koopman-for-deep-learning literature) all suggest some notion of phase changes. Using metrics derived from the reverse-engineered algorithm, the grokking paper identifies three phases of learning (i.e., memorization, circuit formation, cleanup) for that particular transformer; DSB suggests using higher-order statistics, and the TCA study points to division of labor.
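For the Fourier-basis sparsity check mentioned above, a rough sketch might look like the following: Fourier-transform an embedding-like matrix along its input dimension and measure how concentrated the spectral energy is. The random matrix here is only a stand-in; in the grokking setting this kind of check is run on a trained modular-arithmetic transformer's embeddings, where the energy is reported to concentrate in a handful of frequencies.

```python
# A rough sketch of a Fourier-sparsity check: Fourier transform an embedding
# matrix along the token/input dimension and see how much energy sits in a few
# frequencies. A random matrix (used here) spreads energy roughly uniformly.
import numpy as np

rng = np.random.default_rng(0)
p, d = 113, 128                      # number of residues, embedding dimension
W_E = rng.normal(size=(p, d))        # placeholder embedding matrix

F = np.fft.rfft(W_E, axis=0)         # spectrum over the input dimension
freq_energy = np.linalg.norm(F, axis=1) ** 2
freq_energy /= freq_energy.sum()

top = np.sort(freq_energy)[::-1]
print("energy in top 5 frequencies:", top[:5].sum())
# ~uniform for a random matrix; close to 1 for a "grokked" embedding.
```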
References
Citation
@online{aswani2025,
author = {Aswani, Nishant},
title = {A {Brief} {Tour} of {Interpretability}},
date = {2025-06-27},
url = {https://nishantaswani.com/articles/interpretability/interpretability.html},
langid = {en}
}