A Brief Tour of Interpretability

What does it mean for a model to be interpretable?

Nishant Aswani

Published June 27, 2025

What is the Purpose of Interpretability?

A hefty amount of deep learning research outside of the mechanistic interpretability community nebulously invokes interpretability, but fails to elaborate on what interpretability looks like in the context of deep learning and why it might be a useful ideal. In this article, we take a tour of interpretability to build an understanding of what it might mean for deep learning.

We propose control as the concrete goal of interpretability research. We believe that control can manifest in a variety of ways, some of which include:

  • Concept erasure: Can we definitively erase a concept from a model?
  • Concept steering and enhancement: Can we steer a model to prefer certain concepts over others?
  • Improved output quality: Can we improve the quality of outputs to better match the objective?
  • Adversarial robustness: Can we make models harder to fool and less brittle to adversarial attacks?
  • Continual learning: Can we efficiently update models to incorporate new patterns?
  • Compression: Can we shed redundancies to miniaturize models while maintaining the desired performance?
The Goal of Interpretability

Ultimately, interpretability for deep learning should provide control: a harness to allow users to predictably manipulate, update, shrink, and secure deep networks.

Levels and Categories

We adopt the suggestion by He et al. (2024) to use the three levels of analysis (Marr 2010) as a way to compare the goals of work across interpretability literature. In addition, we use the ‘paradigms of interpretability’ by Bereska and Gavves (2024) to describe the approaches to interpretability, resulting in the 12-way classification in Figure 7.

We note that the two systems are unrelated: neither was designed with the other in mind.

Three Levels (Marr 2010; He et al. 2024)

Marr’s framework splits the goals of neuroscience research into three levels. There is an inherent vertical separation, going from general to specific.

  • Computational: Given the required inputs, system state, and current time, what does the system output?
    • capabilities, patterns, and constraints of the system; are there general principles?
  • Algorithmic: What are the algorithmic transformations and how is the required information represented?
    • how are concepts represented and transformed?
  • Implementation: Which physical components of the system execute the computation?
    • can functions be localized? is there specialization/division of labor? what happens if you ablate a component?
Taxonomy (Bereska and Gavves 2024)

The “paradigms” categorize interpretability research based on the tools and assumptions a study uses.

  • Behavioral: Analyzes input-output relations whilst assuming a black-box model.

  • Attributional: Traces predictions to individual input contributions based on gradients.

  • Concept-based: Probes learned representations for high-level concepts and patterns.

  • Mechanistic: Studies the individual components of a model to discover causal relationships and precise computations.

The Tour

We pool together some research across deep learning and neuroscience to provide a snapshot of interpretability, in turn fitting it into the 12-way classification to get the lay of the land. The tour culminates in Figure 7.

Dictionary Learning

In compressed sensing, our core problem looks like \[ \mathbf{y} = A \mathbf{x}, \]where \(\mathbf{x} \in \mathbb{R}^n\) is (close to) \(k\)-sparse and we have access to measurements \(\mathbf{y} \in \mathbb{R}^m\). We would like to recover the vector \(\mathbf{x}\) from \(A \mathbf{x}\), with the hope that \(A\) has few rows.

In the context of deep learning, the \(\mathbf{x}\) we are recovering represents interpretable “features” (or concepts) and \(\mathbf{y}\) refers to the messy, entangled activations that a neural network generates.
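As a toy illustration (a minimal sketch assuming numpy and scikit-learn; the dimensions, sparsity level, and regularization strength are arbitrary), a \(k\)-sparse \(\mathbf{x}\) can be recovered from \(\mathbf{y} = A\mathbf{x}\) with an \(\ell_1\)-penalized regression:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 60, 5                      # signal dim, number of measurements, sparsity

# k-sparse ground truth and a random measurement matrix
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)
y = A @ x

# l1-penalized regression (basis pursuit flavour) recovers the sparse signal
lasso = Lasso(alpha=1e-3, fit_intercept=False, max_iter=10_000)
lasso.fit(A, y)
x_hat = lasso.coef_

print("recovered support:", np.sort(np.nonzero(np.abs(x_hat) > 1e-2)[0]))
print("true support:     ", np.sort(np.nonzero(x)[0]))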

Sparse Autoencoders (SAE) attempt to unravel polysemanticity by increasing the dimensionality of activations to disentangle latent factors.

Let \(n\) be the input and \(m\) the output dimension, where typically \(m\geq n\). We have an encoder \(W_{enc} \in \mathbb{R}^{m\times n}\) , a decoder \(W_{dec} \in \mathbb{R}^{n\times m}\), and biases \(\mathbf{b}_{enc}\in \mathbb{R}^{m}, \mathbf{b}_{dec}\in \mathbb{R}^{n}\).

A feature \(f_{i}\) is defined as

\[ f_{i}(x) = \text{ReLU}(W_{enc}(\mathbf{x}-\mathbf{b}_{dec})+\mathbf{b}_{enc})_{i}. \] The model is trained to reconstruct

\[ \hat{\mathbf{x}} = W_{dec}\mathbf{f} + \mathbf{b}_{dec} \] with the objective function \[ \mathcal{L} = \frac{1}{|X|} \sum\limits_{x\in X} \lVert \mathbf{x} - \hat{\mathbf{x}} \rVert^{2}_{2} + \lambda \lVert \mathbf{f} \rVert_{1}. \]
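A minimal PyTorch sketch of this setup, following the equations above (the module and hyperparameter names are mine, not taken from any particular SAE codebase):

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n, m, l1_coeff=1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(m, n) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n, m) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(m))
        self.b_dec = nn.Parameter(torch.zeros(n))
        self.l1_coeff = l1_coeff

    def forward(self, x):
        # f(x) = ReLU(W_enc (x - b_dec) + b_enc)
        f = torch.relu((x - self.b_dec) @ self.W_enc.T + self.b_enc)
        # x_hat = W_dec f + b_dec
        x_hat = f @ self.W_dec.T + self.b_dec
        return x_hat, f

    def loss(self, x):
        x_hat, f = self(x)
        recon = ((x - x_hat) ** 2).sum(dim=-1).mean()   # squared l2 reconstruction error
        sparsity = f.abs().sum(dim=-1).mean()           # l1 penalty on feature activations
        return recon + self.l1_coeff * sparsity

Training amounts to minimizing this loss over a large dataset of activations \(\mathbf{x}\) collected from the network under study.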

Figure 1: Figure from (Bereska and Gavves 2024)

Notes on terminology (see Bricken et al. 2023):

polysemanticity: when a neuron represents more than one feature.

superposition: neural networks “want to represent more features than they have neurons”, so they exploit a property of high-dimensional spaces to simulate a model with many more neurons.

A transcoder (Dunefsky, Chlenski, and Nanda 2024) approximates the output of a transformer block’s MLP sublayer as a sparse, linear combination of features.

Let \(n\) be the input and \(m\) the output dimension, where typically \(m\geq n\). We have an encoder \(W_{enc} \in \mathbb{R}^{m\times n}\) , a decoder \(W_{dec} \in \mathbb{R}^{n\times m}\), and biases \(\mathbf{b}_{enc}\in \mathbb{R}^{m}, \mathbf{b}_{dec}\in \mathbb{R}^{n}\).

The activation of feature \(i\) is \[ \mathbf{z} = \text{ReLU}(W_{enc}(\mathbf{x})+\mathbf{b}_{enc}), \]

while output of the transcoder is \[ \text{TC}(\mathbf{x}) = W_{dec}\mathbf{z} + \mathbf{b}_{dec}. \] Unlike the SAE, the transcoder does not reconstruct the activations of the MLP sublayer. Rather, it is a direct approximation of the MLP sublayer with the goal of learning the output as a linear combination of features.

Hence, it uses the objective \[ \mathcal{L} = \lVert \text{MLP}(\mathbf{x}) - \text{TC}(\mathbf{x}) \rVert^{2}_{2} + \lambda \lVert \mathbf{z} \rVert_{1}, \] where the first term measures “faithfulness” to the MLP output and the second is a sparsity penalty.
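The transcoder differs from the SAE sketch above only in its training target: it matches \(\text{MLP}(\mathbf{x})\) rather than reconstructing \(\mathbf{x}\). A sketch of the objective, reusing the same (hypothetical) parameter shapes:

import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, n, m, l1_coeff=1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(m, n) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n, m) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(m))
        self.b_dec = nn.Parameter(torch.zeros(n))
        self.l1_coeff = l1_coeff

    def forward(self, x):
        z = torch.relu(x @ self.W_enc.T + self.b_enc)    # feature activations z
        return z @ self.W_dec.T + self.b_dec, z          # TC(x), z

    def loss(self, x, mlp_out):
        # faithfulness to the MLP sublayer's output plus a sparsity penalty on z
        tc_out, z = self(x)
        faithfulness = ((mlp_out - tc_out) ** 2).sum(dim=-1).mean()
        return faithfulness + self.l1_coeff * z.abs().sum(dim=-1).mean()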

Figure 2

Gradient-based

Gradient-based methods, very popular for computer vision networks, compute the gradient of an output with respect to some aspect of the model in order to determine how “important” a model element is for the selected output.

Attribution patching is an extension of activation patching, which we describe first. Activation patching centers on the intervention effect \(\mathcal{I}\)

\[ \mathcal{I}(n; x^{\text{clean}}, x^{\text{noise}}) = \mathcal{L}(\mathcal{M}(x^{\text{clean}}| \text{ do}(n \leftarrow n(x^{\text{noise}})))) - \mathcal{L}(\mathcal{M}(x^{\text{clean}})). \] where \(x^{\text{clean}}, x^{\text{noise}}\) are a pair of original tokens and perturbed tokens, \(\mathcal{M}\) refers to the model, \(\mathcal{L}\) refers to the error of the model, and \(n\) refers to an activation in the model.

Hence, \(\mathcal{I}\) measures the difference between the quality of the model on a clean input and its quality on the same clean input after one or more activations have been overwritten with their values from a noisy input.

Attribution patching attempts to approximate the intervention effect above by using a first-order Taylor expansion around \(n(x^{\text{noise}})\approx n(x^{\text{clean}})\), resulting in

\[ \hat{\mathcal{I}}(n; x^{\text{clean}}, x^{\text{noise}}) = (n(x^{\text{noise}})- n(x^{\text{clean}}))^{T} \frac{\partial \mathcal{L}(\mathcal{M}(x^{\text{clean}}))}{\partial n}\bigg|_{n=n(x^{\text{clean}})}, \] a formulation that now allows computing the intervention effect for all nodes in the model at once. They proceed by computing the expected value over a dataset of clean and noisy input pairs to obtain an expected contribution for each activation

\[ \hat{c}(n) = \mathbb{E}_{x^{\text{clean}}, x^{\text{noise}}} [|\hat{\mathcal{I}}(n; x^{\text{clean}}, x^{\text{noise}}) | ]. \]
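A rough sketch of how \(\hat{\mathcal{I}}\) might be estimated for a single activation in PyTorch (the layer handle and loss function are placeholders; practical implementations cache every activation and its gradient in a single forward and backward pass):

import torch

def attribution_patching_estimate(model, layer, loss_fn, x_clean, x_noise):
    # Approximate the effect of patching `layer`'s output with its value on
    # x_noise, via a first-order Taylor expansion around the clean activation.
    cache = {}

    def save_to(name):
        def hook(module, inputs, output):
            cache[name] = output
        return hook

    # Forward and backward on the clean input; keep the activation and its gradient.
    handle = layer.register_forward_hook(save_to("clean"))
    loss = loss_fn(model(x_clean))
    cache["clean"].retain_grad()
    loss.backward()
    grad_clean = cache["clean"].grad
    handle.remove()

    # Forward on the noisy input; only the activation itself is needed.
    handle = layer.register_forward_hook(save_to("noise"))
    with torch.no_grad():
        model(x_noise)
    handle.remove()

    # I_hat = (n(x_noise) - n(x_clean))^T dL/dn, evaluated at n(x_clean)
    return ((cache["noise"] - cache["clean"].detach()) * grad_clean).sum()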

Grad-CAM is a slightly more involved variation of the CAM method. For each class \(c\), Grad-CAM first averages the gradients of the score \(y^c\) with respect to each feature map activation \(A^{k} \in \mathbb{R}^{m \times n}\)

\[ \alpha_{k}^{c} = \frac{1}{Z} \sum\limits_{i}^{m} \sum\limits_{j}^{n} \frac{\partial{y^{c}}}{\partial{A^{k}_{ij}}}, \] where \(k\) indexes the feature maps and \(Z = mn\) is the number of elements in \(A^k\), so \(\alpha_k^c\) is a global average pooling of the gradients. The final output is a weighted sum of the feature map activations passed through a ReLU, producing a coarse heatmap that can be overlaid on the input.

\[ L^{c}= \text{ReLU}(\sum\limits_{k}\alpha^c_{k}A^k). \]
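A compact sketch of the procedure for a convolutional classifier (the choice of target layer is model-specific; this is a generic version, not the authors' reference implementation):

import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    # Cache the target layer's feature maps and their gradients via hooks.
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda mod, inp, out: acts.setdefault("A", out))
    h2 = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: grads.setdefault("dA", gout[0]))

    scores = model(image)                    # (1, num_classes)
    scores[0, class_idx].backward()          # gradient of y^c w.r.t. A^k
    h1.remove(); h2.remove()

    A, dA = acts["A"], grads["dA"]           # both (1, K, H, W)
    alpha = dA.mean(dim=(2, 3), keepdim=True)            # global-average-pooled gradients
    cam = F.relu((alpha * A).sum(dim=1, keepdim=True))   # weighted sum of feature maps + ReLU
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heatmap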

Figure 3: Figure from (Selvaraju et al. 2020)

Training Dynamics

The training dynamics literature studies the evolution of network weights over the course of training.

The distributional simplicity bias (DSB) conjectures that any parametric model trained on a classification task using stochastic gradient descent initially exploits lower-order input statistics to classify its inputs. As training progresses, the network relies on increasingly higher-order statistics of the input.

This phenomenon is demonstrated empirically as a result of some cleverly designed experiments:

Refinetti, Ingrosso, and Goldt (2022) train ResNet models on approximations of the CIFAR-10 dataset which vary in how true they are to the original data. In addition, they train a model on the original dataset as a baseline. Periodically through training, they test each of these models for accuracy on the original dataset. They find that each model trained on a clone dataset follows the test accuracy curve (across training steps) of the model trained on the original dataset up to some point, after which its accuracy ‘collapses’: the truer the approximation, the later the point of collapse.

Belrose et al. (2024) train a model on CIFAR-10 and test it on approximations of the CIFAR-10 dataset. They observe that accuracy on the simpler approximations peaks earlier in training than accuracy on the more faithful ones.
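One of the simplest clones in this style of experiment keeps only the first- and second-order statistics of the data. A sketch of how such a “Gaussian clone” could be sampled (an illustration of the idea, not the authors' exact pipeline):

import numpy as np

def gaussian_clone(X, n_samples, seed=0):
    # X: flattened images of one class, shape (N, D).
    # Returns samples matching only the mean and covariance of X.
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_samples)

# e.g. build a clone dataset class by class:
# X_clone = np.vstack([gaussian_clone(X[y == c], 5000) for c in range(10)])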

Figure 4

The following notation describes the 1-layer transformer studied in the grokking work below. \(W_{E} \in \mathbb{R}^{d \times p}\) refers to the embedding matrix, \(t_i\) is the \(i\)-th token, and \(p_i\) is the position embedding for token \(i\).

\(x^{(i)}\) denotes a “checkpoint” in the residual stream. For a 1-layer transformer model:

  • \(x^{(0)}\) is right after the embedding matrix
  • \(x^{(1)}\) is after the attention component
  • \(x^{(2)}\) is after the MLP component.

For a given head \(j\):

  • \(W_{K}^{j}\) is the key matrix
  • \(W_{Q}^{j}\) is the query matrix
  • \(W_{V}^{j}\) is the value matrix
  • \(W_{O}^{j}\) is the attention output
  • \(W_O^j W_V^j (x^{(0)})\) is called the OV circuit

The forward pass is then

\[ \begin{align*} x_i^{(0)} &= W_E t_i + p_i\\ A^{j} &= \text{softmax} \left( {x^{(0)}}^T {W_{K}^{j}}^T (W_Q^j x^{(0)}_{2})\right)\\ x^{(1)} &= \left[ \sum_{j} W_O^j W_V^j (x^{(0)} \cdot A^{j}) \right] + x^{(0)}_{2}\\ \text{MLP} &= \text{ReLU}(W_{\text{in}} x^{(1)})\\ x^{(2)} &= W_{\text{out}} \text{ReLU}(W_{\text{in}} x^{(1)}) + x^{(1)}\\ \text{Logits} &= W_U x^{(2)} \end{align*} \]

Caveat

The authors actually find that \(x^{(2)} = W_{\text{out}} \text{MLP} + x^{(1)}\) can be approximated as \(x^{(2)} \approx W_{\text{out}} \text{MLP}\) (that is, the residual stream addition can be skipped).

Then, we also define \(W_{L}= W_{U}W_{\text{out}}\), as the two matrices compose linearly.
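Putting the notation together, a direct numpy transcription of the forward pass above (a sketch with my own variable names; the usual attention scaling and layer norms are omitted because they do not appear in the equations):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def one_layer_forward(tokens, W_E, pos, heads, W_in, W_out, W_U):
    # tokens: one-hot rows, shape (seq_len, p); pos: position embeddings, shape (seq_len, d)
    # heads: list of dicts holding W_K, W_Q, W_V of shape (d_head, d) and W_O of shape (d, d_head)
    x0 = tokens @ W_E.T + pos                              # x^(0), shape (seq_len, d)

    # attention reads from every position into the final ("=") position
    x1 = x0[-1].copy()
    for h in heads:
        scores = (x0 @ h["W_K"].T) @ (h["W_Q"] @ x0[-1])   # (seq_len,)
        A = softmax(scores)
        x1 += h["W_O"] @ (h["W_V"] @ (x0.T @ A))           # OV circuit applied to A-weighted x^(0)

    mlp = np.maximum(W_in @ x1, 0)                         # ReLU(W_in x^(1))
    x2 = W_out @ mlp + x1                                  # x^(2)
    return W_U @ x2                                        # logits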

Discovered by Power et al. (2022), grokking refers to a phenomenon that arises in small transformers trained on algorithmic tasks, where the test accuracy sharply increases long after the model has achieved perfect training accuracy.

Figure 5: Figure from (Nanda et al. 2022)

Nanda et al. (2022) train 1-layer transformers to perform addition mod \(P\). The input to the model is “\(a\;b\;=\)”, where \(a,b\) are \(P\)-dimensional vectors (they set \(P=113\)).

Periodicity in the Embedding Matrix

They apply a Fourier transform along the input dimension \(p\) of \(W_E\) and compute the \(\ell_2\) norm along the output dimension. They find that the embedding matrix is sparse in the Fourier basis (only 6 big norms), indicating about 6 key frequencies (\(\omega_k\)).

They also do this for attention head activations (2D matrix), MLP activations (2D matrix), and logits (3D matrix), observing that these are sparse in the Fourier basis as well.

A procedure for the logits would probably look like this, where \(T\) is the 3D tensor of logits for each input pair \((a, b)\):

2D FT: \[ T(a,b,\text{logit}) \rightarrow \mathcal{F}[T](\omega_{a}, \omega_{b}, \text{logit}). \] \(\ell_2\)-norm over the logit dimension: \[ \| \mathcal{F}[T](\omega_{a}, \omega_{b}, \cdot) \|_{2}= \sqrt{\sum\limits_{l=0}^{L}|\mathcal{F}[T](\omega_{a}, \omega_{b}, l)|^{2}}. \]
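In numpy, the procedure might look as follows (T is assumed to be a precomputed \((P, P, P)\) array holding the logits for every input pair; an illustration of the analysis, not the authors' code):

import numpy as np

def fourier_norms_of_logits(T):
    # T[a, b, :] holds the logits for the input pair (a, b).
    F_T = np.fft.fft2(T, axes=(0, 1))      # F[T](omega_a, omega_b, logit)
    return np.linalg.norm(F_T, axis=2)     # l2 norm over the logit axis, shape (P, P)

# Sparsity in the Fourier basis means only a handful of (omega_a, omega_b)
# entries of fourier_norms_of_logits(T) carry most of the norm.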

Key takeaway: The embedding matrix seems to bring the inputs into a Fourier basis, producing \(\sin (\omega_{k}a), \cos (\omega_{k}a), \sin (\omega_{k}b), \text{and} \cos (\omega_{k}b)\).

Multiplication with Attention Heads and MLP

They claim that the attention heads and MLP layers compute \(\cos(\omega_{k}(a+b))\) and \(\sin(\omega_{k}(a+b))\) using trig identities: \[ \begin{align*} \cos(\omega_{k}(a+b)) &= \cos(\omega_{k} a) \cos(\omega_{k} b) - \sin(\omega_{k} a) \sin(\omega_{k} b) \\ \sin(\omega_{k}(a+b)) &= \sin(\omega_{k} a) \cos(\omega_{k} b) + \cos(\omega_{k} a) \sin(\omega_{k} b) \end{align*} \]

MLP out and Unembedding

Then, they argue that the \(W_{L} = W_{U}W_{\text{out}}\) matrix composed with \(\text{MLP}\) computes the following trig identity \[ \cos(\omega_{k}(a+b-c)) = \cos(\omega_{k} (a+b)) \cos(\omega_{k} c) - \sin(\omega_{k} (a+b)) \sin(\omega_{k} c). \]
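A quick numeric sanity check of why this mechanism yields the right answer: for each frequency, \(\cos(\omega_{k}(a+b-c))\) is maximized over \(c\) exactly at \(c = (a+b) \bmod P\), and summing over several frequencies sharpens that peak (the frequencies below are arbitrary placeholders, not the trained model's):

import numpy as np

P = 113
a, b = 47, 92
omegas = [2 * np.pi * k / P for k in (14, 35, 41, 42, 52)]   # example frequencies

c = np.arange(P)
# logit(c) ~ sum_k cos(omega_k (a+b)) cos(omega_k c) + sin(omega_k (a+b)) sin(omega_k c)
#          = sum_k cos(omega_k (a + b - c))
logits = sum(np.cos(w * (a + b)) * np.cos(w * c) +
             np.sin(w * (a + b)) * np.sin(w * c) for w in omegas)

print(int(np.argmax(logits)), (a + b) % P)   # both are 26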

Note: check out (Redman et al. 2024) for Koopman connection to Grokking


Matrix Factorization

Matrix factorization, while a very common and useful tool, is not often found as the centerpiece of an interpretability analysis.

The authors use tensor component analysis (TCA) to cluster neurons based on their response to a cue (input) across stages of learning. In summary, they have activities of \(n\) neurons measured for \(t\) time steps for \(c\) cues. As a result, they are able to build a 3D tensor \(A \in \mathbb{R}^{n \times t \times c}\), which they decompose using TCA

\[ A_{ntc} = \sum\limits_{r=1}^{R} w_{n}^{r} b_{t}^{r} a_{c}^{r}, \] with the idea that \(\mathbf{w}^{r}\) is a prototypical firing rate pattern of neurons, \(\mathbf{b}^r\) is a temporal basis function, and \(\mathbf{a}^{r}\) selects for cues/inputs.

The claim is that TCA is a more “natural” extension of PCA, without the orthogonality constraint.
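A sketch of the decomposition using the tensorly library (assuming its parafac routine for CP/TCA fits; the random tensor simply stands in for recorded activity):

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

n_neurons, n_time, n_cues, R = 100, 50, 8, 5
A = np.random.rand(n_neurons, n_time, n_cues)   # A[n, t, c]: activity of neuron n at time t for cue c

# Fit A_ntc ~= sum_r w_n^r b_t^r a_c^r (no orthogonality constraint on the factors)
cp = parafac(tl.tensor(A), rank=R)
W, B, Acue = cp.factors    # neuron loadings, temporal basis functions, cue selectivity

print(W.shape, B.shape, Acue.shape)   # (100, 5) (50, 5) (8, 5)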

Figure 6: Figure from (Williams et al. 2018)

Common Threads

We observe some common qualities that interpretability research looks for: polysemanticity, sparsity, additivity, non-negativity, and assumptions of linearity.

Sparsity: Sparse autoencoders and transcoders are trained to promote sparsity in their latent dimension, and a key step in reverse engineering the grokking algorithm came from observing sparsity in the Fourier basis.

Functional assignment: Gradient-based attribution, dictionary learning, matrix factorization, and reverse engineering (from grokking), all have an element of assigning functionality to neurons or attention heads.

Learning phases: Grokking, the DSB, and the TCA study (along with much of the Koopman-for-deep-learning literature) all suggest some notion of phase changes. Using progress measures derived from the reverse-engineered algorithm, the grokking paper identifies three phases of learning (memorization, circuit formation, cleanup) for that particular transformer; the DSB suggests networks move to higher-order statistics over training; and the TCA study suggests a division of labor.

Figure 7: Interpretability Categorization

References

Belrose, Nora, Quintin Pope, Lucia Quirke, Alex Mallen, and Xiaoli Fern. 2024. “Neural Networks Learn Statistics of Increasing Complexity.” February 13, 2024. http://arxiv.org/abs/2402.04362.
Bereska, Leonard, and Efstratios Gavves. 2024. “Mechanistic Interpretability for AI Safety: A Review.” August 23, 2024. http://arxiv.org/abs/2404.14082.
Bricken, Trenton, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, and Amanda Askell. 2023. “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning.” Transformer Circuits Thread 2. https://transformer-circuits.pub/2023/monosemantic-features.
Dunefsky, Jacob, Philippe Chlenski, and Neel Nanda. 2024. “Transcoders Find Interpretable LLM Feature Circuits.” arXiv.org. June 17, 2024. https://arxiv.org/abs/2406.11944v1.
He, Zhonghao, Jascha Achterberg, Katie Collins, Kevin Nejad, Danyal Akarca, Yinzhu Yang, Wes Gurnee, et al. 2024. “Multilevel Interpretability Of Artificial Neural Networks: Leveraging Framework And Methods From Neuroscience.” August 25, 2024. http://arxiv.org/abs/2408.12664.
Kramár, János, Tom Lieberum, Rohin Shah, and Neel Nanda. 2024. “AtP*: An Efficient and Scalable Method for Localizing LLM Behaviour to Components.” March 1, 2024. https://doi.org/10.48550/arXiv.2403.00745.
Marr, David. 2010. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. MIT press.
McGuire, Kelly L., Oren Amsalem, Arthur U. Sugden, Rohan N. Ramesh, Jesseba Fernando, Christian R. Burgess, and Mark L. Andermann. 2022. “Visual Association Cortex Links Cues with Conjunctions of Reward and Locomotor Contexts.” Current Biology 32 (7): 1563–1576.e8. https://doi.org/10.1016/j.cub.2022.02.028.
Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2022. “Progress Measures for Grokking via Mechanistic Interpretability.” https://openreview.net/forum?id=9XFSbDPmdW.
Power, Alethea, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” January 6, 2022. http://arxiv.org/abs/2201.02177.
Redman, William T., Juan M. Bello-Rivas, Maria Fonoberova, Ryan Mohr, Ioannis G. Kevrekidis, and Igor Mezić. 2024. “Identifying Equivalent Training Dynamics.” June 4, 2024. https://doi.org/10.48550/arXiv.2302.09160.
Refinetti, Maria, Alessandro Ingrosso, and Sebastian Goldt. 2022. “Neural Networks Trained with SGD Learn Distributions of Increasing Complexity.” arXiv.org. November 21, 2022. https://arxiv.org/abs/2211.11567v2.
Selvaraju, Ramprasaath R., Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2020. “Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization.” International Journal of Computer Vision 128 (2): 336–59. https://doi.org/10.1007/s11263-019-01228-7.
Williams, Alex H., Tony Hyun Kim, Forea Wang, Saurabh Vyas, Stephen I. Ryu, Krishna V. Shenoy, Mark Schnitzer, Tamara G. Kolda, and Surya Ganguli. 2018. “Unsupervised Discovery of Demixed, Low-Dimensional Neural Dynamics Across Multiple Timescales Through Tensor Component Analysis.” Neuron 98 (6): 1099–1115.e8. https://doi.org/10.1016/j.neuron.2018.05.015.

Citation

BibTeX citation:
@online{aswani2025,
  author = {Aswani, Nishant},
  title = {A {Brief} {Tour} of {Interpretability}},
  date = {2025-06-27},
  url = {https://nishantaswani.com/articles/interpretability/interpretability.html},
  langid = {en}
}
For attribution, please cite this work as:
Aswani, Nishant. 2025. “A Brief Tour of Interpretability.” June 27, 2025. https://nishantaswani.com/articles/interpretability/interpretability.html.