Exploring the Anatomy of Catastrophic Forgetting

Comparing the neural representations of two CNNs trained on different continual learning algorithms

Nishant Aswani

May 12, 2023

Introduction

See (Kudithipudi et al. 2022) for a list of concepts that make up CL.

A key concept in continual learning (CL), catastrophic forgetting has been of interest since the 1990s. Simply put, it occurs when machine learning models forget knowledge from previous tasks as they learn new ones. In recent literature, it is often discussed in the context of deep networks, usually with the goal of developing training algorithms or architectures that overcome the issue. Comparatively little work has gone into analyzing how catastrophic forgetting manifests itself, or probing how existing algorithms affect model weights.

See (Selvaraju et al. 2017) for a popular example of an explainability method in computer vision based on internal representations.

Two papers that piqued my interest (Ramasesh, Dyer, and Raghu 2020; Davari et al. 2022) approach the task by looking at the neural representations of models. They ask the central question: how does catastrophic forgetting affect the hidden representations of deep networks? It is an important question because internal representations are often also used as the basis for explainability methods. Further, these internal representations are what eventually generate the output, so it is worth taking a look at how they evolve over the course of CL training regimes.

Inspired by these works, this blog post conducts a small-scale comparison between neural representations of models trained on different CL methods. I rely on a generalized framework to employ three distance metrics to compare two popular CL baselines, Learning without Forgetting (Li and Hoiem 2017) and Memory-Aware Synapses (Aljundi et al. 2017).

Before we get to the experiments, let’s first dive into some of the math!

A mini tutorial on distance metrics to compare neural representations

Williams et al. (2021) lay out a framework that generalizes the comparison metrics CCA and the orthogonal Procrustes distance.

In the following derivations, we assume two sets of neural activations \(\mathbf{X} \in \mathbb{R}^{m \times p}\) and \(\mathbf{Y} \in \mathbb{R}^{m \times p}\), where the \(m\) rows hold samples and the \(p\) columns hold neurons. We also assume these two matrices are mean-centered! For an arbitrary matrix \(\mathbf{A}\) whose \(N\) columns \(\mathbf{a}_1, \dots, \mathbf{a}_N\) are data points, centering can be written as \(\mathbf{A}_0 = \mathbf{A} - \frac{1}{N}\mathbf{A}\unicode{x1D7D9}_{N\times N}\) (for our row-major \(\mathbf{X}\) and \(\mathbf{Y}\), the same operation is applied along the sample dimension). Or, if you are a more visual person:

\[ \mathbf{A}_0 = \begin{bmatrix} | & & | \\ \mathbf{a}_1 & ... & \mathbf{a}_N \\ | & & | \\ \end{bmatrix} - \frac{1}{N} \begin{bmatrix} | & & | \\ \mathbf{a}_1 & ... & \mathbf{a}_N \\ | & & | \\ \end{bmatrix} \begin{bmatrix} | & & | \\ 1 & ... & 1 \\ | & & | \\ \end{bmatrix} \]

The Orthogonal Procrustes Distance

The orthogonal Procrustes distance aims to find an orthogonal matrix \(\mathbf{Q^*}\) that transforms \(\mathbf{Y}\) so that it is as close as possible to \(\mathbf{X}\) in the Frobenius norm. Formally:

\[\begin{equation*} \begin{aligned} \mathbf{Q^*} = \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \|\mathbf{X} - \mathbf{YQ}\| \end{aligned} \end{equation*}\]

which is equivalent to minimizing the squared Frobenius norm

\[\begin{equation*} \begin{aligned} \mathbf{Q^*} &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \|\mathbf{X} - \mathbf{YQ}\|^2 \\ &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \operatorname{trace}((\mathbf{X} - \mathbf{YQ})^T(\mathbf{X} - \mathbf{YQ})) \\ &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \operatorname{trace}(\mathbf{X}^T\mathbf{X} - \mathbf{X}^T\mathbf{YQ} - (\mathbf{YQ})^T\mathbf{X} + (\mathbf{YQ})^T\mathbf{YQ}) \\ &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \operatorname{trace}(\mathbf{X}^T\mathbf{X}) - \operatorname{trace}(\mathbf{Q}^T\mathbf{Y}^T\mathbf{X}) - \operatorname{trace}(\mathbf{X}^T\mathbf{YQ}) + \operatorname{trace}(\mathbf{Q}^T\mathbf{Y}^T\mathbf{YQ}) \\ &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \operatorname{trace}(\mathbf{X}^T\mathbf{X}) - 2\operatorname{trace}(\mathbf{X}^T\mathbf{YQ}) + \operatorname{trace}(\mathbf{Q}^T\mathbf{Y}^T\mathbf{YQ}) \\ &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} \operatorname{trace}(\mathbf{X}^T\mathbf{X}) - 2\operatorname{trace}(\mathbf{X}^T\mathbf{YQ}) + \operatorname{trace}(\mathbf{Y}^T\mathbf{Y}) \\ &= \operatorname*{argmin}_{\mathbf{Q} \in \mathbf{O}} -2 \operatorname{trace}(\mathbf{X}^T\mathbf{YQ}) \\ &= \operatorname*{argmax}_{\mathbf{Q} \in \mathbf{O}} \operatorname{trace}(\mathbf{X}^T\mathbf{YQ}) \end{aligned} \end{equation*}\]
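
To make the closed-form solution concrete, here is a minimal NumPy sketch (the function name and interface are mine, not from the cited papers): with the SVD \(\mathbf{X}^T\mathbf{Y} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\), the maximizer is \(\mathbf{Q}^* = \mathbf{V}\mathbf{U}^T\) and the maximized trace equals the sum of the singular values.

import numpy as np

def procrustes_distance(X, Y):
    """Orthogonal Procrustes distance between activation matrices X, Y (m x p).
    A rough sketch: the name and interface are illustrative, not from the papers."""
    # Mean-center each neuron (column) across the m samples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)

    # trace(X^T Y Q) is maximized by Q* = V U^T, where U S V^T = X^T Y,
    # and the maximum equals the sum of singular values (the nuclear norm).
    U, S, Vt = np.linalg.svd(X.T @ Y)
    Q_star = Vt.T @ U.T

    # ||X - Y Q*||_F^2 = ||X||_F^2 + ||Y||_F^2 - 2 * sum(S)
    sq_dist = np.linalg.norm(X, "fro") ** 2 + np.linalg.norm(Y, "fro") ** 2 - 2 * S.sum()
    return np.sqrt(max(sq_dist, 0.0)), Q_star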

Now, we move on to another popular metric!

Canonical Correlations Analysis (CCA)

CCA seeks to find projection vectors \(w_x, w_y \in \mathbb{R}^{p \times 1}\) such that the correlation between \(\mathbf{X}w_x\) and \(\mathbf{Y}w_y\) is maximized. This problem can be expressed as:

\[\begin{equation*} \begin{aligned} & \underset{w_x, w_y}{\text{maximize}} & & \frac{w_{x}^{T} \mathbf{X}^{T} \mathbf{Y} w_{y}}{\sqrt{(w_{x}^{T} \mathbf{X}^{T} \mathbf{X} w_{x}) (w_{y}^{T} \mathbf{Y}^{T} \mathbf{Y} w_{y})}} \end{aligned} \end{equation*}\]

Correlations should be scale-invariant: two vectors should still be perfectly correlated if one of them is simply a scaled version of the other. So, we also impose the constraints \((w_{x}^{T} \mathbf{X}^{T} \mathbf{X} w_{x}) = 1\) and \((w_{y}^{T} \mathbf{Y}^{T} \mathbf{Y} w_{y}) = 1\).

We can do this because dividing the projection \(\mathbf{X} w_{x}\) by its norm, \(\sqrt{w_{x}^{T} \mathbf{X}^{T} \mathbf{X} w_{x}}\), simply rescales the projection so that its squared norm (its dot product with itself) equals 1.

By doing so, we can rewrite the objective above to the following:

\[\begin{equation*} \begin{aligned} & \underset{w_x, w_y}{\text{maximize}} & & w_{x}^{T} \mathbf{X}^{T} \mathbf{Y} w_{y} \\ & \text{subject to} & & w_{x}^{T} \mathbf{X}^{T} \mathbf{X} w_{x} = 1, \\ & & & w_{y}^{T} \mathbf{Y}^{T} \mathbf{Y} w_{y} = 1, \end{aligned} \end{equation*}\]

To avoid reducing the dimensionality of the neural representations, we can extend this problem to find weight matrices \({W}_x, {W}_y \in \mathbb{R}^{p \times p}\), effectively running CCA multiple times. This results in the slightly more complex optimization problem

\[\begin{equation*} \begin{aligned} & \underset{w_x^{(i)}, w_y^{(i)}}{\text{maximize}} & & \sum_{i} (w_{x}^{(i)})^{T} \mathbf{X}^{T} \mathbf{Y} w_{y}^{(i)} \\ & \text{subject to} & & (w_{x}^{(i)})^{T} \mathbf{X}^{T} \mathbf{X} w_{x}^{(i)} = 1, \quad \forall i, \\ & & & (w_{y}^{(i)})^{T} \mathbf{Y}^{T} \mathbf{Y} w_{y}^{(i)} = 1, \quad \forall i, \end{aligned} \end{equation*}\]

where \(w_x^{(i)}\) and \(w_y^{(i)}\) are columns of \(W_x, W_y\). We are looking for multiple transformation vectors that solve the CCA problem. And, for each transformation, we are still ensuring that the variance is equal to 1.

But, once we additionally ask the different projections to be mutually uncorrelated (which is exactly what the identity constraints below encode), that’s really equivalent to:

\[\begin{equation*} \begin{aligned} & \underset{W_x, W_y}{\text{maximize}} & & \text{trace}\left(W_{x}^{T} \mathbf{X}^{T} \mathbf{Y} W_{y}\right) \\ & \text{subject to} & & W_{x}^{T} \mathbf{X}^{T} \mathbf{X} W_{x} = I, \\ & & & W_{y}^{T} \mathbf{Y}^{T} \mathbf{Y} W_{y} = I, \end{aligned} \end{equation*}\]

Although this problem looks very different from the Procrustes distance, the two problems can actually be written in an equivalent form.

Transforming CCA to the Procrustes problem

To convert CCA into the Procrustes problem, we can employ a change of variables:

\[\begin{equation*} \begin{aligned} \tilde{W}_{x} &= (\mathbf{X}^{T} \mathbf{X})^{1/2} W_{x}, \\ \tilde{W}_{y} &= (\mathbf{Y}^{T} \mathbf{Y})^{1/2} W_{y}. \end{aligned} \end{equation*}\]

such that the problem is now:

\[\begin{equation*} \begin{aligned} & \underset{\tilde{W}_x, \tilde{W}_y}{\text{maximize}} & & \text{trace}\left(\tilde{W}_{x}^{T} (\mathbf{X}^{T} \mathbf{X})^{-1/2} \mathbf{X}^{T} \mathbf{Y} (\mathbf{Y}^{T} \mathbf{Y})^{-1/2} \tilde{W}_{y}\right) \\ & \text{subject to} & & \tilde{W}_{x}^{T} \tilde{W}_{x} = I, \\ & & & \tilde{W}_{y}^{T} \tilde{W}_{y} = I, \end{aligned} \end{equation*}\]

We can employ another change of variables to further simplify the form of the problem:

\[\begin{equation*} \begin{aligned} Q &= \tilde{W}_{y} \tilde{W}_{x}^{T}, \\ X^{\phi} &= \mathbf{X} (\mathbf{X}^{T} \mathbf{X})^{-1/2}, \\ Y^{\phi} &= \mathbf{Y} (\mathbf{Y}^{T} \mathbf{Y})^{-1/2}. \end{aligned} \end{equation*}\]

so that it takes the form:

\[\begin{equation*} \begin{aligned} \underset{\mathbf{Q} \in \mathbf{O}}{\text{maximize}} & & \text{trace}\left(\mathbf{X}^{\phi T} \mathbf{Y}^{\phi} \mathbf{Q}\right) \end{aligned} \end{equation*}\]

The two problems are of the same form; the only difference is that CCA solves the problem on the whitened versions, \(X^{\phi}\) and \(Y^{\phi}\), of the input matrices.
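
A quick way to see what this whitening does (a short derivation I am adding for intuition): if \(\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\) is the SVD of a full-rank, centered matrix, then

\[ \mathbf{X}^{\phi} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1/2} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\left(\mathbf{V}\mathbf{\Sigma}^{2}\mathbf{V}^T\right)^{-1/2} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\mathbf{V}\mathbf{\Sigma}^{-1}\mathbf{V}^T = \mathbf{U}\mathbf{V}^T \]

so whitening keeps only the orthonormal “directions” of \(\mathbf{X}\) and discards the per-direction scale in \(\mathbf{\Sigma}\). This is why CCA ends up invariant to any invertible linear transformation of the representations, whereas the Procrustes distance is only invariant to rotations.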

Generalizing to a framework

Returning to the original CCA problem, before any change of variables, Williams et al. (2021) show that by modifying the constraints to introduce a hyperparameter \(\alpha\), one can interpolate between the two metrics:

\[\begin{equation*} \begin{aligned} & \underset{W_x, W_y}{\text{maximize}} & & \text{trace}\left(W_{x}^{T} \mathbf{X}^{T} \mathbf{Y} W_{y}\right) \\ & \text{subject to} & & W_{x}^{T} ((1-\alpha) \mathbf{X}^{T} \mathbf{X} + \alpha I) W_{x} = I, \\ & & & W_{y}^{T} ((1-\alpha) \mathbf{Y}^{T} \mathbf{Y} + \alpha I) W_{y} = I \end{aligned} \end{equation*}\]

Here, if \(\alpha = 0\), the problem is equivalent to CCA. On the other hand, if \(\alpha = 1\), the problem becomes equivalent to the Procrustes problem.

This is the generalized framework I use to compare the neural representations in the experiments! With no particular intuition, I pick \(\alpha = \{0, 0.5, 1\}\).
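
To give a sense of how this can be computed in practice, here is a minimal NumPy sketch of the interpolated metric (the function names are mine, and this is not the authors’ reference implementation): partially whiten each matrix with \(((1-\alpha)\mathbf{X}^T\mathbf{X} + \alpha I)^{-1/2}\), then solve the Procrustes problem on the result and turn the normalized trace into the angular distance used later in the experiments.

import numpy as np

def partial_whiten(X, alpha, eps=1e-8):
    # Form ((1 - alpha) X^T X + alpha I)^{-1/2} via an eigendecomposition of
    # the symmetric matrix X^T X, then apply it on the right.
    evals, evecs = np.linalg.eigh(X.T @ X)
    shrunk = (1.0 - alpha) * evals + alpha
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(shrunk + eps)) @ evecs.T
    return X @ inv_sqrt

def shape_distance(X, Y, alpha):
    """Angular distance between activation matrices X, Y of shape (m, p),
    interpolating between CCA (alpha=0) and Procrustes (alpha=1)."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    Xw, Yw = partial_whiten(X, alpha), partial_whiten(Y, alpha)
    # max_Q trace(Xw^T Yw Q) over orthogonal Q equals the sum of the
    # singular values (nuclear norm) of Xw^T Yw.
    sv = np.linalg.svd(Xw.T @ Yw, compute_uv=False)
    cos_theta = sv.sum() / (np.linalg.norm(Xw, "fro") * np.linalg.norm(Yw, "fro"))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

Calling shape_distance(X, Y, 0.0) gives the CCA-based angle, while shape_distance(X, Y, 1.0) gives the Procrustes angle.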

A couple baselines that mitigate catastrophic forgetting

Now, let’s quickly talk about the continual learning algorithms we will be using in the experiments: Learning without Forgetting and Memory-Aware Synapses.

Learning without Forgetting (LwF)

LwF employs knowledge distillation (KD) (Hinton, Vinyals, and Dean 2015) as the main tool to prevent forgetting. The algorithm assumes a multi-head, task-incremental (task-IL) setting, in which a model can be viewed as having a set of shared parameters \(\theta_s\) and multiple sets of task-specific parameters \(\theta_o\). Concretely, this translates to a CNN’s feature backbone, characterized by \(\theta_s\), and the CNN’s multiple feed-forward classifier heads, collectively characterized by \(\theta_o\).

In each learning session, the goal of the algorithm is to add a new set of task-specific parameters \(\theta_n\), while maintaining \(\theta_o\) and \(\theta_s\) so that the model still performs well on previous tasks. When encountering a new task, prior to doing any learning on it, the new task’s data is passed through the current model (\(\theta_s\) and \(\theta_o\)) to record the old-task responses \(\mathbf{\hat{y}}_o\). Then, as the model learns the new task, the old-task heads’ outputs on that same data are still generated, resulting in \(\mathbf{y}_o\). Although \(\mathbf{\hat{y}}_o\) is very likely to be inaccurate, LwF encourages \(\mathbf{y}_o\) to remain the same as \(\mathbf{\hat{y}}_o\) as the model trains on the new task. To do so, the algorithm uses a variation of the KD loss, which encourages the outputs of the new network to approximate the outputs of the “previous” network:

\[\begin{equation*} \begin{aligned} L_{LwF} &= H(\mathbf{\hat{y}}'_o, \mathbf{y}'_o) \\ &= -\sum^l_i \hat{y}_o^{'i} \log y_o^{'i} \end{aligned} \end{equation*}\]

where \(l\) is the number of labels of the old task(s); the recorded responses \(\mathbf{\hat{y}}'_o\) act as soft targets for the current outputs \(\mathbf{y}'_o\), and the loss is computed over the new task’s samples.

It’s not just a simple cross entropy loss because we are using the modified outputs \(\mathbf{y}'_o, \mathbf{\hat{y}}'_o\), instead of \(\mathbf{y}_o, \mathbf{\hat{y}}_o\):

\[\begin{equation*} \begin{aligned} y_o^{'i} = \frac{(y_o^{i})^{1/T}}{\sum_j(y_o^{j})^{1/T}} \qquad \hat{y}_o^{'i} = \frac{(\hat{y}_o^{i})^{1/T}}{\sum_j(\hat{y}_o^{j})^{1/T}} \end{aligned} \end{equation*}\]
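
To make the loss concrete, here is a hedged PyTorch sketch (the function name, argument names, and default temperature are my own illustrative choices, not the paper’s code). Note that dividing the logits by \(T\) before the softmax is equivalent to raising the probabilities to \(1/T\) and renormalizing, as in the equations above.

import torch
import torch.nn.functional as F

def lwf_distillation_loss(recorded_logits, current_logits, T=2.0):
    # recorded_logits: old-task head outputs recorded before training on the
    # new task (soft targets; no gradient should flow through them).
    # current_logits: the same head's outputs while training on the new task.
    soft_targets = F.softmax(recorded_logits.detach() / T, dim=1)
    log_probs = F.log_softmax(current_logits / T, dim=1)
    # Cross entropy between recorded and current responses, averaged over samples.
    return -(soft_targets * log_probs).sum(dim=1).mean()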

Memory-Aware Synapses (MAS)

MAS is another regularization-based CL technique, focused on identifying and preserving the network parameters relevant for previous tasks while learning a new task. It does this by calculating an importance weight for each parameter in the model with respect to the previous tasks. The importance of a layer’s parameters, \(\Omega_{l}\), is estimated by perturbing the parameters \(\theta_{l} = \{\theta_{ij}\}\) with a small change \(\delta_{l} = \{\delta_{ij}\}\) and measuring how much the squared norm of the layer’s output changes. Let’s denote \(F_l(x^k; \theta_l)\) as the output of layer \(l\) when the input data point \(x^k\) is processed using parameters \(\theta_l\). When we perturb these parameters by a small amount \(\delta_l\), the output changes to \(F_l(x^k; \theta_l + \delta_l)\). The importance is tied to the difference between the squared norms of these two outputs:

\[\begin{equation*} \begin{aligned} \Omega_{l} = \lVert F_l(x^k; \theta_l + \delta_l) \rVert^2_2 - \lVert F_l(x^k; \theta_l) \rVert^2_2 \end{aligned} \end{equation*}\]

But, that can be simplified with the first-order Taylor series approximation:

\[\begin{equation*} \begin{aligned} \Omega_{l} &\approx \lVert F_l(x^k; \theta_l) \rVert^2_2 + \delta_l\frac{\partial{\lVert F_l(x^k; \theta_l)\rVert^2_2}}{\partial{\theta_l}} - \lVert F_l(x^k; \theta_l) \rVert^2_2 \\ &\approx \delta_l\frac{\partial{\lVert F_l(x^k; \theta_l)\rVert^2_2}}{\partial{\theta_l}} \end{aligned} \end{equation*}\]

In other words, the importance boils down to the gradient of the squared output norm with respect to the parameters; MAS takes the magnitude of this gradient, accumulated over the data points, as each parameter’s importance. Then, it’s a matter of adding the following regularization term to the overall loss:

\[\begin{equation*} \begin{aligned} L_{MAS} = \sum_l \Omega_{l}(\theta_l - \theta_l^*)^2 \end{aligned} \end{equation*}\]

where \(\theta_l^*\) are the “optimal” old parameters from the previous task.
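
As a rough PyTorch sketch of how this could look in practice (the function names, the per-parameter bookkeeping, and the \(\lambda\) weighting are my own illustrative choices rather than the official MAS implementation):

import torch

def mas_importance(model, data_loader, device="cpu"):
    # Accumulate the magnitude of the gradient of the squared L2 norm of the
    # model's output with respect to each parameter, averaged over the data.
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    n_batches = 0
    for x, _ in data_loader:  # labels are not needed
        model.zero_grad()
        out = model(x.to(device))
        out.pow(2).sum().backward()  # squared output norm, as in the Omega derivation
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.abs()
        n_batches += 1
    return {n: imp / max(n_batches, 1) for n, imp in importance.items()}

def mas_penalty(model, importance, old_params, lam=1.0):
    # Regularization term added to the new-task loss: penalize squared drift
    # of each parameter, weighted by its estimated importance.
    penalty = 0.0
    for n, p in model.named_parameters():
        penalty = penalty + (importance[n] * (p - old_params[n]) ** 2).sum()
    return lam * penalty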

With the necessary background out of the way, we can (finally) move onto the experimental setup.

Setting up the experiment

I trained three multi-head VGG networks: one using the LwF strategy, one using the MAS strategy, and one without any CL strategy (the “Naive” strategy). The trainset, consisting of 100k images from 200 classes, was split into 10 tasks, each made up of 10k images from 20 classes. Similarly, the testset, consisting of 10k images from the same 200 classes, was split into 10 tasks, resulting in 1k images from 20 classes per experience. A minimal sketch of this kind of class-based split is shown below.
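
The sketch assumes a contiguous, class-ordered assignment of 20 classes per task, which may not match the exact split used in the experiments; the function name and interface are illustrative.

from collections import defaultdict
from torch.utils.data import Subset

def split_into_tasks(dataset, num_tasks=10, classes_per_task=20):
    """Sketch: partition a 200-class dataset into 10 disjoint 20-class tasks.
    Assumes dataset[i] returns (image, label) with integer labels 0..199."""
    indices_per_task = defaultdict(list)
    for idx in range(len(dataset)):
        _, label = dataset[idx]
        indices_per_task[label // classes_per_task].append(idx)
    return [Subset(dataset, indices_per_task[t]) for t in range(num_tasks)]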

The models were trained using stochastic gradient descent (SGD) with 0.9 momentum and a constant \(1\times10^{-3}\) learning rate. After learning each task, a snapshot of the model, \(M_{task}^{strat}\), was saved, resulting in 30 snapshots (10 snapshots for each of the three approaches). For example, \(M_{2}^{LwF}\) is a CNN incrementally trained, using the LwF algorithm, up to task 2.

After training, each snapshot was evaluated on all experiences. As a result, many models were evaluated on tasks that they never learned on. To keep the experiment computationally reasonable, each evaluation was limited to 1k images from the 20 classes. During each evaluation, I hooked into the intermediate layers of the model to store the neural activations, \(\mathbf{X}_{t}^{l}\), where \(t\) is the task and \(l\) is the layer.
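
For anyone reproducing this, activations like \(\mathbf{X}_{t}^{l}\) can be captured with PyTorch forward hooks; here is a minimal sketch (the layer-name lookup, flattening choice, and interface are illustrative assumptions, not the exact code used):

import torch

def collect_activations(model, layer_names, loader, device="cpu"):
    # Register forward hooks on the named sub-modules and stack each layer's
    # activations into one (num_samples, num_features) matrix.
    store = {name: [] for name in layer_names}
    modules = dict(model.named_modules())
    handles = []
    for name in layer_names:
        def hook(module, inputs, output, name=name):
            store[name].append(output.detach().flatten(start_dim=1).cpu())
        handles.append(modules[name].register_forward_hook(hook))

    model.eval()
    with torch.no_grad():
        for x, _ in loader:
            model(x.to(device))

    for h in handles:
        h.remove()
    return {name: torch.cat(chunks, dim=0) for name, chunks in store.items()}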

Then, holding the layer fixed, I fit the neural activations from different strategies, across all experiences, by solving the problem:

\[\begin{equation*} \begin{aligned} & \underset{W_x, W_y}{\text{maximize}} & & \text{trace}\left(W_{x}^{T} (\mathbf{X}^l_t)^{T} \mathbf{Y}^l_t W_{y}\right) \\ & \text{subject to} & & W_{x}^{T} ((1-\alpha) (\mathbf{X}^l_t)^{T} \mathbf{X}^l_t + \alpha I) W_{x} = I, \\ & & & W_{y}^{T} ((1-\alpha) (\mathbf{Y}^l_t)^{T} \mathbf{Y}^l_t + \alpha I) W_{y} = I \end{aligned} \end{equation*}\]

to obtain the transformation matrices \(W_{x}, W_{y}\).

Then, to get a distance score, I compute:

\[\begin{equation*} d(\mathbf{X}^l_t, \mathbf{Y}^l_t) = \text{arccos}\left(\frac{\langle \mathbf{X}^l_tW_{x} , \mathbf{Y}^l_tW_{y} \rangle }{\|\mathbf{X}^l_tW_{x}\| \|\mathbf{Y}^l_tW_{y}\|}\right) \end{equation*}\]

Take a look!

Heatmap comparisons of layers

Given the not-so-great hyperparameter selection, the models did not do a great job of learning. So, I ran the experiments on 1k samples from both the train and test datasets, expecting that the activations would differ, but they do not appear to.

The heatmaps are all generated with the final model, trained on all experiences, for each approach. I picked layers 3, 6, 8, 11, and 13 from the VGG to do the comparisons. And, as described earlier, you can see how the distances change based on the distance method we use. As a reminder, \(\alpha = 0\) is CCA, \(\alpha = 0.5\) is a ridge regularized CCA, and \(\alpha = 1\) is the Procrustes distance.

Dissecting the heatmaps

From the heatmaps, we see (and probably expected) that the two algorithms lead to neural activations that are relatively similar (leftmost heatmap).

Generally, the MAS algorithm leads to neural representations that are the most different from the naive implementation, as shown by its heatmap being darker across the grid. Under the assumption that the naive implementation is the worst outcome, this might imply that MAS is a more effective algorithm at preventing forgetting; after all, it’s the furthest away from the worst outcome.

LwF FTW!

Plotting the accuracy of each final model across all tasks, LwF is clearly the superior approach! It seems like MAS has led to internal representations that have diverged in a direction that harms performance.

Code
labels = ["Naive", "Memory-Aware Synapses (MAS)", "Learning without Forgetting (LwF)"]
colors = d3.scaleOrdinal(labels, d3.schemeCategory10)
Swatches(colors)
Code
{
  const data = {
    naive_acc: [0.0390, 0.0270, 0.0520, 0.0250, 0.0550, 0.0470, 0.0510, 0.0840, 0.0740, 0.4300],
    mas_acc: [0.3850, 0.3450, 0.3390, 0.3380, 0.4140, 0.3530, 0.3290, 0.3830, 0.4500, 0.4310],
    lwf_acc: [0.5050, 0.5030, 0.5400, 0.4690, 0.5200, 0.4790, 0.4030, 0.4200, 0.4860, 0.4310]
  };

  const nameMapper = {
    ["naive_acc"]: labels[0],
    ["mas_acc"]: labels[1],
    ["lwf_acc"]: labels[2]
  };

  // const colors = { naive_acc: 'blue', mas_acc: 'red', lwf_acc: 'green' };

  const width = 900, height = 400;
  const margin = ({top: 20, right: 20, bottom: 40, left: 50});

  const svg = d3.create("svg")
      .attr("width", width)
      .attr("height", height);

  const xScale = d3.scaleLinear()
      .domain([0, 9])
      .range([margin.left, width - margin.right]);

  const yScale = d3.scaleLinear()
      .domain([0, 1])
      .range([height - margin.bottom, margin.top]);

  const line = d3.line()
      .defined(d => !isNaN(d))
      .x((d, i) => xScale(i))
      .y(d => yScale(d));

  for (const name in data) {
    const datum = data[name];
    svg.append("path")
      .datum(datum)
      .attr("fill", "none")
      .attr("stroke", colors(nameMapper[name]))
      .attr("stroke-width", 1.5)
      .attr("d", line);
  }

  svg.append("g")
      .attr("transform", `translate(0,${height - margin.bottom})`)
      .call(d3.axisBottom(xScale).ticks(9).tickFormat(i => i + 1));

  svg.append("g")
      .attr("transform", `translate(${margin.left},0)`)
      .call(d3.axisLeft(yScale));

  svg.append("text")
      .attr("transform", `translate(${width / 2},${height - 5})`)
      .style("text-anchor", "middle")
      .style("font-size", "12px")
      .text("Experience");

  svg.append("text")
      .attr("transform", "rotate(-90)")
      .attr("y", margin.left - 35)
      .attr("x", -(height / 2))
      .style("text-anchor", "middle")
      .style("font-size", "12px")
      .text("Accuracy");

  svg.append("text")
      .attr("x", width / 2)
      .attr("y", margin.top)
      .attr("text-anchor", "middle")
      // .style("text-decoration", "underline")
      .text("Final Accuracy on All Tasks");

  return svg.node();
}
Where does forgetting happen?

Another interesting phenomenon is how the heatmaps change across different selections of \(\alpha\). Surprisingly, Ridge CCA and Procrustes distance (regularized metrics) lead to the same heatmaps. (Although, this might be an experimental error.)

However, for earlier layers of the model, the heatmaps from the regularized metrics are noticeably lighter than those produced by CCA. And, for later layers, the regularized metrics are significantly darker! So, the regularized metrics draw a sharper distinction between representations across the layers. According to that view, the changes in internal representations across algorithms are more concentrated in later layers, suggesting that forgetting is largely a result of changes in later layers.

Looking at layers 6, 8, and 11 with the regularized metrics, there seems to be a noticeable shift in internal representations between the CL algorithms and the naive approach at experience 5. This could signify that task 5 is significantly different from previously seen experiences. In that case, an analysis like this might help determine which tasks are “problematic” or difficult to learn. Or, it might signal that the model is running up against its capacity.

Visualizing the class activation maps (CAMs)

While looking at CAMs isn’t a rigorous analysis, it’s still fun!

Scrolling through the experiences and different layers, the CAMs tell a similar story to the heatmaps. The earlier layers produce CAMs that are visually similar, while the later layers have more obvious differences.

Most CAMs produced by the naive approach have far fewer “bright spots”: the CAMs are darker and sparser. On the other hand, the CAMs produced by MAS are much brighter and more activated than the LwF CAMs. This seems to be more pronounced in the later layers.

What might be next!

Building thousands of saliency maps and then running distance metrics on them is computationally expensive, so it was out of the scope of this project. But it seems like the natural next step!

Do comparisons with regularized metrics between the CAMs of different algorithms tell the same story as the heatmaps we saw?

Is there something we can learn from this to design a better CL algorithm that improves saliency AND prevents forgetting?

If we did this for several CL algorithms, could we somehow cluster them by how similar their internal representations are? Could we do this by layer? What about for different architectures?

Last updated

2024-05-10

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository. Suggestions are appreciated!

Reuse

Generated text and figures are licensed under Creative Commons Attribution CC BY 4.0. The raw article and its contents are available on GitHub, unless otherwise noted. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: ‘Figure from …’

References

Aljundi, Rahaf, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2017. “Memory Aware Synapses: Learning What (Not) to Forget.” arXiv Preprint arXiv:1711.09601.
Davari, MohammadReza, Nader Asadi, Sudhir Mudur, Rahaf Aljundi, and Eugene Belilovsky. 2022. “Probing Representation Forgetting in Supervised and Unsupervised Continual Learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16712–21.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531.
Kudithipudi, Dhireesha, Mario Aguilar-Simon, Jonathan Babb, Maxim Bazhenov, Douglas Blackiston, Josh Bongard, Andrew P Brna, et al. 2022. “Biological Underpinnings for Lifelong Learning Machines.” Nature Machine Intelligence 4 (3): 196–210.
Li, Zhizhong, and Derek Hoiem. 2017. “Learning Without Forgetting.” IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12): 2935–47.
Ramasesh, Vinay V, Ethan Dyer, and Maithra Raghu. 2020. “Anatomy of Catastrophic Forgetting: Hidden Representations and Task Semantics.” arXiv Preprint arXiv:2007.07400.
Selvaraju, Ramprasaath R, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. “Grad-Cam: Visual Explanations from Deep Networks via Gradient-Based Localization.” In Proceedings of the IEEE International Conference on Computer Vision, 618–26.
Williams, Alex H, Erin Kunz, Simon Kornblith, and Scott Linderman. 2021. “Generalized Shape Metrics on Neural Representations.” Advances in Neural Information Processing Systems 34: 4738–50.

Citation

BibTeX citation:
@online{aswani2023,
  author = {Aswani, Nishant},
  title = {Exploring the {Anatomy} of {Catastrophic} {Forgetting}},
  date = {2023-05-12},
  url = {https://nishantaswani.com/articles/anatomy.html},
  langid = {en}
}
For attribution, please cite this work as:
Aswani, Nishant. 2023. “Exploring the Anatomy of Catastrophic Forgetting.” May 12, 2023. https://nishantaswani.com/articles/anatomy.html.