Discussing PCA and Linear Autoencoders

An attempt at reproducing PCA with a linear autoencoder and how that might mean something about learning.

Nishant Aswani

Published May 15, 2024

Note: A lot of this article is my written understanding of this amazing talk (Broad Institute 2019). The researchers do a great job of whiteboarding their paper and fielding questions!

Another Note: This article is incomplete; it ends very abruptly. Even without a satisfying conclusion, there are still things in here that were new to me. So, I’m pushing this out, otherwise it will never see the light of day. If you want to help me finish this off, I’m looking for a “so what?…”

A Motivation

Most of deep learning relies heavily on backprop(agation). At its core, backprop is an effective method for solving the ‘credit assignment’ problem in artificial networks. Using some clever calculus, the algorithm computes, layer by layer, a solution to the credit assignment problem. Most interestingly, backprop does this at a cost roughly similar to that of a forward pass through the network.

To illustrate backprop’s low cost, let’s assume we have some neural network and an appropriate cost function \(C(y, \hat{y})\) that measures the performance of our network. Now, we want to build a measure of how much credit to assign to a given neuron \(j\) in layer \(l\). To do so, we define \(z_j^l\) as the input to the \(j\)th neuron in the \(l\)th layer. We can reflect, in our cost function, that the network’s output depends on this variable (which lives somewhere inside the network), giving us \(C(y, \hat{y}(z_j^l))\).

If we slightly perturb the input to this neuron by adding some noise \(\Delta z\), we can expect our cost function to output a different value than when there is no injected noise. And, we can compute this change: \[ C(y, \hat{y}(z_j^l+\Delta z)) - C(y, \hat{y}(z_j^l)) \]

Pausing here, what might this difference reflect? I think of it as some heuristic of ‘robustness’ of the neuron. Others call it the “error of the neuron” (Nielsen 2015). Some continual learning researchers have used this as a measure of “importance”(Aljundi et al. 2017). Either way, if our neuron is ‘optimal’, injecting some noise at the input of the neuron should not have much of an effect on the cost. On the other hand, if the neuron is ‘sub-optimal’, then a small change in input might result in a drastically different output (Nielsen 2015).

Rearranging a first-order Taylor-series approximation of our post-perturbation cost, we can describe the neuron’s ‘error’, \(\delta^l_j\), more concretely.

\[ \begin{align*} C(y, \hat{y}(z_j^l+\Delta z)) &\approx C(y, \hat{y}(z_j^l)) + \Delta z \frac{\partial C(y, \hat{y}(z_j^l))}{\partial z_j^l} \\ C(y, \hat{y}(z_j^l+\Delta z)) - C(y, \hat{y}(z_j^l)) &\approx \Delta z \frac{\partial C(y, \hat{y}(z_j^l))}{\partial z_j^l} \end{align*} \]

Particularly, \(\frac{\partial C}{\partial z_j^l}\) is the heuristic we are keen on. If this value is low, a \(\Delta z\) won’t significantly affect the cost. If it is large, a \(\Delta z\) can significantly change the cost. So, we take this as our heuristic and set \(\delta^l_j = \frac{\partial C}{\partial z_j^l}\).
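As a quick numerical illustration of this heuristic, here is a minimal sketch (the single sigmoid neuron, squared-error cost, and all the values below are my own toy example, not something from the talk) comparing the actual change in cost under a small perturbation against the first-order estimate \(\Delta z \frac{\partial C}{\partial z}\):

```python
import torch

torch.manual_seed(0)

# Toy setup: one neuron's pre-activation z feeding a squared-error cost
y = torch.tensor(0.7)
z = torch.tensor(0.3, requires_grad=True)

cost = (y - torch.sigmoid(z)) ** 2
cost.backward()                          # populates z.grad = dC/dz

dz = 1e-3                                # a small perturbation, Delta z
with torch.no_grad():
    cost_perturbed = (y - torch.sigmoid(z + dz)) ** 2

print((cost_perturbed - cost).item())    # actual change C(z + dz) - C(z)
print((dz * z.grad).item())              # first-order estimate dz * dC/dz
```

The two printed numbers should agree to several decimal places, which is exactly what the Taylor approximation above promises for small \(\Delta z\).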

We want to express this neuron’s error in terms of the layer sitting ahead of it, \(l+1\); that is where the signal is backpropagating from. In doing so, we might learn something about the cost of this operation for a single layer \(l\).

\[ \begin{align*} \delta^l_j &= \frac{\partial C}{\partial z_j^l} \\ &= \sum_{k}\frac{\partial C}{\partial z_k^{l+1}} \frac{\partial z_k^{l+1}}{\partial z_j^l} \\ &= \sum_{k} \delta^{l+1}_k \frac{\partial z_k^{l+1}}{\partial z_j^l} \end{align*} \]

Rewriting the result above with respect to weights and biases, we arrive at:

\[ \begin{align*} \delta^l_j &= \sum_{k} \delta^{l+1}_k \frac{\partial \left( \sum_i w^{l+1}_{ki} \sigma(z_i^l) + b_k^{l+1} \right) }{\partial z_j^l} \\ &= \sum_{k} \delta^{l+1}_k w^{l+1}_{kj} \sigma'(z_j^l) \\ &= {\mathbf{w}^{l+1}_{j}}^T \mathbf{\delta}^{l+1} \sigma'(z_j^l) \end{align*} \]

where \(\mathbf{w}^{l+1}_{j}\) denotes the \(j\)th column of \(\mathbf{W}^{l+1}\).

Vectorizing this, to obtain the error for every neuron in layer \(l\):

\[ \begin{align*} \mathbf{\delta}^l &= \left( {\mathbf{W}^{l+1}}^T \mathbf{\delta}^{l+1} \right) \odot \sigma'(\mathbf{z}^l) \end{align*} \]

So, to compute the error for layer \(l\), we need the error of layer \(l+1\) to “travel along” the transpose of the weight matrix of layer \(l+1\). Not only is this a cheap operation, it’s also a bit surprising: the error calculation uses exactly the same weights as the forward pass.
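To double-check the recursion (and the claim that the backward pass reuses the forward weights), here is a minimal PyTorch sketch with made-up layer sizes, comparing the hand-computed \(\mathbf{\delta}^l\) against autograd’s \(\frac{\partial C}{\partial \mathbf{z}^l}\):

```python
import torch

torch.manual_seed(0)

# Toy two-layer slice: layer l (3 sigmoid units) -> layer l+1 (2 linear output units)
W_l, b_l = torch.randn(3, 4), torch.randn(3)        # weights into layer l
W_next, b_next = torch.randn(2, 3), torch.randn(2)  # weights into layer l+1
x, y = torch.randn(4), torch.randn(2)

# Forward pass, keeping the pre-activations z^l and z^{l+1}
z_l = (W_l @ x + b_l).requires_grad_(True)
a_l = torch.sigmoid(z_l)
z_next = W_next @ a_l + b_next
z_next.retain_grad()
cost = 0.5 * torch.sum((y - z_next) ** 2)           # squared-error cost on the linear output
cost.backward()

delta_next = z_next.grad                            # delta^{l+1}, straight from autograd

# delta^l = ((W^{l+1})^T delta^{l+1}) elementwise sigma'(z^l)
sigma_prime = (a_l * (1 - a_l)).detach()
delta_l = (W_next.T @ delta_next) * sigma_prime

print(torch.allclose(delta_l, z_l.grad))            # expect True
```

The hand-computed \(\mathbf{\delta}^l\) uses only `W_next.T`, i.e., the transpose of the forward weights, and it matches what autograd produces.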

From a machine learning perspective, this is great news! Backprop is “symmetric” and computationally forgiving. However, for neuroscience, it is precisely this symmetry that casts doubt on backprop as a biologically viable mechanism: in the brain, it would imply that the feedback connections somehow have exactly the same “weights” as the feedforward connections. This seems like an unreasonable constraint, since the feedforward and feedback connections are made up of separate sets of neurons (Bengio et al. 2015; Lillicrap et al. 2020).

Figure from ‘The Office.’

Unfolding backpropagation

The figure above proposes an interesting reframing of the algorithm. Conceptually, in the feedforward phase we travel from one weight space to another, while backprop lets us travel backwards along this exact path to “reconstruct” the weight space for updates. In this folded view, weight symmetry is not an issue. But, if we “unfold” backpropagation, we are left with an architecture that uses two distinct sets of neurons for the reconstruction, resembling an autoencoder.

Do autoencoders learn symmetric weights on both sets of connections? Can we use autoencoders to study a biologically plausible backpropagation alternative?

Linear Autoencoders (LAE) and PCA

Autoencoders

Autoencoders recreate the input at the output. But teaching a neural network to exactly copy its input would lead to learning the identity function, which would hardly be useful.

Instead, autoencoders are somehow restricted so that they can only approximately copy the input, as well as only recreate inputs similar to their training data. In doing so, autoencoders are capable of learning some meaningful properties, known as latent features, of the input data (Goodfellow, Bengio, and Courville 2016).

An autoencoder consists of:

  • An encoder: the part of the network that compresses the high-dimensional input into a lower-dimensional representation. This is the part that does “dimensionality reduction.”

  • A decoder: the part of the network that decompresses the low-dimensional representation back into the original dimensions.

We deliberately constrain a network, forcing it to learn a compressed representation of the input. It’s a classic example of bottlenecking, which is incredibly common in deep learning models!

Mathematical Representation of Autoencoders

If we represent the encoder as \(f_\theta(\cdot)\) and the decoder as \(g_\phi(\cdot)\), then the autoencoder’s reconstruction output is \(\mathbf{x'} = g_\phi(f_\theta(\mathbf{x}))\), where \(\mathbf{x}\) is the input.

Training an autoencoder involves reducing the error between the input \(\mathbf{x}\) and the reconstructed output \(\mathbf{x'}\):

\[ L(\mathbf{x}, g_\phi(f_\theta(\mathbf{x}))) = L(\mathbf{x}, \mathbf{x'}) \]

A very simple loss function for the vanilla autoencoder is the mean-squared error: \[ L(\theta, \phi) = \frac{1}{n}\sum_{i=1}^{n} \parallel \mathbf{x}_i - g_\phi(f_\theta(\mathbf{x}_i)) \parallel^2 \]
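To make the notation concrete, here is a minimal numpy sketch (the dimensions and the random, untrained linear maps are made up for illustration) of the composition \(g_\phi(f_\theta(\mathbf{x}))\) and the mean-squared reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)

d, p, n = 20, 4, 100             # input dim, latent dim, number of samples
X = rng.normal(size=(n, d))      # rows are the inputs x_i

W_enc = rng.normal(size=(p, d))  # a (random, untrained) linear encoder f_theta
W_dec = rng.normal(size=(d, p))  # a (random, untrained) linear decoder g_phi

def f_theta(x):                  # encoder: d -> p
    return W_enc @ x

def g_phi(h):                    # decoder: p -> d
    return W_dec @ h

# Mean-squared reconstruction error L(theta, phi)
X_prime = np.array([g_phi(f_theta(x)) for x in X])
loss = np.mean(np.sum((X - X_prime) ** 2, axis=1))
print(loss)
```

Training would consist of adjusting `W_enc` and `W_dec` (and, in general, biases) to drive this loss down.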

Replicating PCA

The vanilla autoencoder can be used as a dimensionality reduction method that can outperform ‘traditional’ dimensionality reduction methods such as principal component analysis (PCA) or multi-dimensional scaling (MDS) for complex tasks (Fournier and Aloise 2019).

Or, in a special case, the autoencoder can mimic PCA: if the autoencoder does not use any non-linear activation functions, it performs an operation very similar to PCA.

Why bother looking at PCA at all? Well, PCA satisfies exactly the property we are after: the decoder’s weights are the transpose of, i.e., symmetric to, the encoder’s.

PCA Recap

Let’s say we have a matrix \(\mathbf{Y} \in \mathbb{R}^{n\times N}\), whose \(N\) columns are our data points \(\mathbf{y}_i \in \mathbb{R}^n\) \[ \mathbf{Y} = \begin{bmatrix} | & & | \\ \mathbf{y}_1 & ... & \mathbf{y}_N \\ | & & | \\ \end{bmatrix} \]

The mean-centered matrix is \(\mathbf{Y}_0 = \mathbf{Y} - \frac{1}{N}\mathbf{Y}\unicode{x1D7D9}_{N\times N}\) \[ \mathbf{Y}_0 = \begin{bmatrix} | & & | \\ \mathbf{y}_1 & ... & \mathbf{y}_N \\ | & & | \\ \end{bmatrix} - \frac{1}{N} \begin{bmatrix} | & & | \\ \mathbf{y}_1 & ... & \mathbf{y}_N \\ | & & | \\ \end{bmatrix} \begin{bmatrix} | & & | \\ 1 & ... & 1 \\ | & & | \\ \end{bmatrix} \]
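As a quick sanity check (the dimensions below are arbitrary), multiplying by the \(N\times N\) all-ones matrix and scaling by \(\frac{1}{N}\) is the same as subtracting the mean data point from every column:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 5, 8
Y = rng.normal(size=(n, N))            # columns are data points y_i

ones = np.ones((N, N))                 # the N x N all-ones matrix
Y0 = Y - (1.0 / N) * Y @ ones          # Y_0 = Y - (1/N) Y 1_{NxN}

# Equivalent: subtract the mean of each row (i.e., the mean data point)
Y0_alt = Y - Y.mean(axis=1, keepdims=True)
print(np.allclose(Y0, Y0_alt))         # expect True
```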

PCA can be used as a dimensionality reduction method, where we first encode the data with a function \(f(\cdot)\), then to recover the original data, we can decode with another linear transformation \(g(\cdot)\), equivalent to our discussion of autoencoders above.

What’s so PCA about this? Well, PCA introduces a couple of assumptions or constraints that make it a unique solution.

  • The transformation is linear.
  • The principal components are orthogonal.

\[ r(\mathbf{Y_0}) = g(f(\mathbf{Y}_0)) \]

In (Goodfellow, Bengio, and Courville 2016), the authors show that if we were to use \(\mathbf{D} \in \mathbb{R}^{n\times l}\) as the decoding matrix in place of \(g(\cdot)\), then the optimal encoding matrix is \(\mathbf{D}^T\) (in place of \(f(\cdot)\)). This blogger has an even clearer derivation of it.

The PCA reconstruction then looks like: \[ r(\mathbf{Y_0}) = \mathbf{D}\mathbf{D}^T\mathbf{Y}_0 \]

To learn the optimal matrix \(\mathbf{D}\), we can minimize the \(L^2\) distance between the input and the reconstruction \[ \operatorname*{arg\,min}_{D} \parallel \mathbf{Y}_0 - \mathbf{D}\mathbf{D}^T\mathbf{Y}_0 \parallel_F \]

To make things ‘nicer’, we can switch to the squared Frobenius norm. \[ \operatorname*{arg\,min}_{D} \parallel \mathbf{Y}_0 - \mathbf{D}\mathbf{D}^T\mathbf{Y}_0 \parallel_F^2 \]

The optimization above doesn’t have a unique solution. So, what really ‘makes’ it PCA is the constraint that \(\mathbf{D}\) must have orthonormal columns, leading to the formulation: \[ \operatorname*{arg\,min}_{D} \parallel \mathbf{Y}_0 - \mathbf{D}\mathbf{D}^T\mathbf{Y}_0 \parallel_F^2 \qquad \text{such that} \quad \mathbf{D}^T\mathbf{D} = \mathbf{I}_{l\times l} \]

According to (Shlens 2014), the “true” reason we set this assumption is because “there exists an efficient, analytical solution to this problem.” What that means is that the eigendecomposition and SVD methods, typically used to solve this problem, can exploit the orthonormality of the transformation matrix to provide a calculable solution.
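Here is a short numpy sketch of that analytical route (again with made-up dimensions): take the top \(l\) left singular vectors of the mean-centered data as \(\mathbf{D}\), confirm the orthonormality constraint, and form the reconstruction \(\mathbf{D}\mathbf{D}^T\mathbf{Y}_0\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, l = 10, 200, 3
Y = rng.normal(size=(n, N))
Y0 = Y - Y.mean(axis=1, keepdims=True)           # mean-centered data

# Analytical solution: D = top-l left singular vectors of Y_0
U, S, Vt = np.linalg.svd(Y0, full_matrices=False)
D = U[:, :l]

print(np.allclose(D.T @ D, np.eye(l)))           # orthonormality constraint D^T D = I
recon = D @ D.T @ Y0                             # r(Y_0) = D D^T Y_0
print(np.linalg.norm(Y0 - recon, ord='fro')**2)  # the objective being minimized
```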

So, Autoencoders

For a single input \(\mathbf{y}_i\), the hidden layer can be described as \[ \mathbf{x}_i = \mathbf{W}_1\mathbf{y}_i + \mathbf{b}_1 \]

And, the output layer as \[ \mathbf{\bar{y}}_i = \mathbf{W}_2\mathbf{x}_i + \mathbf{b}_2 \]

Setting up the loss function with the MSE loss, we get: \[ J = \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) - \mathbf{b}_2\mathbf{u}^T \parallel_F^2 \]

where \(\mathbf{W}_1 \in \mathbb{R}^{p\times n}\) and \(\mathbf{W}_2 \in \mathbb{R}^{n\times p}\) are the weight matrices of the neural network, \(\mathbf{b}_1 \in \mathbb{R}^{p\times 1}\) and \(\mathbf{b}_2 \in \mathbb{R}^{n\times 1}\) are the bias vectors, \(\mathbf{u}\) is an \(N\times 1\) vector of all ones (so a term like \(\mathbf{b}_1\mathbf{u}^T\) broadcasts the bias across all \(N\) samples), and \(\mathbf{Y} \in \mathbb{R}^{n\times N}\) is the stack of all input vectors.

If we take the partial derivative of the loss function, with respect to \(\mathbf{b}_2\)

\[ \begin{align*} \frac{\partial J}{\partial \mathbf{b}_2} &= -2( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) - \mathbf{b}_2\mathbf{u}^T ) \mathbf{u} \\ &= -2( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) ) \mathbf{u} + 2\mathbf{b}_2\mathbf{u}^T\mathbf{u} \\ &= -2( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) ) \mathbf{u} + 2\mathbf{b}_2N \end{align*} \]

Setting the derivative equal to zero, to find the optimal bias vector:

\[ \begin{align*} 0 &= -2( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) ) \mathbf{u} + 2\mathbf{b}_2N \\ -2\mathbf{b}_2N &= -2( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) ) \mathbf{u} \\ \mathbf{b}_2 &= \frac{1}{N}( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) )\mathbf{u} \end{align*} \]

Plugging this solution back into the original loss function: \[ J = \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) - \frac{1}{N}( \mathbf{Y} - \mathbf{W}_2(\mathbf{W}_1\mathbf{Y} + \mathbf{b}_1\mathbf{u}^T) )\mathbf{u}\mathbf{u}^T \parallel_F^2 \]

Rearranging to put the like terms together: \[ \begin{align*} J &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{W}_1\mathbf{Y} + \frac{1}{N}(\mathbf{W}_2\mathbf{W}_1\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{b}_1\mathbf{u}^T + \frac{1}{N}(\mathbf{W}_2\mathbf{b}_1\mathbf{u}^T)\mathbf{u}\mathbf{u}^T ) \parallel_F^2 \\ &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{W}_1\mathbf{Y} + \frac{1}{N}(\mathbf{W}_2\mathbf{W}_1\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{b}_1\mathbf{u}^T + \frac{1}{N}(\mathbf{W}_2\mathbf{b}_1)N\mathbf{u}^T ) \parallel_F^2 \\ &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{W}_1\mathbf{Y} + \frac{1}{N}(\mathbf{W}_2\mathbf{W}_1\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{b}_1\mathbf{u}^T + \mathbf{W}_2\mathbf{b}_1\mathbf{u}^T ) \parallel_F^2 \\ &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{W}_1\mathbf{Y} + \frac{1}{N}(\mathbf{W}_2\mathbf{W}_1\mathbf{Y})\mathbf{u}\mathbf{u}^T ) \parallel_F^2 \\ &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( -\mathbf{W}_2\mathbf{W}_1\mathbf{Y} + \mathbf{W}_2\mathbf{W}_1\frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) \parallel_F^2 \\ &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) + ( \mathbf{W}_2\mathbf{W}_1 ( -\mathbf{Y} + \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) ) \parallel_F^2 \\ &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1, \mathbf{b}_2} \parallel ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) - ( \mathbf{W}_2\mathbf{W}_1 ( \mathbf{Y} - \frac{1}{N}(\mathbf{Y})\mathbf{u}\mathbf{u}^T ) ) \parallel_F^2 \end{align*} \]

Subtracting the term \(\frac{1}{N}\mathbf{Y}\mathbf{u}\mathbf{u}^T\) from the original matrix \(\mathbf{Y}\) returns the mean-centered matrix (what we called \(\mathbf{Y}_0\) earlier), so it can be written as \(\bar{\mathbf{Y}}\)

\[ \begin{align*} J &= \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2} \parallel \bar{\mathbf{Y}}- \mathbf{W}_2\mathbf{W}_1\bar{\mathbf{Y}} \parallel_F^2 \end{align*} \]

The implication here is that with the optimal \(\mathbf{b}_2\) bias vector, the problem is no longer dependent on the bias vector \(\mathbf{b}_1\) in the hidden layer. The problem now also looks a lot like the PCA optimization problem.
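Here is a quick numerical check of this reduction, with arbitrary \(\mathbf{W}_1\), \(\mathbf{W}_2\), \(\mathbf{b}_1\) and made-up sizes: plugging in the optimal \(\mathbf{b}_2\) gives exactly the bias-free loss on the mean-centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 6, 3, 50
Y = rng.normal(size=(n, N))
u = np.ones((N, 1))

W1 = rng.normal(size=(p, n))
W2 = rng.normal(size=(n, p))
b1 = rng.normal(size=(p, 1))

# Optimal output bias: b_2 = (1/N) (Y - W_2 (W_1 Y + b_1 u^T)) u
b2 = (Y - W2 @ (W1 @ Y + b1 @ u.T)) @ u / N

loss_with_bias = np.linalg.norm(
    Y - W2 @ (W1 @ Y + b1 @ u.T) - b2 @ u.T, ord='fro') ** 2

# Bias-free loss on the mean-centered data
Y_bar = Y - Y.mean(axis=1, keepdims=True)
loss_centered = np.linalg.norm(Y_bar - W2 @ W1 @ Y_bar, ord='fro') ** 2

print(np.allclose(loss_with_bias, loss_centered))   # expect True
```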

While PCA uses eigendecomposition or SVD to solve for the weights, under the constraints that the columns are orthonormal and that the two matrices are transposes of each other, backpropagation in the linear autoencoder learns the weights without those constraints.

So, it’s not hard to imagine that the weights learned are not going to be the same.

What Can Linear Autoencoders (Not) Learn?

However, according to the Eckart–Young–Mirsky theorem, we know that the optimal rank-\(k\) approximation of a matrix is given by the projection of our data matrix onto the subspace spanned by its top \(k\) principal directions.

For our linear autoencoder (LAE), if our SGD-based optimization finds the global minimum, this should mean that:

\[ \begin{align*} \mathbf{W}_2\mathbf{W}_1\bar{\mathbf{Y}} &= \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T \end{align*} \]

Plugging in the SVD of \(\bar{\mathbf{Y}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T\):

\[ \begin{align*} \mathbf{W}_2\mathbf{W}_1\bar{\mathbf{Y}}\bar{\mathbf{Y}}^T &= \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T\bar{\mathbf{Y}}^T \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)^T &= \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T (\mathbf{U}\mathbf{\Sigma}\mathbf{V}^T)^T & \text{Plug in SVD of }\bar{\mathbf{Y}} \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T \mathbf{V}\mathbf{\Sigma}\mathbf{U}^T &= \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T \mathbf{V}\mathbf{\Sigma}\mathbf{U}^T & \text{Transpose} \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}\mathbf{\Sigma}\mathbf{U}^T &= \mathbf{U}_k\mathbf{\Sigma}_k [I_{k\times k};0_{k\times n-k}] \mathbf{\Sigma}\mathbf{U}^T & \mathbf{V}^T\mathbf{V} = I \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}^2\mathbf{U}^T &= \mathbf{U}_k [\mathbf{\Sigma}^2_{k};0]_{n\times n}\mathbf{U}^T & \text{Simplify} \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}^2\mathbf{U}^T\mathbf{U} &= \mathbf{U}_k [\mathbf{\Sigma}_k^2;0]_{n\times n}\mathbf{U}^T\mathbf{U} & \text{Right multiply with } \mathbf{U} \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}^2 &= \mathbf{U}_k [\mathbf{\Sigma}_k^2;0]_{n\times n} & \mathbf{U}^T\mathbf{U} = I \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{\Sigma}^2(\mathbf{\Sigma}^2)^{-1} &= \mathbf{U}_k [\mathbf{\Sigma}_k^2;0]_{n\times n}(\mathbf{\Sigma}^2)^{-1} & \text{Right multiply with } (\mathbf{\Sigma}^2)^{-1} \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U} &= \mathbf{U}_k [I_k;0]_{n\times n} & \text{Simplify} \\ \mathbf{W}_2\mathbf{W}_1 \mathbf{U}\mathbf{U}^T &= \mathbf{U}_k [I_k;0]_{n\times n}\mathbf{U}^T & \text{Right multiply with } \mathbf{U}^T \\ \mathbf{W}_2\mathbf{W}_1 &= \mathbf{U}_k [I_k;0]_{n\times n}\mathbf{U}^T & \mathbf{U}\mathbf{U}^T = I \\ \mathbf{W}_2\mathbf{W}_1 &= \mathbf{U}_k \mathbf{U}^T_k \end{align*} \]

Does this necessarily mean that \(\mathbf{W}_2 = \mathbf{U}_k\), \(\mathbf{W}_1 = \mathbf{U}^T_k\)?

Let’s run a quick experiment to check if it’s true.

Spinning up an Example

Similar to the experiment demonstrated in (Plaut 2018), we run PCA on a subset of the MNIST data and plot the top 8 left singular vectors. We also load up a pre-trained linear autoencoder (with 16 neurons in a single hidden layer) and plot the first 8 columns of \(\mathbf{W}_2\). If our derivation above is correct, then the two sets of images should match, up to the dimensionality of \(\mathbf{W}_2\).

Code
import matplotlib.pyplot as plt 
import numpy as np
import pandas as pd

from torchvision import datasets
import torchvision.transforms as transforms
import torch
from torch import nn
from torch.utils.data import Subset, DataLoader

newshape = 28
train_dataset = datasets.MNIST(
    root='data',
    train=True,
    download=True, 
    transform=transforms.Compose(
        [transforms.Resize(newshape), 
         transforms.ToTensor(), 
         transforms.Normalize((0.1307,), (0.3081,))
        ]
    )
)

# Data preparation: load the whole subset as a single batch
subset_train = Subset(train_dataset, indices=range(len(train_dataset) // 50))
train_loader = DataLoader(subset_train, batch_size=len(subset_train))
batch_images, batch_labels = next(iter(train_loader))
images = batch_images.numpy().reshape(len(subset_train), newshape, newshape)
labels = batch_labels.numpy()
data = images.reshape(len(subset_train), newshape*newshape)
num_images = len(images)

class Autoencoder(nn.Module):
    def __init__(self, shape):
        super().__init__()
        self.shape = shape
        latent = 16

        # Encoder: a single linear layer, no activation
        self.encoder = nn.Sequential(
            nn.Linear(shape, latent),
        )
        # Decoder: a single linear layer, no activation
        self.decoder = nn.Sequential(
            nn.Linear(latent, shape),
        )

    def forward(self, x):
        # Reconstruct input
        encoding = self.encoder(x)
        reconstruction = self.decoder(encoding)
        return reconstruction.view(-1, self.shape), encoding


col_range = 8

# PCA
u, s, vh = np.linalg.svd(data.T, full_matrices=True)
pca_lsv_reshaped = []
for i in range(0,col_range):
  pca_lsv_reshaped.append(u[:,i].reshape(28,28))

# Trained LAE
num_features = data.shape[1]
model = Autoencoder(shape=num_features)
_ = model.load_state_dict(torch.load('./mnist_lae.pt', map_location="cpu"))
w1 = model.encoder[0].weight.detach().numpy()
w2 = model.decoder[0].weight.detach().numpy()

# SVD on model weights W2
u2, s2, vh2 = np.linalg.svd(w2, full_matrices=True)

lae_w2_reshaped = []
for i in range(0, col_range):
  lae_w2_reshaped.append(w2[:,i].reshape(28,28))

# Theoretically, W_2 = U_k O_kxk
# and W_1 = transposed(O_kxk) transposed(U_k)
# We can get O_kxk, then do math to get U_k from W_1
ortho = np.linalg.inv(u2)@w2
w1_loading = np.transpose(ortho@w1)

lae_rsv_encoder_reshaped = []
lae_lsv_decoder_reshaped = []

for i in range(0,col_range):
  lae_rsv_encoder_reshaped.append(w1_loading[:,i].reshape(28,28))
  lae_lsv_decoder_reshaped.append(u2[:,i].reshape(28,28))

Scrolling through the indices, it very quickly becomes obvious that the columns of \(\mathbf{W}_2\) are a bit off compared to the left singular vectors of the data matrix. There is something else at play here.

Invariance and Regularization

Okay, so to revisit our question, is it true that \(\mathbf{W}_2 = \mathbf{U}_k\), \(\mathbf{W}_1 = \mathbf{U}^T_k\)?

Nope. Why? Because the MSE loss we are using is “invariant under the linear group of invertible \(k \times k\) matrices” (Broad Institute 2019). Written a different way, this means that our LAE might have learned:

\[ \mathbf{W}_2 = \mathbf{U}_k\mathbf{G}_{k\times k} \text { and } \mathbf{W}_1 = \mathbf{G}^{-1}_{k\times k}\mathbf{U}^T_k \]

where \(\mathbf{G}_{k\times k}\) is any invertible \(k \times k\) matrix.

Figure from (Kunin et al. 2019a). The authors use X to represent the input, whereas we have been using Y in all of the derivations above.
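We can verify this invariance numerically: insert any invertible \(\mathbf{G}\) and its inverse between arbitrary \(\mathbf{W}_2\) and \(\mathbf{W}_1\), and the reconstruction, hence the MSE loss, is unchanged (the sizes below are made up).

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, N = 8, 3, 40
Y_bar = rng.normal(size=(n, N))
Y_bar -= Y_bar.mean(axis=1, keepdims=True)       # mean-centered data

W1 = rng.normal(size=(k, n))
W2 = rng.normal(size=(n, k))
G = rng.normal(size=(k, k))                      # almost surely invertible

loss = lambda W2_, W1_: np.linalg.norm(Y_bar - W2_ @ W1_ @ Y_bar, 'fro')**2

# Reparameterize: W_2 -> W_2 G, W_1 -> G^{-1} W_1
W2_g, W1_g = W2 @ G, np.linalg.inv(G) @ W1

print(np.allclose(loss(W2, W1), loss(W2_g, W1_g)))   # expect True
```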

But, we may still be able to recover the left singular vectors of the dataset from our decoder. If we take a closer look at the structure of \(\mathbf{W}_2 = \mathbf{U}_k\mathbf{G}_{k\times k}\), we just have to somehow extract the matrix on the left!

In fact, there is a very important detail (which we skipped) which allows us to easily recover \(\mathbf{U}_k\). The pre-trained LAE in this article was trained with L2 sum regularization, such that the loss function actually looks like this:

\[ J_\sigma = \operatorname*{arg\,min}_{\mathbf{W}_1, \mathbf{W}_2} \parallel \bar{\mathbf{Y}}- \mathbf{W}_2\mathbf{W}_1\bar{\mathbf{Y}} \parallel_F^2 + \lambda ( \parallel \mathbf{W}_1 \parallel_F^2 + \parallel \mathbf{W}_2 \parallel_F^2 ) \]

Implemented in Python:

def train_autoencoder(data, epochs, model):
    # Assumes `data` yields plain feature tensors (no labels)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    criterion = nn.MSELoss()

    train_loader = DataLoader(data, batch_size=2)

    for epoch in range(epochs):
        loss = 0
        for batch_features in train_loader:
            batch_features = batch_features.to(device)
            optimizer.zero_grad()
            outputs, _ = model(batch_features)

            # L2 "sum" regularization on the encoder and decoder weights
            w1_norm = torch.norm(model.encoder[0].weight, p='fro')
            w2_norm = torch.norm(model.decoder[0].weight, p='fro')
            reg = 5e-4 * (w1_norm**2 + w2_norm**2)

            train_loss = criterion(outputs, batch_features) + reg
            train_loss.backward()
            optimizer.step()
            loss += train_loss.item()

        # Average (reconstruction + regularization) loss for the epoch
        loss = loss / len(train_loader)

    return model
This detail is important because, while the MSE loss is invariant under the general linear group, the specific regularization expression used above (sum regularization) is not invariant under the general linear group.

To elaborate, when minimizing the term \(\parallel \mathbf{W}_2 \parallel_F^2 = \parallel \mathbf{U}_k\mathbf{G}_{k\times k} \parallel_F^2\), different matrices \(\mathbf{G}_{k\times k}\) do not provide the same result. Using the definition of the squared Frobenius norm, it is clear that

\[ Tr(\mathbf{U}_k\mathbf{U}_k^T) \neq Tr(\mathbf{U}_k\mathbf{G}_{k\times k}\mathbf{G}_{k\times k}^T\mathbf{U}_k^T) \]

Now, if we were using a different regularization term, say \(\parallel \mathbf{W}_2\mathbf{W}_1 \parallel_F^2\) (product regularization), our loss function would remain invariant under the general linear group. Again, using the definition of the squared Frobenius norm \[ Tr(\mathbf{U}_k\mathbf{U}_k^T) = Tr( \mathbf{U}_k\mathbf{G}_{k\times k}\mathbf{G}_{k\times k}^{-1}\mathbf{U}_k^T \mathbf{U}_k\mathbf{G}_{k\times k}^{-T}\mathbf{G}_{k\times k}^{T}\mathbf{U}_k^T ) \]

However, there is a group that sum regularization remains invariant under. If the matrix \(\mathbf{G}_{k\times k}\) were specifically an orthogonal matrix \(\mathbf{O}_{k\times k}\), then the following would hold:

\[ Tr(\mathbf{U}_k\mathbf{U}_k^T) = Tr(\mathbf{U}_k\mathbf{O}_{k\times k}\mathbf{O}_{k\times k}^T\mathbf{U}_k^T) \]

So, the loss function used to train the LAE here is invariant under the orthogonal group. As a result, we are left with the task of extracting \(\mathbf{U}_k\) from the matrix \(\mathbf{W}_2 = \mathbf{U}_k\mathbf{O}_{k\times k}\), which we can do with an SVD, with the guarantee that the left singular vectors we obtain will be \(\mathbf{U}_k\). Had our loss function been invariant under the general linear group, there would have been no guarantee that the left singular vectors obtained from an SVD would be \(\mathbf{U}_k\).
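Here is a small numerical illustration of these invariance claims, using a random invertible \(\mathbf{G}_{k\times k}\) and a random orthogonal \(\mathbf{O}_{k\times k}\) obtained from a QR decomposition (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

U, _ = np.linalg.qr(rng.normal(size=(n, k)))   # orthonormal columns, stand-in for U_k
G = rng.normal(size=(k, k))                    # invertible (almost surely), not orthogonal
O, _ = np.linalg.qr(rng.normal(size=(k, k)))   # orthogonal k x k matrix

def sum_reg(W2, W1):
    return np.linalg.norm(W1, 'fro')**2 + np.linalg.norm(W2, 'fro')**2

def prod_reg(W2, W1):
    return np.linalg.norm(W2 @ W1, 'fro')**2

base   = (U, U.T)                              # W_2 = U_k, W_1 = U_k^T
with_G = (U @ G, np.linalg.inv(G) @ U.T)
with_O = (U @ O, O.T @ U.T)

print(np.isclose(sum_reg(*base), sum_reg(*with_G)))    # False: sum reg is not GL-invariant
print(np.isclose(sum_reg(*base), sum_reg(*with_O)))    # True: sum reg is invariant under O(k)
print(np.isclose(prod_reg(*base), prod_reg(*with_G)))  # True: product reg is GL-invariant
```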

So, for our sum-regularization trained LAE, instead of looking at \(\mathbf{W}_2\), let’s take a look at the left singular vectors of \(\mathbf{W}_2\):

Way better! Not perfect, but we can chalk that up to bad hyperparameter selection and not-so-great convergence. We can even recover an approximation of \(\mathbf{U}_k\) from \(\mathbf{W}_1\). If we’re assuming that \(\mathbf{W}_1 = \mathbf{O}_{k\times k}^T\mathbf{U}_k^T\), then we can left multiply by \(\mathbf{O}_{k\times k}\) to obtain \(\mathbf{U}_k^T\)

\[ \begin{align} \mathbf{W}_1 &= \mathbf{O}_{k\times k}^T\mathbf{U}_k^T \\ \mathbf{O}_{k\times k}\mathbf{W}_1 &= \mathbf{O}_{k\times k}\mathbf{O}_{k\times k}^T\mathbf{U}_k^T \\ \mathbf{O}_{k\times k}\mathbf{W}_1 &= \mathbf{U}_k^T \end{align} \]

Plotting the transpose of this result, we obtain:

Why This Might Be Cool

As pointed out in (Kunin et al. 2019b), given that our input data \(\mathbf{X}\) is \(m \times n\) dimensional, we just carried out PCA by training an LAE and applying an SVD to an \(m \times k\) matrix, where \(k \ll n\). Otherwise, in the typical approach, we would have had to carry out an SVD on the original \(m \times n\) matrix. Maybe there is some merit here? It would only make sense to do this if training the LAE and conducting the SVD on the smaller matrix turned out to be cheaper than conducting the SVD on the original matrix.

Last updated

2024-05-17 07:02:10 UTC

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository. Suggestions are appreciated!

Reuse

Generated text and figures are licensed under Creative Commons Attribution CC BY 4.0. The raw article and its contents are available on GitHub, unless otherwise noted. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: ‘Figure from …’

References

Aljundi, Rahaf, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. 2017. “Memory Aware Synapses: Learning What (Not) to Forget.” arXiv Preprint arXiv:1711.09601.
Bengio, Yoshua, Dong-Hyun Lee, Jorg Bornschein, Thomas Mesnard, and Zhouhan Lin. 2015. “Towards Biologically Plausible Deep Learning.” arXiv Preprint arXiv:1502.04156.
Broad Institute. 2019. “MIA: Daniel Kunin, Jon Bloom, Aleks Goeva, Cotton Seed. Topology of Regularized Linear Autoencoders.” https://www.youtube.com/watch?v=1mvwCmqd6zQ&list=PLlMMtlgw6qNjROoMNTBQjAcdx53kV50cS&t=3480s.
Fournier, Quentin, and Daniel Aloise. 2019. “Empirical Comparison Between Autoencoders and Traditional Dimensionality Reduction Methods.” In 2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), 211–14. IEEE.
Goodfellow, Ian J., Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. Cambridge, MA, USA: MIT Press.
Kunin, Daniel, Jonathan Bloom, Aleksandrina Goeva, and Cotton Seed. 2019b. “Loss Landscapes of Regularized Linear Autoencoders.” In International Conference on Machine Learning, 3560–69. PMLR.
———. 2019a. “Loss Landscapes of Regularized Linear Autoencoders.” PMLR. https://icml.cc/media/Slides/icml/2019/halla(13-16-00)-13-16-40-4987-loss_landscapes.pdf.
Lillicrap, Timothy P, Adam Santoro, Luke Marris, Colin J Akerman, and Geoffrey Hinton. 2020. “Backpropagation and the Brain.” Nature Reviews Neuroscience 21 (6): 335–46.
Nielsen, Michael A. 2015. Neural Networks and Deep Learning. Vol. 25. Determination press San Francisco, CA, USA.
Plaut, Elad. 2018. “From Principal Subspaces to Principal Components with Linear Autoencoders.” arXiv Preprint arXiv:1804.10253.
Shlens, Jonathon. 2014. “A Tutorial on Principal Component Analysis.” arXiv Preprint arXiv:1404.1100.

Citation

BibTeX citation:
@online{aswani2024,
  author = {Aswani, Nishant},
  title = {Discussing {PCA} and {Linear} {Autoencoders}},
  date = {2024-05-15},
  url = {https://nishantaswani.com/articles/autoencoders.html},
  langid = {en}
}
For attribution, please cite this work as:
Aswani, Nishant. 2024. “Discussing PCA and Linear Autoencoders.” May 15, 2024. https://nishantaswani.com/articles/autoencoders.html.