Note: I begin this blog post by recapping some of the linear algebra shown in Gregory Gundersen’s and Jeremy Kun’s blog posts. The rest of it is stuff I learned in a Neural Statistics class at NYU CNS. You’ll also find these ideas in the MLRefined book.
Groundwork
In linear algebra, everybody says that our problems look like \(A\mathbf{x} = \mathbf{b}\), where we are interested in solving for one of the values above.
For example, maybe \(\mathbf{x}\) is a single-channel image and we want to rotate it by a certain angle. In that case, you could craft a matrix \(A\) that would help you transport each pixel, such that you end up with a rotated image \(\mathbf{b}\). The matrix transformation \(A\) is an operation that brings you from one vector to another.
Matrices as linear operators.
A very common mantra is that matrix transformations are linear maps. Again, our matrix \(A\) is a linear transformation that brings us from \(\mathbf{x}\) to \(\mathbf{b}\). In Figure 1, our original vector is \(\mathbf{x} = \begin{pmatrix} 1 & 1 \end{pmatrix}^T\) (not visualized) and the resultant vector \(\mathbf{b}\) is visualized in green.
It’s linear because when you mess around in Figure 1, the grid lines are always straight! The grid doesn’t go all noodly on you. Also, try getting a feel for how you can rotate the green arrow!
We also plot the columns of our matrix \(A\), in blue and red, where each column is displayed as a basis vector. At initialization, the columns are the standard basis vectors \(\begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix}\) that we are familiar with. And, when that is the case, \(\mathbf{x} = \mathbf{b}\).
```js
import { Matrix, mat_mult } from "@nstrayer/simple-matrices"
import { generateGridLines } from "@niniack/dimred"

// Create matrix from variables
transformation = new Matrix([2, 2], [first_input.a11, first_input.a12, first_input.a21, first_input.a22])
x = new Matrix([2, 1], [1, 1])
output = mat_mult(transformation, x)

// Transforming matrix to an array of objects for plotting
plot_matrix = [
  { x: transformation.get(0, 0), y: transformation.get(1, 0), color: "royalblue" }, // First column in royal blue
  { x: transformation.get(0, 1), y: transformation.get(1, 1), color: "crimson" }    // Second column in crimson
]

// Create the plot_vector in a separate cell
plot_vector = [{ x: output.get(0, 0), y: output.get(1, 0) }]

gridLines = generateGridLines(transformation, [-5, 5], 1);
```
The typical mathematical representation of what is happening above looks like \[
\begin{bmatrix}
\textcolor{royalblue}{a} & \textcolor{crimson}{b} \\
\textcolor{royalblue}{c} & \textcolor{crimson}{d}
\end{bmatrix}
\begin{bmatrix}
x \\
y
\end{bmatrix}
=
x\begin{bmatrix}
\textcolor{royalblue}{a} \\
\textcolor{royalblue}{c}
\end{bmatrix}
+
y\begin{bmatrix}
\textcolor{crimson}{b} \\
\textcolor{crimson}{d}
\end{bmatrix}
=
\begin{bmatrix}
\textcolor{green}{ax + by}\\
\textcolor{green}{cx + dy}
\end{bmatrix}.
\]
Intuitively, it looks like we are "remixing" (linearly combining) our red and blue vectors to come up with a new green vector. Going back to Figure 1, remember that \(\mathbf{x} = \begin{pmatrix} 1 & 1 \end{pmatrix}^T\). So, \(x\) and \(y\) are fixed, but we are picking the column vectors of \(A\) that will be remixed. It is a bit weird: rather than changing the input, we are changing the map we want to use.
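If it helps to see the "remixing" numerically, here is a quick NumPy sketch (in Python rather than the Observable code driving Figure 1, and with made-up matrix entries):

```python
import numpy as np

# An arbitrary 2x2 transformation: the columns play the roles of the blue and red vectors
A = np.array([[2.0, -1.0],
              [1.0,  3.0]])
x = np.array([1.0, 1.0])   # the fixed input vector (1, 1)

# Standard matrix-vector product
b = A @ x

# The same result as a "remix": x * (first column) + y * (second column)
remix = x[0] * A[:, 0] + x[1] * A[:, 1]

print(b)                      # [1. 4.]
print(np.allclose(b, remix))  # True
```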
Diagonal matrix
With this remixing idea in mind, let's assume a diagonal matrix. Then, while linearly combining the two vectors, we are building a new vector whose components are still independently controlled by the two original vectors. Basically, \(\textcolor{royalblue}{a}x\) has nothing from the red vector and \(\textcolor{crimson}{d}y\) has nothing from the blue one.
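As a tiny sanity check of the diagonal case (again a NumPy sketch with made-up entries), each output component is scaled by exactly one input component:

```python
import numpy as np

D = np.diag([2.0, 0.5])   # example diagonal matrix: a = 2, d = 0.5, b = c = 0
x = np.array([1.0, 1.0])

b = D @ x
print(b)  # [2.  0.5]: the first component is scaled only by a, the second only by d
```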
We’ve just been building one 2D object. However, we can be a little more efficient, and build several 2D objects simultaneously, each a different remix of our original vectors. \[
\begin{bmatrix}
\textcolor{royalblue}{a} & \textcolor{crimson}{b} \\
\textcolor{royalblue}{c} & \textcolor{crimson}{d}
\end{bmatrix}
\begin{bmatrix}
x_1 & x_2 \\
y_1 & y_2
\end{bmatrix}
=
\begin{bmatrix}
\textcolor{royalblue}{a} \\
\textcolor{royalblue}{c}
\end{bmatrix}
\begin{bmatrix}
x_1 & x_2
\end{bmatrix}
+
\begin{bmatrix}
\textcolor{crimson}{b} \\
\textcolor{crimson}{d}
\end{bmatrix}
\begin{bmatrix}
y_1 & y_2
\end{bmatrix}
=
\begin{bmatrix}
\textcolor{royalblue}{a}x_1 + \textcolor{crimson}{b}y_1 & \textcolor{royalblue}{a}x_2 + \textcolor{crimson}{b}y_2\\
\textcolor{royalblue}{c}x_1 + \textcolor{crimson}{d}y_1 & \textcolor{royalblue}{c}x_2 + \textcolor{crimson}{d}y_2
\end{bmatrix}
\]
In our resulting matrix, we neatly end up with two new vectors, side by side. Each of them is a different remix of the red and blue vectors.
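The same expansion can be checked numerically. In the NumPy sketch below (toy values), the matrix product equals a sum of column-times-row outer products, one per remix ingredient:

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [1.0,  3.0]])   # columns: the blue vector and the red vector
X = np.array([[1.0, 0.5],
              [1.0, 2.0]])    # two inputs, (x1, y1) and (x2, y2), stored as columns

product = A @ X

# Blue column weighted by the x's, plus red column weighted by the y's
expansion = np.outer(A[:, 0], X[0, :]) + np.outer(A[:, 1], X[1, :])

print(np.allclose(product, expansion))  # True
print(product)  # each column of the result is a different remix of the blue and red columns
```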
Framing matrix factorization.
You could also just stack a couple more transformations, \(C\) and \(B\), to the left of \(A\). Of course, nothing stops them from "collapsing" into a single, final transformation that you could then apply to your vector. Sometimes, though, this deconstructed representation is useful. For example, you can execute a two-dimensional rotation by applying three shears consecutively. The linked blog post talks about why this deconstructed approach solves the "aliasing" problem encountered in rotations.
In fact, that's why neural networks need "non-linearities" in between layers. Otherwise, they'd just "collapse" into a single matrix multiplication!
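Here is a hedged NumPy sketch of that collapse, using toy matrices rather than an actual network: two linear layers in a row behave exactly like their single product, unless a pointwise non-linearity sits between them.

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [0.5,  1.0]])   # first linear "layer" (toy values)
B = np.array([[1.0,  1.0],
              [2.0, -1.0]])   # second linear "layer" (toy values)
x = np.array([1.0, 1.0])

sequential = B @ (A @ x)      # apply A, then B
collapsed = (B @ A) @ x       # one pre-multiplied matrix
print(np.allclose(sequential, collapsed))       # True: the two layers collapse into one

# A pointwise non-linearity in between breaks the collapse
relu = lambda v: np.maximum(v, 0.0)
print(np.allclose(B @ relu(A @ x), collapsed))  # False
```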
In Figure 2, you can get a feel for why three shears give you a 2D rotation. We first shear only along the x-axis, so our blue arrow stays put; we now have a new set of basis vectors. Next, we do a shear along the standard y-axis. This has the effect of moving both of our new basis vectors (in the middle subfigure, both the red and blue vectors have moved). I had a bit of trouble with this: why did both vectors move? After all, I have only sheared in one direction with Shear \(B\). But, you have to keep in mind that you are applying Shear \(B\) to the new set of basis vectors, where the red vector was a combination of the two standard basis vectors. Hence, you are inevitably going to affect both vectors when doing a second, vertical shear.
Looking at only the green vector is pretty interesting too. As an exercise, fix the angle in Figure 2 to something like \(-100\) and look at the results. Focusing on just the green vector, it seems like there was an exclusively rightwards force, followed by an exclusively downwards force, and then an exclusively leftwards force, which all results in a rotation. Of course, these are precise shear values governed by trigonometric functions; doing a different downwards shear would not eventually produce a clean rotation. The amounts of shear in the three transformations are coordinated with one another.
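For reference, one standard choice of shear amounts, and the one I assume Figure 2 is built on, is an x-shear by \(-\tan(\theta/2)\), a y-shear by \(\sin\theta\), and the same x-shear once more. A NumPy sketch to verify the composition:

```python
import numpy as np

theta = np.deg2rad(-100)                      # try the same angle as in the text

shear_x = np.array([[1.0, -np.tan(theta / 2)],
                    [0.0,  1.0]])             # shear along the x-axis
shear_y = np.array([[1.0,           0.0],
                    [np.sin(theta), 1.0]])    # shear along the y-axis

rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Shears are applied right-to-left: x-shear, then y-shear, then x-shear again
composed = shear_x @ shear_y @ shear_x
print(np.allclose(composed, rotation))  # True
```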
So far, we have been thinking about matrices as linear transformations. And, we can "decompose" our linear transformation into a set of simpler transformations, which can be handy (e.g., in 2D, three shears give us a rotation).
In truth, that's not how we always encounter \(A\). Typically, you have \(M\) features, which you have recorded \(N\) times, resulting in an \(M \times N\) matrix that you want to study. So, it's not very obvious how \(A \in \mathbb{R}^{M \times N}\) is a transformation of any sort. There is a confusing, unspoken dual perspective on matrices that Jeremy Kun's blog post captures very well:
The concern is the connection between matrices as transformations and matrices as a “convenient” way to organize data.
We seem to talk about \(A\) from these two different perspectives. When working with "data," we organize it into a matrix \(A\) because that's a natural way to present what we want to work with. On the other hand, a "transformation" is organized like that because it is a convenient way to represent a system of linear equations that we want to solve for some unknown value.
Okay, let's try and grapple with that duality. We've got our data matrix \(A \in \mathbb{R}^{M \times N}\).
We have been thinking about each column as a vector spanning our space. Consequently, the number of rows \(M\) determines the dimension of the \(N\) vectors. If \(N \ll M\), then we have fewer spanning vectors than there are dimensions, so we cannot fully describe the space. And if \(N \gg M\), we have an abundance of vectors, some of which must be linearly dependent, to describe the space.
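A quick NumPy check of that counting argument, with random stand-in matrices: the column space can never have more than \(\min(M, N)\) dimensions.

```python
import numpy as np

rng = np.random.default_rng(42)

tall = rng.normal(size=(5, 2))   # M=5 dimensions, only N=2 vectors
wide = rng.normal(size=(2, 5))   # M=2 dimensions, N=5 vectors

print(np.linalg.matrix_rank(tall))  # 2: the columns span at most a 2D slice of R^5
print(np.linalg.matrix_rank(wide))  # 2: five columns in R^2 must be linearly dependent
```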
Thinking a bit abstractly, our data matrix describes how a very specific “world” works. It is a “map” of what output you get when you chuck in an input, without exactly knowing what the transformation looks like. Real-world maps look a lot different, so the way I’m using “map” here is quite odd. Really, it’s as if you asked a bunch of super predictable, non-risk taking hikers where they ended up after an hour of hiking. If you wanted a decent understanding of the layout, you probably wouldn’t ask the same hiker to do the path \(N\) times. Instead, you’d get \(N\) very different people to do it, expecting that they would approach the terrain differently because they have varying personalities.
Ultimately though, we are somewhat limited in our understanding of where the \((N+1)\)-th hiker would end up because we only have \(N\) hikers to work with. Like we did with the green vector in Figure 1, if we want to build a new vector \(\mathbf{b}\), that is, predict the final location of hiker \(N+1\), we would linearly combine the endpoints of the hikers we have on hand to get somewhere new. And, the weighting of this linear combination would be determined by an \(\mathbf{x}\). Maybe we'd try to quantify everyone's personality and then, when predicting the endpoint of hiker \(N+1\), express her personality as a combination of the personalities of the \(N\) hikers we previously worked with. If hiker \(N+1\) was a lot like hiker 5, it's not an outrageous guess that she would end up close to hiker 5's endpoint.
Figure 3: Linear combinations of fixed bases. We have set \(M=2\) and \(N=3\).
There’s an important difference in what we are manipulating between Figure 1 and Figure 3. Earlier, we could pick out our spanning vectors but could not manipulate how we were combining them. Here, we have control over how we want to combine the vectors, which is what our input is, but the map itself is fixed.
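In code, the Figure 3 setup might look like the following sketch (the endpoint values are invented): the data matrix, i.e. the map, stays fixed, and the only thing we choose is the weighting \(\mathbf{x}\).

```python
import numpy as np

# Fixed "map": M=2 coordinates recorded for N=3 hikers, one column per hiker (toy values)
A = np.array([[1.0, 3.0, -2.0],
              [2.0, 0.5,  1.0]])

# Our only knob: how much of each hiker's endpoint to mix in
x = np.array([0.2, 0.7, 0.1])

b = A @ x
print(b)  # roughly [2.1, 0.85]: a predicted endpoint built from the three recorded ones
```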
Decomposing the linear map.
As described above, the linear “map” is how we see a specific world, so it is unsurprising that we try to understand how this map works. A neat way to do so is to decompose the map into simpler transformations, like we did with the rotation matrix.
Returning to the hiker analogy, it makes a lot more sense to construct the personality of hiker \(N+1\) as a combination of personality traits (goofy, lazy, inefficient, etc.) than as a combination of other hikers.
A very helpful fact is that if you changed the basis of your initial vector \(\mathbf{x}\) to give you \(\mathbf{x'}\) and the basis of your final vector \(\mathbf{b}\) to give you \(\mathbf{b'}\), then you could interpret \(\mathbf{b'}\) as just a re-scaled version of \(\mathbf{x'}\). Symbolically, this looks like \[
\Sigma \mathbf{x'} = \mathbf{b'}, \qquad \mathbf{x'} = V^T\mathbf{x}, \quad \mathbf{b'} = U^T\mathbf{b},
\] where \(\Sigma\) is a diagonal (scaling) matrix and \(V\) and \(U\) are the two change-of-basis matrices.
It is even better that the changes of basis, albeit different for \(\mathbf{x}\) and \(\mathbf{b}\), are actually just rotations. Then, we can rearrange this understanding to see the canonical form of the SVD, \[
A = U \Sigma V^T.
\]
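To make that concrete, here is a small NumPy sketch on a random \(2 \times 2\) matrix (not part of the original figures): after rotating into the two new bases, the map really is just a per-coordinate rescaling.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(2, 2))   # an arbitrary example matrix
x = np.array([1.0, 1.0])
b = A @ x

U, s, Vt = np.linalg.svd(A)   # A = U @ diag(s) @ Vt

x_prime = Vt @ x              # x expressed in the rotated "input" basis
b_prime = U.T @ b             # b expressed in the rotated "output" basis

# In the new bases, the transformation is just elementwise scaling by the singular values
print(np.allclose(s * x_prime, b_prime))  # True
```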
A lot of what we want to do with the SVD revolves around representing our original matrix \(A\) as a lighter, but still good enough, matrix \(\tilde{A}\). If you look at the proof of the SVD, the rotation matrices \(U\) and \(V\) materialize in a meaningful way: their columns are the eigenvectors of the \(AA^T\) and \(A^TA\) matrices, respectively. And, when I think of eigenvectors, I think of this quote from here:
I’ve always thought of eigenvectors as matrix bouillon, they distill the behavior of the matrix into an info dense broth
With that analogy, it isn’t all that surprising that eigenvectors make an appearance when talking about reducing the dimensions of a matrix. If you want a lighter broth, just use a smaller chunk of the bouillon cube.
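Concretely, the "smaller chunk of the bouillon cube" corresponds to keeping only the top \(k\) singular directions. A hedged NumPy sketch on a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(6, 4))            # a stand-in M x N data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                  # keep only the two strongest directions
A_tilde = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A_tilde has the same shape as A but only rank k, and the error we incur is
# governed by the singular values we threw away
print(A_tilde.shape, np.linalg.matrix_rank(A_tilde))
print(np.linalg.norm(A - A_tilde, ord=2), s[k])   # these two agree (Eckart-Young)
```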
Citation
BibTeX citation:
@online{aswani2025,
author = {Aswani, Nishant},
title = {A {Unified} {View} of {Linear} {Dimensionality} {Reduction}},
date = {2025-06-27},
url = {https://nishantaswani.com/articles/dimred/dimred.html},
langid = {en}
}