Technical Glossary (Never Complete)

A home for definitions.

Author: Nishant Aswani
Affiliation: NYU Abu Dhabi, NYU Tandon
Published: August 1, 2021

ML Fundamentals

Too long to fit here; visit this page.

From the Geometric Deep Learning Book and Brilliant.

A subset \(A\) of a space \(X\) (i.e., \(A \subset X\)) is “dense” in \(X\) if it satisfies \[ A \cup \{\lim_{i \rightarrow \infty} a_i : a_i \in A\} = X. \]

This means that any point in \(X\) is arbitrarily close to a point in \(A\). A good example is the subset of rational numbers (subset \(A\)) in the set of real numbers (set \(X\)): any real number is either a rational number or there is a rational number very close to it.
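As a quick sanity check of the rationals-in-reals example, here is a minimal sketch using Python's `fractions` module to find a rational arbitrarily close to \(\pi\) (the denominator bound is an arbitrary choice):

```python
import math
from fractions import Fraction

# Find a rational approximation to pi with denominator at most 1000
approx = Fraction(math.pi).limit_denominator(1000)

# The resulting fraction lands within ~3e-7 of pi
error = abs(float(approx) - math.pi)
print(approx, error)
```

Raising the denominator bound tightens the approximation without limit, which is exactly what density promises.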

The Universal Approximation Theorem shows that the family of two-layer perceptrons, i.e. \(f(x) = c^T \sigma(Ax+b)\), is dense in the space of continuous functions on compact subsets of \(\mathbb{R}^d\).
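A hedged illustration of the theorem in action: with \(A\) and \(b\) drawn at random, solve only for \(c\) by least squares (a random-features sketch, not a full training loop; the target \(\cos(3x)\), the sizes, and the weight scales are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer perceptron f(x) = c^T sigma(Ax + b), with d = 1
n_hidden, n_points = 300, 500
A = rng.normal(0, 2, size=(n_hidden, 1))     # random input weights
b = rng.uniform(-2, 2, size=(n_hidden, 1))   # random biases

x = np.linspace(-1, 1, n_points).reshape(1, -1)
target = np.cos(3 * x).ravel()

# Hidden activations, shape (n_points, n_hidden)
H = np.tanh(A @ x + b).T

# Solve for the output weights c in the least-squares sense
c, *_ = np.linalg.lstsq(H, target, rcond=None)

max_err = np.max(np.abs(H @ c - target))
print(max_err)  # small: the two-layer net tracks cos(3x) closely on [-1, 1]
```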

From this paper.

The Lipschitz continuity of a function quantifies how quickly its output can change as the input is perturbed. More precisely, the Lipschitz constant bounds the maximum change in the output norm per unit change in the input norm.

Given a function \(f\) with \(dom(f) \subseteq \mathbb{R}^d\), that function is \(\beta\)-Lipschitz continuous w.r.t. some \(\alpha\)-norm if, for all \(x,y \in dom(f)\), it satisfies

\[ \| f(x) - f(y) \|_\alpha \leq \beta \|x-y\|_\alpha \]

Typically, we are interested in the smallest such \(\beta\). We want this property because it is a proxy for good generalization. On the other hand, we don’t want the function to be too smooth either, because that suggests high bias and low variance, which can mean underfitting.
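As a sketch, the smallest \(\beta\) can be lower-bounded empirically by sampling difference quotients; \(\sin\) is 1-Lipschitz, so the sampled ratios should approach 1 but never exceed it (the function, interval, and sample sizes here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin  # a 1-Lipschitz function (its derivative is bounded by 1)

# Sample pairs of nearby points and compute |f(x) - f(y)| / |x - y|
x = rng.uniform(-3, 3, size=100_000)
y = x + 1e-3
ratios = np.abs(f(x) - f(y)) / np.abs(x - y)

beta_hat = ratios.max()
print(beta_hat)  # close to 1, never meaningfully above it
```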

Linear Algebra

The Frobenius norm (defined only for matrices) is:

\[ \|\mathbf{X}\|_{F} = \sqrt{\sum\limits_{i=1}^n \sum\limits_{j=1}^m |X_{ij}|^2}. \]

It can be thought of as vectorizing the matrix and calculating the 2-norm.
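A quick numerical check of that equivalence (the matrix and its shape are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))

fro = np.linalg.norm(X, 'fro')    # Frobenius norm of the matrix
vec2 = np.linalg.norm(X.ravel())  # 2-norm of the vectorized matrix

print(fro, vec2)  # the two values agree
```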

The optimal rank-r approximation to \(\mathbf{X}\), in a least-squares sense, is given by the rank-r SVD truncation \(\tilde{\mathbf{X}}\).

\[ \text{arg}\min\limits_{\tilde{\mathbf{X}}, \text{s.t. rank}(\tilde{\mathbf{X}}) = r}\ \| \mathbf{X} - \tilde{\mathbf{X}}\|_F = \tilde{\mathbf{U}}\tilde{\mathbf{\Sigma}}\tilde{\mathbf{V}}^*. \]
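A sketch verifying the truncation numerically: the rank-r SVD truncation's error should equal \(\sqrt{\sum_{i>r}\sigma_i^2}\), the norm of the discarded singular values (matrix sizes and the rank are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
r = 2

# Rank-r SVD truncation: keep the top-r singular triplets
U, S, Vh = np.linalg.svd(X, full_matrices=False)
X_r = U[:, :r] * S[:r] @ Vh[:r]

err = np.linalg.norm(X - X_r, 'fro')
expected = np.sqrt(np.sum(S[r:] ** 2))  # discarded singular values
print(err, expected)  # the two agree
```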

Geometry

From the respective Wikipedia articles.

An isomorphism is a structure-preserving mapping that is invertible (e.g., taking the \(\log\), multiplying by an invertible matrix, etc.).

A diffeomorphism is an invertible function between manifolds such that both the function and its inverse are continuously differentiable.

A homeomorphism is a continuous, invertible function whose inverse is also continuous; it preserves the topological properties.

Statistics

The expected value is a weighted average of a random variable, where the weights come from a probability distribution. As such, the weights sum to \(1\). The simplest case is:

\[ \mathbb{E}[X] = \sum_{x\in X} x \cdot \operatorname{pmf}(x). \]

Check out Note 9 LOTUS below!

From these two StackExchange posts [1, 2].

Often, like here, the expectation is introduced as

\[ \mathbb{E}[g(X)] = \sum_{x\in X} g(x) \cdot \operatorname{pmf}(x), \]

which relies on the assumption that you can just replace \(x\) with \(g(x)\), from the real definition of expectation. Well, this assumption is true, hence LOTUS! But, it requires a little bit of work to arrive there. If we define \(Y=g(X)\)

\[ \begin{aligned} \mathbb{E}[g(X)] &= \mathbb{E}[Y] = \sum_{y\in Y} y \cdot \operatorname{pmf}(y) \\ &= \sum_{y\in Y} y \,\text{Pr}[Y=y] \\ &= \sum_{y\in Y} y \,\text{Pr}[g(X)=y] \\ &= \sum_{y\in Y} y \sum_{x: g(x)=y} \operatorname{pmf}(x) \\ &= \sum_{y\in Y} \sum_{x: g(x)=y} y \cdot \operatorname{pmf}(x) \\ &= \sum_{x\in X} g(x) \cdot \operatorname{pmf}(x) \\ \end{aligned} \]

In words, we start off by defining a new random variable \(Y=g(X)\) and writing the expectation over that. Then, we rewrite the PMF of \(Y\). Next, we account for the fact that \(g(\cdot)\) might not be injective, which means there could be many \(x\) that map to a single \(y\). So, \(\text{Pr}[Y=y]\) needs to sum over all \(x\) values where \(g(x)=y\). So, \(\text{Pr}[g(X)=y]\) is a sum of the probabilities of the relevant \(x\). Finally, we rearrange the double sum: we’re partitioning all \(x\) values by which \(y\) they map to, but since \(y = g(x)\) for each term in the inner sum, we can replace \(y\) with \(g(x)\) and simply sum over all \(x\) directly.
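The derivation can be checked numerically on a small discrete distribution; here \(g(x)=x^2\) is deliberately non-injective, and the distribution itself is an arbitrary choice:

```python
import numpy as np

# A discrete X with a pmf over its support
xs = np.array([-2, -1, 0, 1, 2])
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

g = lambda x: x ** 2  # non-injective: -1 and 1 both map to 1

# LOTUS: sum g(x) * pmf(x) over x directly
lotus = np.sum(g(xs) * pmf)

# "Real" definition: build the pmf of Y = g(X), then sum y * pmf(y)
ys = np.unique(g(xs))
pmf_y = np.array([pmf[g(xs) == y].sum() for y in ys])
direct = np.sum(ys * pmf_y)

print(lotus, direct)  # the two agree
```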

Notes come from here.

The moment of a random variable is the expected value of a power of that variable:

\[ \mathbb{E}[X^k] = \int x^k \operatorname{pdf}(x)\, dx. \] When we use the moment generating function (MGF), we sidestep this integral by instead computing:

\[ \mathbb{E}[e^{tX}] = \int e^{tx} \operatorname{pdf}(x)\, dx. \]

This works because we can expand the exponential with its Maclaurin series, which is just the Taylor series evaluated at 0:

\[ \begin{aligned} e^{tX} &= 1 + \frac{X}{1!}t + \frac{X^2}{2!}t^2 + \dots \\ \mathbb{E}[e^{tX}] &= 1 + \frac{\mathbb{E}[X]}{1!}t + \frac{\mathbb{E}[X^2]}{2!}t^2 + \dots \\ \end{aligned} \]

Looking for the first moment, one can take the first derivative of this expected value with respect to \(t\). Then, setting \(t=0\) makes all the remaining terms vanish, leaving \(\mathbb{E}[X]\).

Conceptually then, we have access to a function that is very easy to expand into a sum of quantities. And, when you expand, then take the correct derivative of that expansion, it will give you the desired moment. Hence, the term “generating.” This is what that looks like symbolically:

\[ M^{(k)}_X(0) = \left.\frac{d^kM_X(t)}{dt^k}\right\rvert_{t=0} = \mathbb{E}[X^k]. \]

MGFs are neat because it’s often easier to derive the MGF once (integrate the exponential using LOTUS) and then compute moments by differentiation, rather than repeatedly use the integral definition of expectation for each moment.
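A sketch with the standard normal, whose MGF has the known closed form \(M_X(t)=e^{t^2/2}\): finite differences at \(t=0\) recover the first two moments, \(\mathbb{E}[X]=0\) and \(\mathbb{E}[X^2]=1\) (the step size is an arbitrary choice):

```python
import numpy as np

M = lambda t: np.exp(t ** 2 / 2)  # MGF of the standard normal
h = 1e-4

# First derivative at 0 -> E[X] = 0
m1 = (M(h) - M(-h)) / (2 * h)

# Second derivative at 0 -> E[X^2] = 1
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2

print(m1, m2)  # approximately 0 and 1
```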

The law of total expectation looks like \[ \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X|y_1]P(y_1) + \mathbb{E}[X|y_2]P(y_2) + \dots \]

Sometimes, we only know the expectation of a random variable conditioned on another. If we know the probability distribution of the variable we conditioned on, we can reconstruct the original expectation.
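A numeric sketch with a binary \(Y\) (all numbers here are arbitrary): knowing only \(\mathbb{E}[X \mid Y]\) and the distribution of \(Y\), we recover \(\mathbb{E}[X]\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Y = 1 with probability 0.3; X | Y=1 ~ N(5, 1), X | Y=0 ~ N(-1, 1)
p_y1 = 0.3
n = 1_000_000
y = rng.random(n) < p_y1
x = np.where(y, rng.normal(5, 1, n), rng.normal(-1, 1, n))

# Law of total expectation: E[X] = E[X|Y=1]P(Y=1) + E[X|Y=0]P(Y=0)
reconstructed = 5 * p_y1 + (-1) * (1 - p_y1)
print(x.mean(), reconstructed)  # both near 0.8
```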

Wick’s (also known as Isserlis’) theorem is a formula that allows one to compute higher-order moments of the multivariate normal distribution in terms of its covariance matrix: https://en.wikipedia.org/wiki/Isserlis%27_theorem

On multiplying two power series together (the Cauchy product): http://mathonline.wikidot.com/cauchy-product-of-power-series

How Did I Pass Undergrad?

The way to think about negative numbers is to think about ‘flips.’ Multiplying \(x\) by a negative number models flipping between two states

\[ x, -x, x, -x, ... \]

An even better way is to think about negative numbers as a ‘180 degree rotation.’ For that, imagine a plane with a real axis and an imaginary axis.

Code
import numpy as np
import matplotlib.pyplot as plt

# Set up plot
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'}, figsize=(4,4))

# Plot vectors
kw_dict = {'angles': 'xy', 'scale_units': 'xy', 'scale': 1.5}
ax.quiver(0, 0, 0, 1, color='black', **kw_dict)
ax.quiver(0, 0, np.pi / 2, 1, color='red', **kw_dict)
ax.quiver(0, 0, np.pi, 1, color='black', **kw_dict)
ax.quiver(0, 0, -np.pi / 2, 1, color='red', **kw_dict)

# Add annotations
ax.text(np.pi / 4, 0.5, r'$i$', ha='center', color='black')
ax.text(3 * np.pi / 4, 0.5, r'$i^2 = -1$', ha='center', color='black')
ax.text(5 * np.pi / 4, 0.5, r'$i^3 = -i$', ha='center', color='black')
ax.text(7 * np.pi / 4, 0.5, r'$i^4 = 1$', ha='center', color='black')


# Plot configuration
ax.set_rticks([])
ax.set_rmin(0)
ax.set_rmax(1)
ax.set_thetalim(-np.pi, np.pi)
ax.set_xticks(np.linspace(np.pi, -np.pi, 4, endpoint=False))

# Make transparent
plt.gcf().patch.set_alpha(0)

plt.show()
Figure 1: A polar plot in the complex plane

Then, multiplying with \(i\) is like a ‘90 degree rotation’ (on the complex plane):

\[ x, xi, -x, -xi, x, ... \]

Complex numbers model rotations in a 2D plane.

We use complex numbers to model circular movement on a 2D, complex plane. The general Euler’s formula is

\[ e^{ix} = \cos{x} + i \sin{x} \]

The RHS of this equation is describing a position on the 2D complex plane using the Cartesian style.

  • If we consider varying just the first term \(\cos{x}\), we would be yoyo-ing on the real (horizontal) axis.
  • In the same way, the \(i\sin{x}\) term yoyos on the imaginary (vertical) axis. Really, if you want to break it down further, you are yoyo-ing on the real axis \(\sin{x}\) and then rotating by 90 degrees (multiplying by \(i\)).
  • The two yoyos are offset relative to each other (\(\sin{x}\) vs. \(\cos{x}\)), allowing you to draw an arc. Summing the two terms, we can describe any point on the unit circle.

The LHS of this equation is an imaginary exponential. Imaginary exponentials are weird.

  • The derivative of \(e^{ix}\) is \(\frac{d}{dx}e^{ix} = ie^{ix}\), which is a velocity vector. The velocity vector is perpendicular to the original vector (because, remember, multiplying any vector by \(i\) rotates it by 90 degrees on the complex plane!)
  • A basic fact of vector calculus is that any vector \(\vec{v}\) has a direction \(\vec{d}\) and a magnitude \(\alpha\). In our case, the velocity vector has direction \(ie^{ix}\) and magnitude \(1\).
  • Because the velocity vector is always exactly perpendicular to the original vector and the magnitude never changes, it results in a constant rotation.

Overall, both sides of the equation describe a movement on a unit circle.
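Both claims are easy to check numerically: \(e^{ix}\) matches \(\cos x + i\sin x\), its magnitude stays at 1, and multiplying by \(i\) is a 90-degree rotation (the sample points are arbitrary):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Euler's formula: e^{ix} = cos(x) + i sin(x)
lhs = np.exp(1j * x)
rhs = np.cos(x) + 1j * np.sin(x)
print(np.allclose(lhs, rhs))        # True

# The point always sits on the unit circle
print(np.allclose(np.abs(lhs), 1))  # True

# Multiplying by i rotates by 90 degrees (pi/2)
print(np.allclose(1j * lhs, np.exp(1j * (x + np.pi / 2))))  # True
```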

Code
# Set up plot
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'}, figsize=(4,4))

# Plot vectors
length = 1  # avoid shadowing the built-in len()
theta = np.pi/16
kw_dict = {'angles': 'xy', 'scale_units': 'xy', 'scale': 1}
ax.quiver(0, 0, 0, length, color='#0075ff', **kw_dict)
ax.quiver(0, length, theta, length/np.cos(theta)-length, color='green', **kw_dict)
ax.quiver(0, 0, theta, 1, color='black', **kw_dict)

# Annotate
ax.text(-theta, 0.6, r'$\vec{v}$', ha='center', color='#0075ff')
ax.text(theta/4, 1.2, r'$i\vec{v} \Delta x$', ha='center', color='green')

# Plot configuration
ax.set_rmin(0)
ax.set_rmax(1)
ax.set_thetalim(-np.pi, np.pi)
ax.set_xticks(np.linspace(np.pi, -np.pi, 4, endpoint=False))
ax.set_rticks([0.5, 1, 1.5])
ax.grid(True)

# Make transparent
plt.gcf().patch.set_alpha(0)

plt.show()
Figure 2: A polar plot in the complex plane

To flesh out the rotation around a circle, Figure 2 visualizes a vector \(\vec{v}\) and a perpendicular vector \(i\vec{v}\) scaled by an arbitrary value \(\Delta x\). If we add those two vectors together \(\vec{v} + i\vec{v} \Delta x\), for a small \(\Delta x\) we get the new vector in black.

It should be second nature that “one-to-many” is not a function, because functions cannot be ambiguous in their output. However, we commonly have many-to-one mappings. I like to think of this general case as “many parents.” And sometimes, no parents. In this metaphor, parents live in the domain and children live in the codomain.

There are a couple of special cases: injection and surjection. Injection (one-to-one) is when each child has at most one parent (like an Uno-reversed “one-child policy”). Surjection (onto) is when there are no orphans, i.e. each child has at least one parent. Bijection is when each child has exactly one parent.

“sur”, ultimately from Latin, is “over” or “on top of” (surface, surcharge, surplus). Here it means “covering”: if a function is surjective, we have covered the entire codomain (no orphans!). I cannot find a good reason for why it’s called “injective.”
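The parent/child metaphor translates directly into a check over finite sets (the functions and sets below are made-up examples):

```python
def classify(f, domain, codomain):
    """Classify a finite function by counting each child's parents."""
    image = [f(x) for x in domain]
    injective = len(set(image)) == len(image)  # at most one parent each
    surjective = set(image) == set(codomain)   # at least one parent each
    return injective, surjective

# x -> x^2 on {-2,...,2}: -1 and 1 share a child, so not injective,
# but every element of {0, 1, 4} has a parent, so surjective
print(classify(lambda x: x * x, [-2, -1, 0, 1, 2], [0, 1, 4]))  # (False, True)

# x -> x + 1 on {0, 1, 2}: exactly one parent each, so bijective
print(classify(lambda x: x + 1, [0, 1, 2], [1, 2, 3]))  # (True, True)
```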

From this StackExchange post.

Citation

BibTeX citation:
@online{aswani2021,
  author = {Aswani, Nishant},
  title = {Technical {Glossary} {(Never} {Complete)}},
  date = {2021-08-01},
  url = {https://nishantaswani.com/articles/gloss/},
  langid = {en}
}
For attribution, please cite this work as:
Aswani, Nishant. 2021. “Technical Glossary (Never Complete).” August 1, 2021. https://nishantaswani.com/articles/gloss/.