Technical Glossary (Never Complete)

A home for definitions.

Author: Nishant Aswani
Affiliation: NYU Abu Dhabi, NYU Tandon
Published: August 1, 2021

ML Fundamentals

Too long to fit here; visit this page.

From the Geometric Deep Learning Book and Brilliant.

A subset \(A\) of a space \(X\) (i.e., \(A \subset X\)) is “dense” in \(X\) if it satisfies \[ A \cup \{\lim_{i \rightarrow \infty} a_i : a_i \in A\} = X. \]

This means that any point in \(X\) is arbitrarily close to a point in \(A\). A good example is the subset of rational numbers (subset \(A\)) in the set of real numbers (set \(X\)): any real number is either a rational number or there is a rational number very close to it.
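As a quick sanity check of the rationals-in-reals example, here is a minimal sketch using Python's `fractions` module to find a rational arbitrarily close to \(\pi\) (the denominator bound is an arbitrary choice):

```python
import math
from fractions import Fraction

# Find a rational approximation to pi with denominator at most 1000
approx = Fraction(math.pi).limit_denominator(1000)

# The resulting fraction lands within ~3e-7 of pi
error = abs(float(approx) - math.pi)
print(approx, error)
```

Raising the denominator bound tightens the approximation without limit, which is exactly what density promises.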

The Universal Approximation Theorem shows that the family of two-layer perceptrons, i.e. \(f(x) = c^T \sigma(Ax+b)\), is dense in the space of continuous functions on compact subsets of \(\mathbb{R}^d\).
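A hedged illustration of the theorem in action: with \(A\) and \(b\) drawn at random, solve only for \(c\) by least squares (a random-features sketch, not a full training loop; the target \(\cos(3x)\), the sizes, and the weight scales are all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer perceptron f(x) = c^T sigma(Ax + b), with d = 1
n_hidden, n_points = 300, 500
A = rng.normal(0, 2, size=(n_hidden, 1))     # random input weights
b = rng.uniform(-2, 2, size=(n_hidden, 1))   # random biases

x = np.linspace(-1, 1, n_points).reshape(1, -1)
target = np.cos(3 * x).ravel()

# Hidden activations, shape (n_points, n_hidden)
H = np.tanh(A @ x + b).T

# Solve for the output weights c in the least-squares sense
c, *_ = np.linalg.lstsq(H, target, rcond=None)

max_err = np.max(np.abs(H @ c - target))
print(max_err)  # small: the two-layer net tracks cos(3x) closely on [-1, 1]
```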

From this paper.

The Lipschitz continuity of a function quantifies how quickly its output can change as the input is perturbed. More precisely, the Lipschitz constant bounds the maximum change in the output norm per unit change in the input norm.

Given a function \(f\) with \(dom(f) \subseteq \mathbb{R}^d\), that function is \(\beta\)-Lipschitz continuous w.r.t. some \(\alpha\)-norm if, for all \(x,y \in dom(f)\), it satisfies

\[ \| f(x) - f(y) \|_\alpha \leq \beta \|x-y\|_\alpha \]

Typically, we are interested in the smallest such \(\beta\). We want this property because it is a proxy for good generalization. On the other hand, we don’t want the function to be too smooth either, because that suggests high bias and low variance, which can mean underfitting.
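As a sketch, the smallest \(\beta\) can be lower-bounded empirically by sampling difference quotients; \(\sin\) is 1-Lipschitz, so the sampled ratios should approach 1 but never exceed it (the function, interval, and sample sizes here are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin  # a 1-Lipschitz function (its derivative is bounded by 1)

# Sample pairs of nearby points and compute |f(x) - f(y)| / |x - y|
x = rng.uniform(-3, 3, size=100_000)
y = x + 1e-3
ratios = np.abs(f(x) - f(y)) / np.abs(x - y)

beta_hat = ratios.max()
print(beta_hat)  # close to 1, never meaningfully above it
```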

Linear Algebra

The Frobenius norm (defined only for matrices) is:

\[ \|\mathbf{X}\|_{F} = \sqrt{\sum\limits_{i=1}^n \sum\limits_{j=1}^m |X_{ij}|^2}. \]

It can be thought of as vectorizing the matrix and calculating the 2-norm.
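A quick numerical check of that equivalence (the matrix and its shape are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))

fro = np.linalg.norm(X, 'fro')    # Frobenius norm of the matrix
vec2 = np.linalg.norm(X.ravel())  # 2-norm of the vectorized matrix

print(fro, vec2)  # the two values agree
```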

The optimal rank-r approximation to \(\mathbf{X}\), in a least-squares sense, is given by the rank-r SVD truncation \(\tilde{\mathbf{X}}\).

\[ \text{arg}\min\limits_{\tilde{\mathbf{X}}, \text{s.t. rank}(\tilde{\mathbf{X}}) = r}\ \| \mathbf{X} - \tilde{\mathbf{X}}\|_F = \tilde{\mathbf{U}}\tilde{\mathbf{\Sigma}}\tilde{\mathbf{V}}^*. \]
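A sketch verifying the truncation numerically: the rank-r SVD truncation's error should equal \(\sqrt{\sum_{i>r}\sigma_i^2}\), the norm of the discarded singular values (matrix sizes and the rank are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
r = 2

# Rank-r SVD truncation: keep the top-r singular triplets
U, S, Vh = np.linalg.svd(X, full_matrices=False)
X_r = U[:, :r] * S[:r] @ Vh[:r]

err = np.linalg.norm(X - X_r, 'fro')
expected = np.sqrt(np.sum(S[r:] ** 2))  # discarded singular values
print(err, expected)  # the two agree
```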

Geometry

From the respective Wikipedia articles.

An isomorphism is a structure-preserving mapping that is invertible (e.g., taking the \(\log\), multiplying by an invertible matrix, etc.).

A diffeomorphism is an invertible function between manifolds such that both the function and its inverse are continuously differentiable.

A homeomorphism is a continuous, invertible function whose inverse is also continuous; it preserves the topological properties.

Statistics

The expected value is a weighted average of a random variable, where the weights come from a probability distribution. As such, the weights sum to \(1\). The simplest case is:

\[ \mathbb{E}[X] = \sum_{x\in X} x \cdot \operatorname{pmf}(x). \]

Check out Note 9 LOTUS below!

From these two StackExchange posts [1, 2].

Often, like here, the expectation is introduced as

\[ \mathbb{E}[g(X)] = \sum_{x\in X} g(x) \cdot \operatorname{pmf}(x), \]

which relies on the assumption that you can just replace \(x\) with \(g(x)\), from the real definition of expectation. Well, this assumption is true, hence LOTUS! But, it requires a little bit of work to arrive there. If we define \(Y=g(X)\)

\[ \begin{aligned} \mathbb{E}[g(X)] &= \mathbb{E}[Y] = \sum_{y\in Y} y \cdot \operatorname{pmf}(y) \\ &= \sum_{y\in Y} y \,\text{Pr}[Y=y] \\ &= \sum_{y\in Y} y \,\text{Pr}[g(X)=y] \\ &= \sum_{y\in Y} y \sum_{x: g(x)=y} \operatorname{pmf}(x) \\ &= \sum_{y\in Y} \sum_{x: g(x)=y} y \cdot \operatorname{pmf}(x) \\ &= \sum_{x\in X} g(x) \cdot \operatorname{pmf}(x) \\ \end{aligned} \]

In words, we start off by defining a new random variable \(Y=g(X)\) and writing the expectation over that. Then, we rewrite the PMF of \(Y\). Next, we account for the fact that \(g(\cdot)\) might not be injective, which means there could be many \(x\) that map to a single \(y\). So, \(\text{Pr}[Y=y]\) needs to sum over all \(x\) values where \(g(x)=y\). So, \(\text{Pr}[g(X)=y]\) is a sum of the probabilities of the relevant \(x\). Finally, we rearrange the double sum: we’re partitioning all \(x\) values by which \(y\) they map to, but since \(y = g(x)\) for each term in the inner sum, we can replace \(y\) with \(g(x)\) and simply sum over all \(x\) directly.
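The derivation can be checked numerically on a small discrete distribution; here \(g(x)=x^2\) is deliberately non-injective, and the distribution itself is an arbitrary choice:

```python
import numpy as np

# A discrete X with a pmf over its support
xs = np.array([-2, -1, 0, 1, 2])
pmf = np.array([0.1, 0.2, 0.4, 0.2, 0.1])

g = lambda x: x ** 2  # non-injective: -1 and 1 both map to 1

# LOTUS: sum g(x) * pmf(x) over x directly
lotus = np.sum(g(xs) * pmf)

# "Real" definition: build the pmf of Y = g(X), then sum y * pmf(y)
ys = np.unique(g(xs))
pmf_y = np.array([pmf[g(xs) == y].sum() for y in ys])
direct = np.sum(ys * pmf_y)

print(lotus, direct)  # the two agree
```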

Notes come from here.

The moment of a random variable is the expected value of a power of that variable:

\[ \mathbb{E}[X^k] = \int x^k \operatorname{pdf}(x)\, dx. \] When we use the moment generating function (MGF), we sidestep this integral by instead computing:

\[ \mathbb{E}[e^{tX}] = \int e^{tx} \operatorname{pdf}(x)\, dx. \]

This works because we can expand the exponential with its Maclaurin series, which is just the Taylor series evaluated at 0:

\[ \begin{aligned} e^{tX} &= 1 + \frac{X}{1!}t + \frac{X^2}{2!}t^2 + \dots \\ \mathbb{E}[e^{tX}] &= 1 + \frac{\mathbb{E}[X]}{1!}t + \frac{\mathbb{E}[X^2]}{2!}t^2 + \dots \\ \end{aligned} \]

Looking for the first moment, one can take the first derivative of this expected value with respect to \(t\). Then, setting \(t=0\) makes all the remaining terms vanish, leaving \(\mathbb{E}[X]\).

Conceptually then, we have access to a function that is very easy to expand into a sum of quantities. And, when you expand, then take the correct derivative of that expansion, it will give you the desired moment. Hence, the term “generating.” This is what that looks like symbolically:

\[ M^{(k)}_X(0) = \left.\frac{d^kM_X(t)}{dt^k}\right\rvert_{t=0} = \mathbb{E}[X^k]. \]

MGFs are neat because it’s often easier to derive the MGF once (integrate the exponential using LOTUS) and then compute moments by differentiation, rather than repeatedly use the integral definition of expectation for each moment.
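A sketch with the standard normal, whose MGF has the known closed form \(M_X(t)=e^{t^2/2}\): finite differences at \(t=0\) recover the first two moments, \(\mathbb{E}[X]=0\) and \(\mathbb{E}[X^2]=1\) (the step size is an arbitrary choice):

```python
import numpy as np

M = lambda t: np.exp(t ** 2 / 2)  # MGF of the standard normal
h = 1e-4

# First derivative at 0 -> E[X] = 0
m1 = (M(h) - M(-h)) / (2 * h)

# Second derivative at 0 -> E[X^2] = 1
m2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2

print(m1, m2)  # approximately 0 and 1
```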

The law of total expectation looks like \[ \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]] = \mathbb{E}[X|y_1]P(y_1) + \mathbb{E}[X|y_2]P(y_2) + \dots \]

Sometimes, we only know the expectation of a random variable conditioned on another. If we know the probability distribution of the variable we conditioned on, we can reconstruct the original expectation.
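A numeric sketch with a binary \(Y\) (all numbers here are arbitrary): knowing only \(\mathbb{E}[X \mid Y]\) and the distribution of \(Y\), we recover \(\mathbb{E}[X]\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Y = 1 with probability 0.3; X | Y=1 ~ N(5, 1), X | Y=0 ~ N(-1, 1)
p_y1 = 0.3
n = 1_000_000
y = rng.random(n) < p_y1
x = np.where(y, rng.normal(5, 1, n), rng.normal(-1, 1, n))

# Law of total expectation: E[X] = E[X|Y=1]P(Y=1) + E[X|Y=0]P(Y=0)
reconstructed = 5 * p_y1 + (-1) * (1 - p_y1)
print(x.mean(), reconstructed)  # both near 0.8
```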

Wick’s (also known as Isserlis’) theorem is a formula that allows one to compute higher-order moments of the multivariate normal distribution in terms of its covariance matrix: https://en.wikipedia.org/wiki/Isserlis%27_theorem

On multiplying two power series together (the Cauchy product): http://mathonline.wikidot.com/cauchy-product-of-power-series

How Did I Pass Undergrad?

The way to think about negative numbers is to think about ‘flips.’ Multiplying \(x\) by a negative number models flipping between two states

\[ x, -x, x, -x, ... \]

An even better way is to think about negative numbers as a ‘180 degree rotation.’ For that, imagine a plane with a real axis and an imaginary axis.

Code
import numpy as np
import matplotlib.pyplot as plt

# Set up plot
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'}, figsize=(4,4))

# Plot vectors
kw_dict = {'angles': 'xy', 'scale_units': 'xy', 'scale': 1.5}
ax.quiver(0, 0, 0, 1, color='black', **kw_dict)
ax.quiver(0, 0, np.pi / 2, 1, color='red', **kw_dict)
ax.quiver(0, 0, np.pi, 1, color='black', **kw_dict)
ax.quiver(0, 0, -np.pi / 2, 1, color='red', **kw_dict)

# Add annotations
ax.text(np.pi / 4, 0.5, r'$i$', ha='center', color='black')
ax.text(3 * np.pi / 4, 0.5, r'$i^2 = -1$', ha='center', color='black')
ax.text(5 * np.pi / 4, 0.5, r'$i^3 = -i$', ha='center', color='black')
ax.text(7 * np.pi / 4, 0.5, r'$i^4 = 1$', ha='center', color='black')


# Plot configuration
ax.set_rticks([])
ax.set_rmin(0)
ax.set_rmax(1)
ax.set_thetalim(-np.pi, np.pi)
ax.set_xticks(np.linspace(np.pi, -np.pi, 4, endpoint=False))

# Make transparent
plt.gcf().patch.set_alpha(0)

plt.show()
Figure 1: A polar plot in the complex plane

Then, multiplying with \(i\) is like a ‘90 degree rotation’ (on the complex plane):

\[ x, xi, -x, -xi, x, ... \]

Complex numbers model rotations in a 2D plane.

We use complex numbers to model circular movement on a 2D, complex plane. The general Euler’s formula is

\[ e^{ix} = \cos{x} + i \sin{x} \]

The RHS of this equation is describing a position on the 2D complex plane using the Cartesian style.

  • If we consider varying just the first term \(\cos{x}\), we would be yoyo-ing on the real (horizontal) axis.
  • In the same way, the \(i\sin{x}\) term yoyos on the imaginary (vertical) axis. Really, if you want to break it down further, you are yoyo-ing on the real axis \(\sin{x}\) and then rotating by 90 degrees (multiplying by \(i\)).
  • The two yoyos are offset relative to each other (\(\sin{x}\) vs. \(\cos{x}\)), allowing you to draw an arc. Summing the two terms, we can describe any point on the unit circle.

The LHS of this equation is an imaginary exponential. Imaginary exponentials are weird.

  • The derivative of \(e^{ix}\) is \(\frac{d}{dx}e^{ix} = ie^{ix}\), which is a velocity vector. The velocity vector is perpendicular to the original vector (because, remember, multiplying any vector by \(i\) rotates it by 90 degrees on the complex plane!)
  • A basic fact of vector calculus is that any vector \(\vec{v}\) has a direction \(\vec{d}\) and a magnitude \(\alpha\). In our case, the velocity vector has direction \(ie^{ix}\) and magnitude \(1\).
  • Because the velocity vector is always exactly perpendicular to the original vector and the magnitude never changes, it results in a constant rotation.

Overall, both sides of the equation describe a movement on a unit circle.
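Both claims are easy to check numerically: \(e^{ix}\) matches \(\cos x + i\sin x\), its magnitude stays at 1, and multiplying by \(i\) is a 90-degree rotation (the sample points are arbitrary):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Euler's formula: e^{ix} = cos(x) + i sin(x)
lhs = np.exp(1j * x)
rhs = np.cos(x) + 1j * np.sin(x)
print(np.allclose(lhs, rhs))        # True

# The point always sits on the unit circle
print(np.allclose(np.abs(lhs), 1))  # True

# Multiplying by i rotates by 90 degrees (pi/2)
print(np.allclose(1j * lhs, np.exp(1j * (x + np.pi / 2))))  # True
```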

Code
# Set up plot
fig, ax = plt.subplots(subplot_kw={'projection': 'polar'}, figsize=(4,4))

# Plot vectors
length = 1  # avoid shadowing the built-in len()
theta = np.pi/16
kw_dict = {'angles': 'xy', 'scale_units': 'xy', 'scale': 1}
ax.quiver(0, 0, 0, length, color='#0075ff', **kw_dict)
ax.quiver(0, length, theta, length/np.cos(theta)-length, color='green', **kw_dict)
ax.quiver(0, 0, theta, 1, color='black', **kw_dict)

# Annotate
ax.text(-theta, 0.6, r'$\vec{v}$', ha='center', color='#0075ff')
ax.text(theta/4, 1.2, r'$i\vec{v} \Delta x$', ha='center', color='green')

# Plot configuration
ax.set_rmin(0)
ax.set_rmax(1)
ax.set_thetalim(-np.pi, np.pi)
ax.set_xticks(np.linspace(np.pi, -np.pi, 4, endpoint=False))
ax.set_rticks([0.5, 1, 1.5])
ax.grid(True)

# Make transparent
plt.gcf().patch.set_alpha(0)

plt.show()
Figure 2: A polar plot in the complex plane

To flesh out the rotation around a circle, Figure 2 visualizes a vector \(\vec{v}\) and a perpendicular vector \(i\vec{v}\) scaled by an arbitrary value \(\Delta x\). If we add those two vectors together \(\vec{v} + i\vec{v} \Delta x\), for a small \(\Delta x\) we get the new vector in black.

It should be second nature that “one-to-many” is not a function, because functions cannot be ambiguous in their output. However, we commonly have many-to-one mappings. I like to think of this general case as “many parents.” And sometimes, no parents. In this metaphor, parents live in the domain and children live in the codomain.

There are a couple of special cases: injection and surjection. Injection (one-to-one) is when each child has at most one parent (like an Uno-reversed “one-child policy”). Surjection (onto) is when there are no orphans, i.e. each child has at least one parent. Bijection is when each child has exactly one parent.

“sur”, ultimately from Latin, is “over” or “on top of” (surface, surcharge, surplus). Here it means “covering”: if a function is surjective, we have covered the entire codomain (no orphans!). I cannot find a good reason for why it’s called “injective.”
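The parent/child metaphor translates directly into a check over finite sets (the functions and sets below are made-up examples):

```python
def classify(f, domain, codomain):
    """Classify a finite function by counting each child's parents."""
    image = [f(x) for x in domain]
    injective = len(set(image)) == len(image)  # at most one parent each
    surjective = set(image) == set(codomain)   # at least one parent each
    return injective, surjective

# x -> x^2 on {-2,...,2}: -1 and 1 share a child, so not injective,
# but every element of {0, 1, 4} has a parent, so surjective
print(classify(lambda x: x * x, [-2, -1, 0, 1, 2], [0, 1, 4]))  # (False, True)

# x -> x + 1 on {0, 1, 2}: exactly one parent each, so bijective
print(classify(lambda x: x + 1, [0, 1, 2], [1, 2, 3]))  # (True, True)
```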

From this StackExchange post.

Citation

BibTeX citation:
@online{aswani2021,
  author = {Aswani, Nishant},
  title = {Technical {Glossary} {(Never} {Complete)}},
  date = {2021-08-01},
  url = {https://nishantaswani.com/articles/gloss/},
  langid = {en}
}
For attribution, please cite this work as:
Aswani, Nishant. 2021. “Technical Glossary (Never Complete).” August 1, 2021. https://nishantaswani.com/articles/gloss/.