A subset \(A\) of a space \(X\) (\(A \subset X\)) is “dense” in \(X\) if it satisfies \[
A \cup \{\lim_{i \rightarrow \infty} a_i : a_i \in A\} = X.
\]
This means that any point in \(X\) is arbitrarily close to a point in \(A\). A good example is the subset of rational numbers (subset \(A\)) in the set of real numbers (set \(X\)): any real number is either a rational number or has a rational number arbitrarily close to it.
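As a quick numerical illustration (a minimal sketch, nothing more): Python's `Fraction.limit_denominator` finds rationals arbitrarily close to \(\pi\); the denominator bounds below are arbitrary choices.

```python
from fractions import Fraction
import math

# The rationals are dense in the reals: for any real number (pi here),
# there is a rational arbitrarily close to it.
x = math.pi
for max_denom in (10, 100, 10_000, 1_000_000):
    q = Fraction(x).limit_denominator(max_denom)
    print(f"{q}  error = {abs(float(q) - x):.2e}")
```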
The Universal Approximation Theorem shows that the family of two-layer perceptrons, i.e. functions of the form \(f(x) = c^T \sigma(Ax+b)\) with arbitrary width, is dense in the space of continuous functions on compact subsets of \(\mathbb{R}^d\).
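A small sketch of the flavor of this statement, not the theorem itself: below, the hidden weights \(A, b\) are left random and only the output weights \(c\) are fit by least squares (a random-features approximation). The target \(\sin(2x)\), the \(\tanh\) nonlinearity, the interval, and the widths are all arbitrary choices; the sup-norm error tends to shrink as the width grows.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh                        # any sigmoidal nonlinearity

# A continuous target function on a compact interval
x = np.linspace(-3, 3, 400).reshape(-1, 1)
y = np.sin(2 * x).ravel()

for width in (5, 50, 500):
    A = rng.normal(size=(width, 1))    # hidden weights, kept random here
    b = rng.normal(size=width)         # hidden biases
    H = sigma(x @ A.T + b)             # hidden activations, shape (400, width)
    c, *_ = np.linalg.lstsq(H, y, rcond=None)   # fit only the output layer
    print(f"width={width:4d}  sup error={np.max(np.abs(H @ c - y)):.4f}")
```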
The Lipschitz continuity of a function refers to its smoothness as the input is perturbed. More precisely, the Lipschitz constant bounds the absolute change in the output per unit change (in norm) of the input.
Given a function \(f\) with \(\operatorname{dom}(f) \subseteq \mathbb{R}^d\), that function is \(\beta\)-Lipschitz continuous w.r.t. some \(\alpha\) norm if for all \(x,y \in \operatorname{dom}(f)\) it satisfies \[
\|f(x) - f(y)\|_{\alpha} \leq \beta \, \|x - y\|_{\alpha}.
\]
Typically, we are interested in the smallest such \(\beta\). We want this property because it is a proxy for good generalization. On the other hand, we don’t want the function to be too smooth, since that suggests high bias and low variance, which can mean underfitting.
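A minimal sketch of how one might probe this numerically: the helper `lipschitz_lower_bound` below is a hypothetical name, and sampling random pairs only yields a lower bound on the true constant.

```python
import numpy as np

def lipschitz_lower_bound(f, dim, n_pairs=100_000, ord=2, seed=0):
    """Monte-Carlo lower bound on the Lipschitz constant of f w.r.t. a chosen norm."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_pairs, dim))
    y = rng.normal(size=(n_pairs, dim))
    change_out = np.abs(f(x) - f(y))                     # |f(x) - f(y)|
    change_in = np.linalg.norm(x - y, ord=ord, axis=1)   # ||x - y||
    return np.max(change_out / change_in)

# f(x) = sin(w.x) is ||w||_2-Lipschitz w.r.t. the l2 norm (here ||w||_2 = 5);
# the sampled ratio can only approach that constant from below.
w = np.array([3.0, -4.0])
print(lipschitz_lower_bound(lambda X: np.sin(X @ w), dim=2))
```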
The \(k\)-th moment of a random variable is the expected value of its \(k\)-th power:
\[
\mathbb{E}[X^k] = \int x^k f_X(x)\, dx.
\] When we use the moment generating function (MGF), we are trying to sidestep this integral by instead working with \[
M_X(t) = \mathbb{E}[e^{tX}].
\]
We do this because we can expand the exponential with its Maclaurin series. The Maclaurin series is just the Taylor series centered at 0.
Looking for the first moment, one can take the first derivative of this expected value with respect to \(t\). Then, just set \(t = 0\) to discard all the remaining terms.
Conceptually then, we have access to a function that is very easy to expand into a sum of quantities. And, when you expand, then take the correct derivative of that expansion, it will give you the wanted moment. Hence, the term “generating.” This is what that looks like symbolically:
\[
M_X(t) = \mathbb{E}[e^{tX}] = 1 + t\,\mathbb{E}[X] + \frac{t^2}{2!}\,\mathbb{E}[X^2] + \frac{t^3}{3!}\,\mathbb{E}[X^3] + \dots, \qquad \frac{d^k}{dt^k} M_X(t) \Big|_{t=0} = \mathbb{E}[X^k].
\]
MGFs are neat because it’s often easier to derive the MGF once (integrating the exponential using LOTUS) and then compute moments by differentiation, rather than repeatedly using the integral definition of expectation to get each moment.
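A small sympy sketch of this workflow for the standard normal distribution: derive the MGF once by integration, then read off moments by differentiating and setting \(t = 0\).

```python
import sympy as sp

x, t = sp.symbols('x t', real=True)
pdf = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)   # standard normal density

# Derive the MGF once, via the integral definition of expectation
M = sp.simplify(sp.integrate(sp.exp(t * x) * pdf, (x, -sp.oo, sp.oo)))  # exp(t**2/2)

# k-th moment: differentiate the MGF k times, then set t = 0
for k in range(1, 5):
    via_mgf = sp.diff(M, t, k).subs(t, 0)
    direct = sp.integrate(x**k * pdf, (x, -sp.oo, sp.oo))
    print(k, via_mgf, direct)                  # 0, 1, 0, 3 for the standard normal
```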
Definition 14 (Tower rule / the law of total expectation) The law of total expectation looks like \[
\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X \mid Y = y_1]\,P(Y = y_1) + \mathbb{E}[X \mid Y = y_2]\,P(Y = y_2) + \dots
\]
Sometimes, we only know the expectation of a random variable conditioned on another. If we know the probability distribution of the variable we conditioned on, we can reconstruct the original expectation.
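A minimal Monte-Carlo sketch of the identity for a two-component mixture; the component probabilities and conditional means below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

p = np.array([0.3, 0.7])        # P(Y = y1), P(Y = y2)
means = np.array([1.0, 5.0])    # E[X | Y = y1], E[X | Y = y2]

y = rng.choice(2, size=n, p=p)              # draw the conditioning variable
x = rng.normal(loc=means[y], scale=1.0)     # draw X given Y

print(x.mean())      # Monte-Carlo estimate of E[X]
print(p @ means)     # E[X|y1] P(y1) + E[X|y2] P(y2) = 3.8
```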
Definition 15 (Law of the unconscious statistician (LOTUS)) To compute the expectation of a function \(g\) of a random variable \(X\), we can integrate \(g\) directly against the distribution of \(X\), \[
\mathbb{E}[g(X)] = \int g(x) f_X(x)\, dx,
\] without ever deriving the distribution of \(g(X)\) itself.
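A tiny sketch: the second moment of a Uniform(-1, 1) variable via LOTUS, integrating \(x^2\) against the density of \(X\).

```python
import sympy as sp

x = sp.Symbol('x', real=True)
pdf = sp.Rational(1, 2)     # density of X ~ Uniform(-1, 1)

# E[X^2] by LOTUS: integrate g(x) = x^2 against the density of X,
# never deriving the distribution of X^2 itself.
print(sp.integrate(x**2 * pdf, (x, -1, 1)))   # 1/3
```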
Definition 16 (Wick’s theorem) Wick’s probability theorem, also known as Isserlis’ theorem (https://en.wikipedia.org/wiki/Isserlis%27_theorem), is a formula that allows one to compute higher-order moments of the multivariate normal distribution in terms of its covariance matrix.
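A minimal Monte-Carlo check of the fourth-moment case, \(\mathbb{E}[X_1 X_2 X_3 X_4] = \Sigma_{12}\Sigma_{34} + \Sigma_{13}\Sigma_{24} + \Sigma_{14}\Sigma_{23}\), for an arbitrary covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# A zero-mean multivariate normal with an arbitrary covariance matrix
A = rng.normal(size=(4, 4))
cov = A @ A.T
z = rng.multivariate_normal(np.zeros(4), cov, size=2_000_000)

empirical = np.mean(z[:, 0] * z[:, 1] * z[:, 2] * z[:, 3])

# Isserlis / Wick: sum over the three ways of pairing the four indices
wick = cov[0, 1] * cov[2, 3] + cov[0, 2] * cov[1, 3] + cov[0, 3] * cov[1, 2]
print(empirical, wick)   # agree up to Monte-Carlo error
```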
Then, multiplying by the imaginary unit \(i\) is like a “90 degree rotation”: starting from a number \(x\) and repeatedly multiplying by \(i\) cycles through
\[
x, \; ix, \; -x, \; -ix, \; x, \; \dots
\]
Imaginary numbers can model things that rotate between two dimensions, or that have a periodic nature.
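For example, repeated multiplication by `1j` in Python cycles a number around the four axes:

```python
# Repeated multiplication by 1j rotates a point by 90 degrees each time:
# 2 -> 2j -> -2 -> -2j -> 2, back where we started after four steps.
z = 2 + 0j
for _ in range(5):
    print(z)
    z *= 1j
```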
Definition 20 (Euler’s formula)
We use imaginary numbers to model circular movement. The general Euler’s formula is
\[
e^{ix} = \cos{x} + i \sin{x}
\]
The right hand side of this equation is describing a position on the complex plane using the Cartesian style.
If we consider varying just the first term \(\cos{x}\), we would be yoyo-ing on the real (horizontal) axis.
In the same way, the \(i\sin{x}\) term yoyos on the imaginary (vertical) axis. If you want to break it down further, you are yoyo-ing on the real axis with \(\sin{x}\) and then rotating by 90 degrees (multiplying by \(i\)).
The two yoyos are offset relative to each other (\(\sin{x}\) vs. \(\cos{x}\)). Summing the two terms, we can describe any point on the unit circle.
The left hand side of this equation is an imaginary exponential. Imaginary exponentials are weird.
The derivative of \(e^{ix}\) is \(\frac{d e^{ix}}{dx} = ie^{ix}\), which is a velocity vector. The velocity vector is perpendicular to the original vector, because multiplying any vector by \(i\) rotates it by \(\pi/2\).
A fact of vector calculus is that any vector \(\vec{v}\) can be written as a magnitude \(\alpha\) times a unit direction vector \(\vec{d}\). In our case, the velocity vector has direction \(ie^{ix}\) and magnitude \(|ie^{ix}| = 1\).
Because the velocity vector is always exactly perpendicular to the original vector and the magnitude never changes, it results in a constant rotation.
Overall, both sides of the equation describe a movement on a unit circle.
To flesh out the rotation around a circle, Figure 2 visualizes a vector \(\vec{v}\) and a perpendicular vector \(i\vec{v}\) scaled by an arbitrary value \(\Delta x\). If we add those two vectors together, \(\vec{v} + i\vec{v}\,\Delta x\), then for a small \(\Delta x\) we get the new vector shown in black.
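A minimal sketch of that update: repeatedly adding the perpendicular step \(i\vec{v}\,\Delta x\) traces out the unit circle and lands near \(e^{ix}\); the step size and step count below are arbitrary choices.

```python
import cmath

dx, steps = 1e-4, 10_000    # integrate from x = 0 to x = 1
v = 1 + 0j                  # start at e^{i*0} = 1
for _ in range(steps):
    v = v + 1j * v * dx     # add the small perpendicular step from Figure 2

print(v)                    # approximately cos(1) + i sin(1)
print(cmath.exp(1j * 1.0))  # exact value
print(abs(v))               # stays close to 1: we never leave the unit circle
```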