Cross-Entropy and Maximum Likelihood Estimation

Generative vs Discriminative

The data-generation process is defined as $p(x) p(y \mid x) = p(x,y)$. This joint probability is unknown to the algorithm and is called the ground truth. We assume that the dataset is generated by i.i.d sampling of this joint probability

However, this assumption is not so great for the real-world. That’s why there is a push for learning algorithms where this assumption is violated and the dataset is not independently and identically distributed.

Using the dataset $\mathcal{D}$ to capture $p(x,y)$ is called generative modeling whereas using $\mathcal{D}$ to capture $p(y \mid x)$ is called discriminative modeling

“Capturing” something refers to be able to describe that component: perhaps determining the mean of $p(y \mid x)$ or being able to generate new data points so that it looks like they were from the joint $p(x,y)$

We want to choose a family of distributions (or model), parametrized by some variable, that approximates the component we want to capture. An instance with a specific set of parameters is a hypothesis and all the possible instances with possible parameters is a hypothesis space.

To pick the best hypothesis, we need a fitness/cost function to compare hypotheses. We minimize the fitness function to obtain the best hypothesis.

But first, a little about information theory.

Probability and Information Theory

If the probability of an event is very low, then it carries a large amount of “information”. On the other hand, if the probability of event is high, then it occurring does not share a lot of “information”. However, information is only relevant for a single discrete event.

We define the information of an event $X$ as:

\[I(X) = -log_2(p)\]


If we want to get some sense of information for a random variable then we look towards “entropy”. When measuring information for a set of events, additional events with probably zero should not influence it. The maximum information of two events should not be more than their sum. The observation of nearly certain events should be nearly zero.

With these in mind, entropy is defined as as the expected value of information:

\[\begin{align*} H(X) &= E_{x\sim P}[H(I)] \\ \\ &= E_{x\sim P}[-log(p(x))] \\ \\ &= -E_{x\sim P}[log(p(x))] \\ \\ &= -\int_xp(x)log(p(x)) \end{align*}\]

KL Divergence and Cross-Entropy

If we have a generative model $Q = q(x)$ and we want to compare it to its ground-truth $P = p(x)$, we can look towards the idea of KL Divergence. KL Divergence is a way to measure the dissimilarity between two probability distributions

We can take the difference in the entropy (with respect to one distribution) of both probability distributions:

\[\begin{align*} D_{KL}(P ||Q) &= E_{x\sim P}[I(Q) - I(P)] \\ &= E_{x\sim P}[-log(q(x)) - (-log(p(x))] \\ &= E_{x\sim P}[log(p(x)) - log(q(x))] \\ &= E_{x\sim P}[log \frac{p(x)}{q(x)}] \\ \end{align*}\]

Because KL Divergence is taken with respect to one of the distributions, it is not a symmetric operation. In other words, $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P) $

Another, similar, way to formulate the KL divergence is:

\[\begin{align*} D_{KL}(P ||Q) &= E_{x\sim P}[I(Q) - I(P)] \\ &= E_{x\sim P}[I(Q)] - E_{p(x)}[I(P)] \\ &= E_{x\sim P}[I(Q)] - H(P) \\ &= E_{x\sim P}[-log(q(x))] - H(P) \\ \end{align*}\]

The best hypothesis lies at the pick that minimizes the KL Divergence with respect to the parameters. Since $H(P)$ is a constant value that we cannot influence (its the ground truth and is not affected by the parameters), we can also just minimize the other component $E_{x\sim P}[-log(q(x))]$.

Formally, this is known as the cross-entropy

\[CE(P,Q) = E_{x\sim P}[-log(q(x))]\]

Another way to define it is by rearranging the KL Divergence definition

\[CE(P,Q) = D_{KL}(P||Q) + H(P)\]

Relating to Log Likelihood

For a discriminative model, parameterized by $w$, the cross-entropy is:

\[CE = \sum_k^KE_{y\sim P}[-log(p(y_{i,k}|x_{i,k};w))]\]

for the $i^{th}$ training instance.

If we condense the machine learning models’ predicted distribution as $\hat{y_i} = p(y_i \mid x_i;w)$, then for $k$ classes:

\[\begin{align*} CE_i &= \sum_k^K E_{y\sim P}[-log(\hat{y}_{i,k})] \\ &= \sum_k^K-E_{y\sim P}[log(\hat{y}_{i,k})] \\ &= \sum_k^K y_{i,k}log(\hat{y}_{i,k}) \end{align*}\]

For all instances, we take the mean over all $N$ training instances:

\[\begin{align*} CE &= -\frac{1}{N}\sum_i^N\sum_k^K y_{i,k}log(\hat{y}_{i,k}) \\ \end{align*}\]

Since it is a categorical loss, the value $y_{i,k}$ is 0 for all classes except the ground truth class, where it is 1. The inner sum then collapses and becomes:

\[\begin{align*} CE &= -\frac{1}{N}\sum_i^Nlog(\hat{y}_{i,k_{gt}}) \\ \end{align*}\]

which is equivalent to:

\[\begin{align*} CE &= -\frac{1}{N}\sum_i^Nlog(p(y_i | x_i;w)) \\ \end{align*}\]

Minimizing this cross entropy is equivalent to maximizing

\[\begin{align*} \frac{1}{N}\sum_i^Nlog(p(y_i | x_i;w)) \\ \end{align*}\]