Nishant Aswani

NYU Abu Dhabi, NYU Tandon

From the Geometric Deep Learning book, this pleasant seminar, and the paper that the seminar is based on.

In supervised learning, we often care about minimizing an objective. A classic objective to minimize is the population risk or true risk, written as \[ \mathcal{R}(f) := \mathbb{E}[\ell(f(x), y)] = \int_{X\times Y}\ell(f(x), y) \, d P(x,y). \] Unfortunately, this is a theoretical quantity because we usually don’t have access to the real data distribution \(P\).

Instead, we do the next best thing: we minimize the empirical risk, defined as \[ \hat{\mathcal{R}}(f) := \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i). \] Our hope is that this empirical risk converges to the population risk as \(N \rightarrow \infty\). The empirical risk is a proxy for our real goal, and of course we will be penalized for using a proxy. This penalty is called the excess risk; as a first pass, we can write it as the difference between the two risks, \(\hat{\mathcal{R}}(f) - \mathcal{R}(f)\) (we will sharpen this definition shortly).
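To make these two quantities concrete, here is a minimal sketch in Python, assuming a toy one-dimensional regression setup with squared-error loss; the distribution, the predictor, and all names below are invented for illustration. Since we cannot integrate over \(P\) directly, the "population" risk is itself approximated with a very large sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(y_pred, y):
    # squared-error loss, standing in for a generic loss function
    return (y_pred - y) ** 2

def f(x):
    # a fixed predictor whose risk we want to evaluate
    return 0.9 * x

def sample(n):
    # toy data distribution P: x ~ N(0, 1), y = x + noise
    x = rng.normal(size=n)
    y = x + 0.1 * rng.normal(size=n)
    return x, y

# empirical risk on a finite sample of size N
x, y = sample(50)
empirical_risk = np.mean(loss(f(x), y))

# population risk, approximated here by averaging over a very large sample
x_big, y_big = sample(1_000_000)
population_risk = np.mean(loss(f(x_big), y_big))

print(empirical_risk, population_risk)  # the gap shrinks (in expectation) as N grows
```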

Even if we had access to the entire data distribution, we would still need a function to plug into the true risk as a benchmark. For that, we imagine the ideal function \(f^*\) (the Bayes optimal predictor), whose true risk \(\mathcal{R}(f^*)\) is the lowest value any predictor could achieve.

To make progress, we have to define how we minimize the empirical risk. Given a class \(\mathcal{F} = \{f_\theta : \theta \in \Theta \}\), the function that minimizes the empirical risk is the estimator \(f_{\hat{\theta}}\), where \[ \hat{\theta} \in \arg \min_{\theta \in \Theta} \hat{\mathcal{R}}(f_\theta) = \arg \min_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \ell(f_\theta(x_i), y_i). \] This is exactly the definition of the empirical risk, except we have introduced \(\theta\) as a set of values that parameterizes the function \(f\), giving us a more concrete object \(f_\theta\) to work with. We also threw in the \(\arg\min\) to say that we are minimizing the risk over \(\theta\). These are all symbolic changes, but now we have a real problem to solve.
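As a hedged illustration of empirical risk minimization, the sketch below fits a one-parameter linear class \(f_\theta(x) = \theta x\) with plain gradient descent on a toy dataset; the data, learning rate, and step count are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# a finite training set (x_i, y_i), i = 1, ..., N
N = 50
x = rng.normal(size=N)
y = 2.0 * x + 0.1 * rng.normal(size=N)

# model class F = {f_theta : theta in R}, with f_theta(x) = theta * x
def f(theta, x):
    return theta * x

def empirical_risk(theta):
    return np.mean((f(theta, x) - y) ** 2)

# plain gradient descent on the empirical risk to find theta_hat
theta, lr = 0.0, 0.1
for _ in range(200):
    grad = np.mean(2.0 * (f(theta, x) - y) * x)  # derivative of the mean squared loss
    theta -= lr * grad

theta_hat = theta
print(theta_hat, empirical_risk(theta_hat))  # theta_hat should land near 2.0
```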

After optimization, let’s call the parameters we end up with \(\hat{\theta}\). Then our empirical risk minimizer is \(f_{\hat{\theta}}\). Importantly, now that we have \(f_{\hat{\theta}}\), we must distinguish it from \(f^*\), which, assuming access to the true risk, is the ideal solution. Our excess risk actually looks like: \[ \hat{\mathcal{R}}(f_{\hat{\theta}}) - \mathcal{R}(f^*). \] When minimizing the empirical risk, we are solving for the entries of \(\theta\) using a system of \(N\) equations. When the number of parameters exceeds \(N\), we are in the over-parameterized regime, where there are several possible solutions. In addition, we cannot forget that we are minimizing the empirical risk rather than the true risk, so we are at risk (hah!) of non-generalizable solutions!

To resolve this, we might use regularization, where we penalize “complex” solutions. To do so, we introduce a hyper-parameter \(\delta\) to control the “capacity” of a model. Then, given some measure \(\gamma(\theta)\) that serves as a heuristic for how “complex” the parameters are, regularization means bounding this complexity by our hyper-parameter. Symbolically, this looks like \(\gamma(\theta) \leq \delta\). The best true risk achievable within this regularized class is then \(\inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta})\).
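In practice, the hard constraint \(\gamma(\theta) \leq \delta\) is usually traded for a penalized (Lagrangian) form. As one possible instance, the sketch below uses ridge regression, where \(\gamma(\theta) = \lVert \theta \rVert_2^2\) and a penalty weight \(\lambda\) stands in for the hyper-parameter (a smaller \(\lambda\) roughly corresponds to a larger \(\delta\)); the data sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy over-parameterized setting: more parameters (d) than samples (N)
N, d = 20, 100
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + 0.1 * rng.normal(size=N)

# gamma(theta) = ||theta||_2^2; the hard constraint gamma(theta) <= delta is replaced
# by a penalty lam * gamma(theta), which gives the ridge solution in closed form
lam = 1.0
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# a larger lam (i.e. a tighter constraint, smaller delta) shrinks the norm of the solution
print(np.linalg.norm(theta_ridge))
```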

We still care about excess risk. In fact, we can study it further by decomposing it. If we add and subtract this constrained infimum (effectively doing nothing), then do some regrouping, we can glean something new.

\[ \hat{\mathcal{R}}(f_{\hat{\theta}}) - \mathcal{R}(f^*) = \left( \hat{\mathcal{R}}(f_{\hat{\theta}}) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) \right) + \left( \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) - \mathcal{R}(f^*) \right) \] Before we study this, let’s take this decomposition just a little further. So far we have been operating under the slightly unrealistic assumption that the \(f_{\hat{\theta}}\) we found is the best one we could have arrived at. Practically speaking, we will have some optimization error. To include this error, we first concede that our optimization algorithm arrives at a more sub-par solution \(\hat{f}\), instead of the earlier minimizer \(f_{\hat{\theta}}\). Then, we can define the optimization error as \(\left( \hat{\mathcal{R}}(\hat{f}) - \hat{\mathcal{R}}(f_{\hat{\theta}}) \right)\).
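To see the optimization error as a number rather than a symbol, here is a hedged continuation of the earlier toy setup: the exact least-squares solution plays \(f_{\hat{\theta}}\), and a deliberately short gradient-descent run plays \(\hat{f}\). The step count and learning rate are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 50
x = rng.normal(size=N)
y = 2.0 * x + 0.1 * rng.normal(size=N)

def empirical_risk(theta):
    return np.mean((theta * x - y) ** 2)

# exact empirical risk minimizer of this 1-D least-squares problem (plays f_theta_hat)
theta_exact = np.sum(x * y) / np.sum(x * x)

# gradient descent stopped deliberately early (plays the sub-par f_hat)
theta = 0.0
for _ in range(5):
    theta -= 0.1 * np.mean(2.0 * (theta * x - y) * x)

optimization_error = empirical_risk(theta) - empirical_risk(theta_exact)
print(optimization_error)  # non-negative by construction: f_hat cannot beat the exact minimizer
```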

Since we made that first concession, we also have to revise the definition of excess risk. We replace the original \(\hat{\mathcal{R}}(f_{\hat{\theta}}) - \mathcal{R}(f^*)\) with the slightly more realistic \(\hat{\mathcal{R}}(\hat{f}) - \mathcal{R}(f^*)\). Finally, let’s introduce the optimization error to the decomposition: \[ \begin{aligned} \hat{\mathcal{R}}(\hat{f}) - \mathcal{R}(f^{*}) &= \left( \hat{\mathcal{R}}(\hat{f}) - \hat{\mathcal{R}}(f_{\hat{\theta}}) \right) \\ & + \left( \hat{\mathcal{R}}(f_{\hat{\theta}}) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) \right) \\ & + \left( \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) - \mathcal{R}(f^*) \right). \end{aligned} \] This is (a variation of) the risk decomposition! On the right hand side of the equation,

* the first term describes the optimization error, which is a penalty due to poor optimization, because of some combination of computational constraints, bad hyper-parameters, and a poor algorithm;
* the second term describes an estimation error, which is a penalty of having finite data. If we had more data, we could probably reduce this term;
* the third term describes an approximation error, which is the penalty of architectural selection and also accounts for regularization.

In this breakdown, learning theory talks about a trade-off between the second and third terms, namely the “approximation/estimation trade-off” (A/E trade-off). If we choose a very limited-capacity model (a simple model), we might have a high approximation error but a low estimation error. Informally, it would not take a lot of data to get to the best version of that simple model, but that simple model is likely to be far from the best possible model. In that case, we would be under-fitting our data. On the other hand, if we pick a very rich model class, then we would probably need a lot more data to train that model well, but that model class might be complex enough to get close to the ideal solution. Here we risk over-fitting our data.

There is another popular trade-off, the “bias-variance trade-off” (B/V trade-off), which gets quite a bit of airtime. That trade-off comes from a different decomposition of risk. One key point of difference is that while the A/E trade-off assumes a single dataset in its analysis, the B/V trade-off operates on an expectation over all possible datasets. Let’s introduce this change: \[ \begin{aligned} \mathbb{E} \left[ \hat{\mathcal{R}}(\hat{f}) - \mathcal{R}(f^{*}) \right] = & \mathbb{E} \left[ \hat{\mathcal{R}}(\hat{f}) - \hat{\mathcal{R}}(f_{\hat{\theta}}) \right] \\ & + \mathbb{E} \left[ \hat{\mathcal{R}}(f_{\hat{\theta}}) \right] - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) \\ & + \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) - \mathcal{R}(f^*). \end{aligned} \] (The infimum terms are deterministic, so the expectation leaves them unchanged.) An additional change is that the B/V trade-off concerns itself with the risk of the expected model, \(\hat{\mathcal{R}}(\mathbb{E}[\hat{f}])\), which we sneak into the middle of the estimation error term:

\[ \begin{aligned} \mathbb{E} \left[ \hat{\mathcal{R}}(\hat{f}) - \mathcal{R}(f^{*}) \right] = & \mathbb{E} \left[ \hat{\mathcal{R}}(\hat{f}) - \hat{\mathcal{R}}(f_{\hat{\theta}}) \right] \\ & + \mathbb{E} \left[ \hat{\mathcal{R}}(f_{\hat{\theta}}) \right] - \hat{\mathcal{R}}(\mathbb{E}[\hat{f}]) \\ & + \hat{\mathcal{R}}(\mathbb{E}[\hat{f}]) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) \\ & + \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta}) - \mathcal{R}(f^*). \end{aligned} \]

We now have a new breakdown of terms; namely, the estimation error gets split into an estimation variance term, \(\mathbb{E} \left[ \hat{\mathcal{R}}(f_{\hat{\theta}}) \right] - \hat{\mathcal{R}}(\mathbb{E}[\hat{f}])\), and an estimation bias term, \(\hat{\mathcal{R}}(\mathbb{E}[\hat{f}]) - \inf_{\gamma(\theta) \leq \delta} \mathcal{R}(f_{\theta})\). Combining the expected optimization error with the estimation variance gives us the variance of a model.
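The sketch below is one way to make this expectation-over-datasets picture concrete, under assumptions that are entirely illustrative: the target is \(\sin(x)\) (standing in for \(f^*\)), each resampled dataset yields one \(\hat{f}\) via a polynomial least-squares fit, and averaging predictions across datasets stands in for \(\mathbb{E}[\hat{f}]\). A low-degree fit should show higher bias and lower variance than a high-degree fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n=30):
    # each call plays the role of one draw of the training set; y = sin(x) + noise
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + 0.3 * rng.normal(size=n)
    return x, y

# grid on which we compare predictions; sin(x) stands in for the Bayes predictor f*
x_grid = np.linspace(-3, 3, 200)
bayes = np.sin(x_grid)

for degree in [1, 10]:  # a simple class vs. a rich class
    preds = []
    for _ in range(200):
        x, y = sample_dataset()
        coefs = np.polyfit(x, y, degree)        # one f_hat, fit on this particular dataset
        preds.append(np.polyval(coefs, x_grid))
    preds = np.stack(preds)

    avg_model = preds.mean(axis=0)                  # stand-in for E[f_hat]
    variance = ((preds - avg_model) ** 2).mean()    # spread of f_hat across datasets
    bias_sq = ((avg_model - bayes) ** 2).mean()     # distance of the average model from f*
    print(f"degree={degree:2d}  bias^2={bias_sq:.3f}  variance={variance:.3f}")
```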
