In this set of notes, we derive the bias and variance for linear regression models, including linear basis function models.
For data \((x_i, y_i), i=1,\ldots,n\), consider the linear basis function model:
\[\hat{y} = f(x,{w}) = \phi(x)^T {w} = w_1 \phi_1(x) + \ldots +w_p \phi_p(x)\]
The least squares fit is
\[\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y\]
where
\[ \Phi = \begin{bmatrix} \phi_1 ({x_1}) & \cdots & \phi_p ({x_1}) \\ \vdots & \ddots & \vdots \\ \phi_1 ({x_n}) & \cdots & \phi_p ({x_n}) \end{bmatrix} \]
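As a concrete sketch, here is what this setup might look like in numpy, assuming a polynomial basis \(\phi_j(x) = x^{j-1}\); the basis choice, data, and variable names below are illustrative, not part of the notes.

```python
import numpy as np

def design_matrix(x, p):
    """Build Phi with polynomial basis functions phi_j(x) = x**(j-1), j = 1..p."""
    return np.vander(x, N=p, increasing=True)   # shape (n, p)

rng = np.random.default_rng(0)
n, p = 50, 4
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(n)   # noisy targets

Phi = design_matrix(x, p)
# Least squares fit w_hat = (Phi^T Phi)^{-1} Phi^T y; lstsq computes it stably.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w_hat                                # fitted values
```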
Assume the true function is \(t(x)\) such that
\[y=t(x) + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon^2)\]
When there is no under-modeling,
\[t(x) = f(x,w_t) = \phi(x)^T w_t\]
where \(w_t\) is the true parameter vector.
For \(\Phi \in R^{n\times p}\), there is a unique solution to the ordinary least squares estimate
\[\hat{w} = (\Phi^T \Phi)^{-1} \Phi^T y\]
only if \(\text{Rank}(\Phi) = p\), i.e. the columns of \(\Phi\) are linearly independent. This requires \(n \geq p\).
In other words, the unique solution exists only if the number of data samples for training (\(n\)) is greater than or equal to the number of parameters \(p\).
This limits the model complexity you can use (greater \(p\) \(\implies\) greater model complexity).
For the rest of these notes, we will assume a unique least squares solution (\(n \geq p\)).
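As a quick numerical illustration of this condition (using the same illustrative polynomial basis as above), when \(n < p\) the design matrix cannot have rank \(p\), so \(\Phi^T \Phi\) is singular and the least squares solution is not unique:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6

for n in (4, 20):  # first fewer samples than parameters, then more
    x = rng.uniform(-1, 1, size=n)
    Phi = np.vander(x, N=p, increasing=True)
    rank = np.linalg.matrix_rank(Phi)
    # A unique least squares solution requires Rank(Phi) = p, which needs n >= p.
    print(f"n={n}, p={p}, rank(Phi)={rank}, unique solution: {rank == p}")
```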
Let us give a definition of bias on a test point, \((x_t, y_t)\) for a function \(f\) with parameter estimate \(\hat{w}\):
\[\text{Bias}(x_t) := t( x_t ) - E[f(x_t, \hat{w})] \]
We will try to derive the bias for a linear regression when the true function is in the assumed model class, i.e. there is no under-modeling.
Suppose that there is no under-modeling, so there is a parameter vector \(w_t\) such that
\[t(x) = f(x,w_t) = \phi(x)^T w_t\]
Then for each training sample \(i=1,\ldots,n\),
\[y_i = \phi(x_i)^T w_t + \epsilon_i\]
and for the entire training set, \(y=\Phi w_t + \epsilon\).
For a fixed training set, the least squares parameter estimate will be
\[ \begin{aligned} \hat{w} &= (\Phi^T \Phi)^{-1} \Phi^T y \\ &= (\Phi^T \Phi)^{-1} \Phi^T (\Phi w_t + \epsilon) \\ &=w_t+ (\Phi^T \Phi)^{-1} \Phi^T \epsilon \end{aligned} \]
Now we can find \(E[\hat{w}]\) over the samples of noisy training data: since \(E[\epsilon] = 0\), we have \(E[\hat{w}] =w_t\).
Informally, we can say that on average, the parameter estimate matches the “true” parameter.
Then, since \(f\) is linear in the parameters, \(E[f(x_t, \hat{w})] = \phi(x_t)^T E[\hat{w}] = \phi(x_t)^T w_t = t(x_t)\).
Conclusion: We can see that when the model is linear and there is no under-modeling, there is no bias:
\[\text{Bias}(x_t) = 0\]
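This conclusion can be checked with a small Monte Carlo simulation; the sketch below uses an arbitrary fixed \(\Phi\) and an illustrative \(w_t\) (not from the notes), averages \(\hat{w}\) over many noise draws, and recovers \(w_t\), so the average prediction matches \(t(x_t)\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma_eps = 100, 3, 0.5

x = rng.uniform(-1, 1, size=n)
Phi = np.vander(x, N=p, increasing=True)      # fixed design matrix
w_t = np.array([1.0, -2.0, 0.5])              # illustrative "true" parameters

w_hats = []
for _ in range(5000):                         # repeated noisy training sets
    eps = sigma_eps * rng.standard_normal(n)
    y = Phi @ w_t + eps
    w_hats.append(np.linalg.lstsq(Phi, y, rcond=None)[0])

print(np.mean(w_hats, axis=0))                # close to w_t, so Bias(x_t) ~ 0
```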
Before we look at the variance, we will review some terminology for random vectors. For a random vector \(x \in R^d\) with mean \(u = E[x]\), the variance (covariance) matrix is
\[Var(x) := E[(x - {u})(x - {u})^T ] = \begin{bmatrix} Cov (x_1, x_1) & \cdots & Cov (x_1, x_d) \\ \vdots & \ddots & \vdots \\ Cov (x_d, x_1) & \cdots & Cov (x_d, x_d) \\ \end{bmatrix} \]
Now let us give a definition of variance at a test point, \((x_t, y_t)\), for a function \(f\) with parameter estimate \(\hat{w}\):
\[\text{Var}(x_t) := E[ (f(x_t, \hat{w}) - E[f(x_t, \hat{w})] ) ^2 ]\]
We will try to derive this variance in three steps: first, we will find the variance of the parameter estimate \(\hat{w}\). Then, we will describe the variance of the model output \(f(x_t, \hat{w})\) for a fixed \(x_t\). Finally, we will find the expected variance over the distribution of \(x_t\).
Recall that the \(\epsilon_i\) are independent across samples, with \(E[\epsilon_i] = 0\) and \(Var(\epsilon_i) = \sigma_\epsilon^2\).
Then,
\[ Cov(\epsilon_i, \epsilon_j) = \begin{cases} 0, \quad i \neq j \\ \sigma_\epsilon^2, \quad i=j \\ \end{cases} \]
so the variance matrix for the \(\epsilon\) noise is
\[Var(\epsilon) = \sigma_\epsilon^2 I\]
(\(I\) is the identity matrix). Also recall from our discussion of bias that with no under-modeling,
\[\hat{w} = w_t + (\Phi^T \Phi)^{-1} \Phi^T \epsilon\]
Let us think of \(\hat{w}\) as a linear transform of the noise, \(\hat{w} = A \epsilon + b\), with \(A = (\Phi^T \Phi)^{-1} \Phi^T\) and \(b = w_t\), and recall that for a linear transform \(y = Ax + b\), \(Var(y) = A\, Var(x)\, A^T\).
Then we can compute the variance matrix of the parameter estimate for the linear model as
\[ \begin{aligned} Var(\hat{w}) &= [(\Phi^T \Phi)^{-1} \Phi^T] [Var(\epsilon)] [(\Phi^T \Phi)^{-1} \Phi^T]^T \\ &= [(\Phi^T \Phi)^{-1} \Phi^T] [\sigma_\epsilon^2 I] [(\Phi^T \Phi)^{-1} \Phi^T]^T \\ &= [(\Phi^T \Phi)^{-1} \Phi^T] [\sigma_\epsilon^2 I] [\Phi (\Phi^T \Phi)^{-1}] \\ & = \sigma_\epsilon^2 (\Phi^T \Phi)^{-1} \end{aligned} \]
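This formula can be sanity-checked numerically; the sketch below (again with illustrative values, not from the notes) compares the empirical covariance of \(\hat{w}\) over many noise draws with \(\sigma_\epsilon^2 (\Phi^T \Phi)^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma_eps = 100, 3, 0.5

x = rng.uniform(-1, 1, size=n)
Phi = np.vander(x, N=p, increasing=True)
w_t = np.array([1.0, -2.0, 0.5])

# Refit on many noisy training sets drawn around the same true function.
w_hats = np.array([
    np.linalg.lstsq(Phi, Phi @ w_t + sigma_eps * rng.standard_normal(n), rcond=None)[0]
    for _ in range(20000)
])

emp_cov = np.cov(w_hats, rowvar=False)                  # empirical Var(w_hat)
theory_cov = sigma_eps**2 * np.linalg.inv(Phi.T @ Phi)  # sigma_eps^2 (Phi^T Phi)^{-1}
print(np.max(np.abs(emp_cov - theory_cov)))             # should be small
```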
Now, we will use \(Var(\hat{w})\) to compute \(Var(x_t)\) for the linear model.
First, recall from our discussion of bias that when there is no under-modeling
\[E[f(x_{t}, \hat{w})] = \phi(x_{t})^T E[\hat{w}] = \phi(x_{t})^T w_t \]
Then the variance of the linear model output for a test point is
\[ \begin{aligned} Var(x_{t}) & = E \left[ \left( f(x_{t}, \hat{w}) - E[f(x_{t}, \hat{w})] \right)^2 \right] \\ &= E \left[ \left( \phi(x_{t})^T \hat{w} - \phi(x_{t})^T w_t \right)^2 \right] \\ &= E \left[ \left( \phi(x_{t})^T (\hat{w} - w_t) \right)^2 \right] \\ \end{aligned} \]
Also note the following trick: if \({a}\) is a non-random vector and \({z}\) is a random vector, then
\[E[({a}^T {z})^2] = E[{a}^T {zz}^T {a}] = {a}^T E[{zz}^T]{a}\]
Therefore,
\[ \begin{aligned} Var(x_{t}) & = E \left[ \left( \phi(x_{t})^T (\hat{w} - w_t) \right)^2 \right] \\ & = \phi(x_{t})^T E [(\hat{w} -w_t)(\hat{w} -w_t)^T] \phi(x_{t}) \\ \end{aligned} \]
Finally, recall that
\[Var(\hat{w}) = E [(\hat{w} -w_t)(\hat{w} -w_t)^T] = \sigma_\epsilon^2 (\Phi^T \Phi)^{-1}\]
so
\[ \begin{aligned} Var(x_{t}) & = \phi(x_{t})^T E [(\hat{w} -w_t)(\hat{w} -w_t)^T] \phi(x_{t}) \\ & = \sigma_\epsilon^2 \phi(x_{t})^T (\Phi^T \Phi)^{-1} \phi(x_{t}) \\ \end{aligned} \]
This derivation assumed there is no under-modeling. However, in the case of under-modeling, the variance expression is similar.
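A similar simulation (illustrative values) compares the empirical variance of the prediction at a fixed test point with \(\sigma_\epsilon^2 \phi(x_t)^T (\Phi^T \Phi)^{-1} \phi(x_t)\).

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma_eps = 100, 3, 0.5

x = rng.uniform(-1, 1, size=n)
Phi = np.vander(x, N=p, increasing=True)
w_t = np.array([1.0, -2.0, 0.5])

x_t = 0.3
phi_t = np.array([x_t**j for j in range(p)])   # phi(x_t) for this polynomial basis

# Prediction at x_t over many noisy refits.
preds = np.array([
    phi_t @ np.linalg.lstsq(Phi, Phi @ w_t + sigma_eps * rng.standard_normal(n), rcond=None)[0]
    for _ in range(20000)
])

emp_var = preds.var()
theory_var = sigma_eps**2 * phi_t @ np.linalg.inv(Phi.T @ Phi) @ phi_t
print(emp_var, theory_var)                     # the two should agree closely
```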
For the next part, we will compute the variance term of the in-sample prediction error, i.e. the error when the test point is drawn at random from the training data.
Since each row of \(\Phi\) is the vector \(\phi(x_i)^T\) for sample \(i\),
\[\Phi^T \Phi = \sum_{i=1}^n \phi(x_i) \phi(x_i)^T\]
We will use a trick: for random vectors \({u}, {v}\), \(E[{u}^T{v}] = Tr( E[{v} {u}^T])\), where \(Tr(X)\) denotes the trace of \(X\), the sum of its diagonal entries.
Then the expectation (over the test points) of the variance of the model output is:
\[ \begin{aligned} E [Var(x_t)] & = \sigma_\epsilon^2 E [\phi(x_t)^T (\Phi^T \Phi)^{-1} \phi(x_t)] \\ & = \sigma_\epsilon ^2 Tr\left( E [\phi(x_t) \phi(x_t)^T] (\Phi^T \Phi)^{-1}\right) \\ & = \frac{\sigma_\epsilon ^2}{n} Tr\left(\sum_{i=1}^n [\phi(x_{i}) \phi(x_{i})^T] (\Phi^T \Phi)^{-1}\right) \\ & = \frac{\sigma_\epsilon ^2}{n} Tr\left( (\Phi^T \Phi)(\Phi^T \Phi)^{-1}\right) \\ & = \frac{\sigma_\epsilon ^2}{n} Tr\left( I_p \right) \\ & = \frac{\sigma_\epsilon ^2 p}{n} \end{aligned} \]
The average variance increases with the number of parameters \(p\), and decreases with the number of samples used for training \(n\), as long as the test point is distributed like the training data.
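This identity can be checked directly (no Monte Carlo needed) by averaging \(\sigma_\epsilon^2 \phi(x_i)^T (\Phi^T \Phi)^{-1} \phi(x_i)\) over the training points; the sketch below again uses an illustrative polynomial basis.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma_eps = 80, 5, 0.5

x = rng.uniform(-1, 1, size=n)
Phi = np.vander(x, N=p, increasing=True)

G_inv = np.linalg.inv(Phi.T @ Phi)
# Var(x_i) = sigma_eps^2 * phi(x_i)^T (Phi^T Phi)^{-1} phi(x_i), averaged over training points.
per_point_var = sigma_eps**2 * np.einsum('ij,jk,ik->i', Phi, G_inv, Phi)
print(per_point_var.mean(), sigma_eps**2 * p / n)   # both equal sigma_eps^2 * p / n
```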
Suppose the model class is linear with \(n\) samples and \(p\) parameters.
When \(n < p\), the least squares estimate of the coefficients is not unique.
When \(n \geq p\), the least squares estimate of the coefficients is unique, and there is no under-modeling, then the estimate of the target variable is unbiased.
When, in addition, the test point is drawn from the same distribution as the training data, then the variance of the estimate of the target variable grows linearly with the number of parameters and is inversely proportional to the number of samples used for training:
\[ Var = \frac{p}{n} \sigma_\epsilon^2 \]
The overall expected in-sample prediction error for the ordinary least squares linear regression is
\[0 + \frac{p}{n} \sigma_\epsilon^2 + \sigma_\epsilon^2\]
where the three terms represent the squared bias, the variance, and the irreducible error.
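As a final sanity check, this total can be estimated by Monte Carlo: refit on noisy training sets and evaluate against fresh noisy targets at the same inputs. The sketch below (illustrative values, not from the notes) should give a result close to \((1 + p/n)\sigma_\epsilon^2\).

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma_eps = 60, 4, 0.5

x = rng.uniform(-1, 1, size=n)
Phi = np.vander(x, N=p, increasing=True)
w_t = rng.standard_normal(p)
t = Phi @ w_t                                          # noiseless targets t(x_i)

errs = []
for _ in range(5000):
    y_train = t + sigma_eps * rng.standard_normal(n)   # noisy training targets
    w_hat = np.linalg.lstsq(Phi, y_train, rcond=None)[0]
    y_test = t + sigma_eps * rng.standard_normal(n)    # fresh noise at the same inputs
    errs.append(np.mean((y_test - Phi @ w_hat) ** 2))

print(np.mean(errs), (1 + p / n) * sigma_eps**2)       # squared bias + variance + noise
```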