Week 4¶
Ridge Regression¶
Motivation¶
For polynomial regression, if the degree \(d\) is large, predictions can be highly variable. The idea here is to "dampen" the coefficients \(\hat{\beta}_i\) in order to constrain the higher-order terms in the prediction function.
Ridge Regression¶
Suppose we have a \(C>0\) (we assume a proper \(C\) is given by an oracle), and we will constrain \(\boldsymbol{\beta}\) (here \(\boldsymbol{\beta}\) starts from \(\beta_1\) rather than \(\beta_0\)(1)) so that
\[
\|\boldsymbol{\beta}\|^2 = \sum_{i=1}^{p}\beta_i^2 \le C.
\]
- In ridge regression, we do not penalize \(\beta_0\); only the other coefficients \(\beta_1,\dots,\beta_p\) are penalized.
The ridge regression estimator \(\hat{\boldsymbol{\beta}}_{\rm Ridge}\) is defined as
\[
\hat{\boldsymbol{\beta}}_{\rm Ridge} = \mathop{\arg\min}_{\beta_0,\;\|\boldsymbol{\beta}\|^2\le C}\ \sum_{i=1}^{n}\left(y^{(i)} - \beta_0 - \left(x^{(i)}\right)^T\boldsymbol{\beta}\right)^2.
\]
Using Lagrange multipliers, we can convert the above minimization problem into an equivalent penalized problem:
\[
\hat{\boldsymbol{\beta}}_{\rm Ridge} = \mathop{\arg\min}_{\beta_0,\,\boldsymbol{\beta}}\ \sum_{i=1}^{n}\left(y^{(i)} - \beta_0 - \left(x^{(i)}\right)^T\boldsymbol{\beta}\right)^2 + \lambda\|\boldsymbol{\beta}\|^2,
\]
where the constant \(\lambda>0\) is dependent on \(C\).
Ridge is an example of ERM plus a regularization term (or penalty) \(\lambda \|\boldsymbol{\beta}\|^2\). The general formulation of ERM + regularization is
\[
\hat{f} = \mathop{\arg\min}_{f\in\mathcal{F}}\ \frac{1}{n}\sum_{i=1}^{n} L\left(f\left(x^{(i)}\right), y^{(i)}\right) + \lambda\operatorname{Penalty}(f),
\]
where \(\operatorname{Penalty}(f)\) penalizes more "complex" \(f\) more heavily.
Compute the Ridge Estimator¶
To find \(\hat{\boldsymbol{\beta}}_{\rm Ridge}\), we set the gradient of the objective to zero:
\[
\nabla_{\boldsymbol{\beta}}\left(\|\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\|^2 + \lambda\|\boldsymbol{\beta}\|^2\right) = -2\boldsymbol{X}^T\left(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}\right) + 2\lambda\boldsymbol{\beta} = \boldsymbol{0}.
\]
Then we have
\[
\hat{\boldsymbol{\beta}}_{\rm Ridge} = \left(\boldsymbol{X}^T\boldsymbol{X}+\lambda I\right)^{-1}\boldsymbol{X}^T\boldsymbol{Y},
\]
where the matrix \(\boldsymbol{X}^T\boldsymbol{X}+\lambda I\) is always invertible, unlike the matrix \(\boldsymbol{X}^T\boldsymbol{X}\) in OLS.
If we assume \(\boldsymbol{X}\) has the SVD \(\boldsymbol{X} = UDV^T\), then we have
\[
\hat{\boldsymbol{\beta}}_{\rm Ridge} = V\left(D^2+\lambda I\right)^{-1}D\,U^T\boldsymbol{Y},
\]
where \((D^2+\lambda I)^{-1} D\) is the diagonal matrix \(\operatorname{diag}\left\{\frac{d_1}{d_1^2+\lambda},\dots,\frac{d_p}{d_p^2+\lambda}\right\}\), while OLS has \(\hat{\boldsymbol{\beta}}_{\rm OLS} = VD^{-1}U^T\boldsymbol{Y}\). Therefore, the coefficients in \(\hat{\boldsymbol{\beta}}_{\rm Ridge}\) are usually smaller in magnitude than those in \(\hat{\boldsymbol{\beta}}_{\rm OLS}\).
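As a sanity check, here is a minimal numpy sketch (not from the notes) that computes the ridge estimator both from the normal-equations form and from the SVD form on synthetic data, and compares the coefficient magnitudes with OLS; the data-generating choices are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))                        # synthetic design (no intercept column)
Y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=n)
lam = 2.0                                          # penalty parameter lambda

# Normal-equations form: (X^T X + lambda I)^{-1} X^T Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Equivalent SVD form: V (D^2 + lambda I)^{-1} D U^T Y
U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_ridge_svd = Vt.T @ (d / (d**2 + lam) * (U.T @ Y))

# OLS for comparison: ridge coefficients are shrunk toward zero
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.allclose(beta_ridge, beta_ridge_svd))                 # True
print(np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols))   # True (shrinkage)
```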
Centering and Scaling¶
Typically, we center our design matrix and response vector:
\[
y^{*(i)} = y^{(i)} - \bar{y}, \qquad i=1,\dots,n,
\]
where \(\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y^{(i)}\) is the average of the responses, and
\[
\tilde{x}_{ji} = x_{ji} - \bar{x}_i, \qquad j=1,\dots,n,\ i=1,\dots,p,
\]
where \(\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n}x_{ji}\) is the average of the \(i\)-th column of \(\boldsymbol{X}\)(1). We center because we don't want to penalize \(\beta_0\).
- Note that there is no column of 1's added in \(\boldsymbol{X}\).
Moreover, we also want to scale the centered design \(\tilde{\boldsymbol{X}}\):
\[
x^{*}_{ji} = \frac{\tilde{x}_{ji}}{s_i}, \qquad s_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\tilde{x}_{ji}^{2}}.
\]
We scale so all of the features have the same magnitude.
After centering and scaling, we run ridge regression using the response \(\boldsymbol{Y}^*\) and the design \(\boldsymbol{X}^*\). When new data points come in, we center and scale them using the quantities \(\bar{y}\), \(\bar{x}_i\), and \(s_i = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\tilde{x}_{ji}^{2}}\) computed from the training data.
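A minimal sketch of this centering-and-scaling workflow, assuming the scale factor \(s_i\) is the root-mean-square of each centered column; the helper names (`center_scale_fit`, `center_scale_apply`, `ridge`) are hypothetical, not from the notes.

```python
import numpy as np

def center_scale_fit(X, Y):
    """Compute centering and scaling statistics from the training data only."""
    x_bar = X.mean(axis=0)                            # column means  \bar{x}_i
    y_bar = Y.mean()                                  # response mean \bar{y}
    scale = np.sqrt(((X - x_bar) ** 2).mean(axis=0))  # assumed scale: RMS of centered columns
    return x_bar, y_bar, scale

def center_scale_apply(X, Y, x_bar, y_bar, scale):
    """Apply the *training* statistics to any data (training set or new points)."""
    return (X - x_bar) / scale, Y - y_bar

def ridge(X_star, Y_star, lam):
    """Ridge on the transformed data; no intercept is needed after centering."""
    p = X_star.shape[1]
    return np.linalg.solve(X_star.T @ X_star + lam * np.eye(p), X_star.T @ Y_star)

# Prediction at a new point x: y_hat = y_bar + ((x - x_bar) / scale) @ beta
```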
LASSO (Least Absolute Shrinkage and Selection Operator)¶
LASSO is another method of ERM + regularization, and it is also given by two formulations (written here without the intercept \(\beta_0\), e.g. after centering and scaling as above):
Constrained Formulation
\[
\hat{\boldsymbol{\beta}}_{\rm LASSO} = \mathop{\arg\min}_{\boldsymbol{\beta}:\,\|\boldsymbol{\beta}\|_1\le C}\ \sum_{i=1}^{n}\left(y^{(i)} - \left(x^{(i)}\right)^T\boldsymbol{\beta}\right)^2.
\]
Penalized Formulation
\[
\hat{\boldsymbol{\beta}}_{\rm LASSO} = \mathop{\arg\min}_{\boldsymbol{\beta}}\ \sum_{i=1}^{n}\left(y^{(i)} - \left(x^{(i)}\right)^T\boldsymbol{\beta}\right)^2 + \lambda\|\boldsymbol{\beta}\|_1,
\]
where \(\|\boldsymbol{\beta}\|_1 = \sum_{i=1}^p |\beta_i|\).
For \(p=2\) and \(C=1\), the constraint sets of ridge and LASSO are the \(L^2\) and \(L^1\) balls, respectively, and the contours of the corresponding objective functions are shown in the figure below.
The \(L^1\) ball usually induces sparsity (the occurrence of zeros(1)) in the LASSO coefficient estimate, and thus LASSO performs variable selection.
- In practice, the contours of the objective function usually intersect the \(L^1\) ball at its vertices, where some coordinates are exactly zero.
LASSO with \(p=1\)¶
For \(p=1\), we find the LASSO estimate
\[
\hat{\beta}_{\rm LASSO} = \mathop{\arg\min}_{\beta}\ \sum_{i=1}^{n}\left(\tilde{y}^{(i)} - \beta\,\tilde{x}^{(i)}\right)^2 + \lambda|\beta|,
\]
where \(\tilde{y}^{(i)} = y^{(i)}-\bar{y}\) and \(\tilde{x}^{(i)} = x^{(i)} - \bar{x}\).
For \(\beta>0\), the objective becomes
\[
\sum_{i=1}^{n}\left(\tilde{y}^{(i)} - \beta\,\tilde{x}^{(i)}\right)^2 + \lambda\beta = c\beta^2 - (2d-\lambda)\beta + \sum_{i=1}^{n}\left(\tilde{y}^{(i)}\right)^2,
\]
where \(c = \sum_{i=1}^{n}(\tilde{x}^{(i)})^2\) and \(d = \sum_{i=1}^{n}\tilde{x}^{(i)}\tilde{y}^{(i)}\). Minimizing this quadratic over \(\beta>0\), we have
\[
\hat{\beta} = \frac{2d-\lambda}{2c} = \frac{d}{c} - \frac{\lambda}{2c}, \qquad \text{provided } \frac{d}{c} > \frac{\lambda}{2c}.
\]
Similarly, for \(\beta<0\), we have
\[
\hat{\beta} = \frac{2d+\lambda}{2c} = \frac{d}{c} + \frac{\lambda}{2c}, \qquad \text{provided } \frac{d}{c} < -\frac{\lambda}{2c}.
\]
Therefore, the LASSO estimator for \(p=1\) is given by
\[
\hat{\beta}_{\rm LASSO} =
\begin{cases}
\dfrac{d}{c} - \dfrac{\lambda}{2c}, & \dfrac{d}{c} > \dfrac{\lambda}{2c},\\[2mm]
0, & \left|\dfrac{d}{c}\right| \le \dfrac{\lambda}{2c},\\[2mm]
\dfrac{d}{c} + \dfrac{\lambda}{2c}, & \dfrac{d}{c} < -\dfrac{\lambda}{2c}.
\end{cases}
\]
The following figure illustrates the difference between the OLS estimator and the LASSO estimator: \(\hat{\beta}_{\rm LASSO}\) is a soft threshold of the OLS estimate \(\frac{d}{c}\) at \(\pm\frac{\lambda}{2c}\)(1).
- By contrast, a hard threshold would show "jumps" in the figure, which indicates that the thresholded function is not continuous.
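A small sketch of the \(p=1\) LASSO estimator as a soft threshold of the OLS slope \(d/c\) at \(\lambda/(2c)\), following the formula above; the function name `lasso_p1` and the synthetic data are made up for illustration.

```python
import numpy as np

def lasso_p1(x, y, lam):
    """p = 1 LASSO: soft-threshold the OLS slope d/c at lambda / (2c)."""
    x_t = x - x.mean()                    # \tilde{x}^{(i)}
    y_t = y - y.mean()                    # \tilde{y}^{(i)}
    c = np.sum(x_t ** 2)
    d = np.sum(x_t * y_t)
    ols = d / c                           # OLS slope on the centered data
    return np.sign(ols) * max(abs(ols) - lam / (2 * c), 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.3 * x + rng.normal(scale=0.1, size=100)
# lambda = 0 recovers the OLS slope; a large lambda sets the estimate to exactly 0
print(lasso_p1(x, y, lam=0.0), lasso_p1(x, y, lam=200.0))
```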
For more features, we may use a coefficient plot with respect to \(\lambda\), i.e. a \(\beta_i\)-versus-\(\lambda\) plot(1). As \(\lambda\) grows, the later a coefficient \(\hat{\beta}_i\) becomes 0, the more important the corresponding feature is likely to be, since it remains in the model even under a large penalty.
- For \(p=1\), \(\hat{\beta}\) is linear in \(\lambda\) until it reaches zero. For \(p>1\), we may choose a proper transformation of \(\lambda\) (using quantities like \(d\) and \(c\)) to make the coefficient plot piecewise linear, as in the figure below.
Group LASSO¶
Suppose our regression coefficient vector \(\boldsymbol{\beta}\) is partitioned into groups as
\[
\boldsymbol{\beta} = \left(\beta^{(1)},\dots,\beta^{(m)}\right),
\]
where \(\beta^{(i)}\in \mathbb{R}^{p_i}\). We may want to set entire groups of coefficients \(\beta^{(i)}\) to zero. Then we introduce the penalty
\[
\operatorname{Penalty}(\boldsymbol{\beta}) = \sum_{i=1}^{m}\left\|\beta^{(i)}\right\|,
\]
where \(\|\beta^{(i)}\| = \sqrt{\sum_{j=1}^{p_i}(\beta^{(i)}_j)^2}\).
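A minimal numpy sketch of this penalty, assuming the groups are given as lists of coefficient indices; the unweighted form \(\sum_{i}\|\beta^{(i)}\|\) is used here (some variants weight each group, e.g. by \(\sqrt{p_i}\)).

```python
import numpy as np

def group_lasso_penalty(beta, groups, lam):
    """Sum of Euclidean norms of each coefficient group, scaled by lambda.

    `groups` is a list of index lists partitioning the coefficients,
    e.g. [[0, 1, 2], [3, 4]] for p_1 = 3 and p_2 = 2.
    """
    return lam * sum(np.linalg.norm(beta[np.asarray(g)]) for g in groups)

beta = np.array([0.0, 0.0, 0.0, 1.5, -2.0])
print(group_lasso_penalty(beta, [[0, 1, 2], [3, 4]], lam=1.0))  # first group contributes 0
```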
Data Splitting and Cross-Validation¶
Suppose we have two classes of functions (or models, or learning algorithms) \(\mathcal{F}_1\) and \(\mathcal{F}_2\). We may not want to use the empirical risk, computed on the whole data set, to compare these two classes. This is because we used that same data set to fit the models under the guidance of ERM, and therefore the model from the more complex class will achieve a lower empirical risk on the very same data set(1).
- An extreme example: if we fit the data set using a polynomial of very high degree, we will find a polynomial with zero empirical risk. If we judged purely by empirical risk, this would be THE BEST model, which is clearly not the case.
The solution is to split the data into a training set and a validation set:
\[
\mathcal{D} = \mathcal{D}_{\rm Train} \cup \mathcal{D}_{\rm Valid}, \qquad \mathcal{D}_{\rm Train} \cap \mathcal{D}_{\rm Valid} = \varnothing.
\]
More precisely, our objective is to estimate \(R(\hat{f}_i,P)\) for \(\hat{f}_i\in \mathcal{F}_i\) and find the class \(i\) that minimizes this estimate, which is achieved by the following steps (a toy sketch in code follows the list):
1. Fit \(\hat{f}_i\in \mathcal{F}_i\) via ERM (or another method) using only the data in \(\mathcal{D}_{\rm Train}\).
2. Compute \(\displaystyle \frac{1}{|\mathcal{D}_{\rm Valid}|} \sum_{\left(x^{(j)},y^{(j)}\right)\in \mathcal{D}_{\rm Valid}} L\left(\hat{f}_i\left(x^{(j)}\right),y^{(j)}\right).\)
3. Find the class \(i\) that minimizes the quantity in Step 2.
4. Usually, we then fit \(\hat{f}\) using the class chosen in Step 3 on the entire data set \(\mathcal{D}_{\rm Train}\cup \mathcal{D}_{\rm Valid}\).
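A toy numpy sketch of these four steps, comparing polynomial classes of different degrees; the synthetic data-generating process and the 80/20 split are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)      # toy data, assumed for illustration

idx = rng.permutation(len(x))                        # random 80/20 split
train, valid = idx[:160], idx[160:]

degrees = [1, 2, 3, 5, 10]                           # candidate classes F_1, ..., F_5
val_risk = {}
for deg in degrees:
    coef = np.polyfit(x[train], y[train], deg)       # Step 1: fit via ERM on D_Train only
    resid = y[valid] - np.polyval(coef, x[valid])
    val_risk[deg] = np.mean(resid ** 2)              # Step 2: empirical risk on D_Valid

best = min(val_risk, key=val_risk.get)               # Step 3: class with the smallest risk
final_coef = np.polyfit(x, y, best)                  # Step 4: refit on the whole data set
```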
If we consider the expected value of the validation empirical risk conditional on \(\mathcal{D}_{\rm Train}\) (in this case, \(\hat{f} = \hat{f}_{\rm Train}\) is regarded as fixed, since it is obtained from \(\mathcal{D}_{\rm Train}\) alone), we have
\[
\mathbb{E}\left[\frac{1}{|\mathcal{D}_{\rm Valid}|}\sum_{(x,y)\in\mathcal{D}_{\rm Valid}} L\left(\hat{f}_{\rm Train}(x),y\right)\,\middle|\,\mathcal{D}_{\rm Train}\right] = R\left(\hat{f}_{\rm Train},P\right),
\]
so the validation risk is an unbiased estimate of the true risk. In contrast, with the empirical risk on the whole data set (the same data used for fitting), we will underestimate the true risk.
Example of Ridge Regression¶
For each \(C\) (or equivalently each \(\lambda\)), ridge regression gives a class of functions, so a grid of values \(C_1,\dots,C_K\) yields classes \(\mathcal{F}_{C_1},\dots,\mathcal{F}_{C_K}\). These classes are nested: if \(C_i<C_j\), then \(\mathcal{F}_{C_i}\subset\mathcal{F}_{C_j}\); equivalently, if \(\lambda_i<\lambda_j\), then \(\mathcal{F}_{\lambda_i}\supset\mathcal{F}_{\lambda_j}\). To find the most reasonable \(\lambda\), we follow the steps below (a sketch in code follows the list):
1. Start off with a grid of \(\lambda\)'s, for instance \(\{\lambda_1 = 0.1, \lambda_2 = 1, \lambda_3 = 10, \lambda_4 = 100\}\).
2. For every \(\lambda_i\), find the ridge estimator \(\hat{\boldsymbol{\beta}}_{\lambda_i}\) using only the training data.
3. Compute \(\displaystyle \frac{1}{|\mathcal{D}_{\rm Valid}|} \sum_{\left(x^{(j)},y^{(j)}\right)\in \mathcal{D}_{\rm Valid}} \left(\left(x^{(j)}\right)^T\hat{\boldsymbol{\beta}}_{\lambda_i} - y^{(j)}\right)^2.\)
4. Choose the \(\lambda_i\) with the smallest value in Step 3.
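A minimal numpy sketch of this procedure, assuming the design is already centered and scaled so the closed-form ridge solution can be used directly; the synthetic data and the 80/20 split are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 120, 8
X = rng.normal(size=(n, p))                              # assume X is already centered and scaled
Y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)
X_tr, Y_tr, X_va, Y_va = X[:96], Y[:96], X[96:], Y[96:]  # 80/20 split

def ridge_fit(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

lambdas = [0.1, 1.0, 10.0, 100.0]                        # Step 1: grid of lambda values
val_err = {lam: np.mean((X_va @ ridge_fit(X_tr, Y_tr, lam) - Y_va) ** 2)
           for lam in lambdas}                           # Steps 2-3: fit on train, score on valid
best_lam = min(val_err, key=val_err.get)                 # Step 4: smallest validation error
print(best_lam, val_err[best_lam])
```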
Proportions to Split¶
Usually, we take 80% or 90% of the data set as the training set.
The larger \(\mathcal{D}_{\rm Valid}\) is, the lower the variance of \(\hat{R}(\hat{f},P)\). However, we don't want to make \(\mathcal{D}_{\rm Valid}\) too large, because then \(\mathcal{D}_{\rm Train}\) is small when fitting \(\hat{f}\), and \(\hat{f}_{\mathcal{D}_{\rm Train}}\) might be very different from \(\hat{f}_{\mathcal{D}_{\rm Train}\cup\mathcal{D}_{\rm Valid}}\), which means using \(\hat{f}_{\mathcal{D}_{\rm Train}}\) gives a biased estimate of the risk of the final model.
\(K\)-Fold Cross-Validation¶
Split the data into \(K\) groups of roughly the same size:
\[
\mathcal{D} = \mathcal{D}^{(1)}\cup\dots\cup\mathcal{D}^{(K)}.
\]
We estimate \(R(\hat{f},P)\) by the following steps:
1. For \(i=1,\dots,K\), fit a model using \(\mathcal{D}^{(1)}, \dots, \mathcal{D}^{(i-1)}, \mathcal{D}^{(i+1)}, \dots, \mathcal{D}^{(K)}\) to get \(\hat{f}^{(i)}\).
2. Compute \(\displaystyle CV^{(i)} = \frac{1}{|\mathcal{D}^{(i)}|} \sum_{\left(x,y\right)\in \mathcal{D}^{(i)}} L\left(\hat{f}^{(i)}(x),y\right).\)
3. Estimate \(R(\hat{f},P)\approx \frac{1}{K}\sum_{i=1}^K CV^{(i)} = CV\).
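A generic numpy sketch of these three steps, where the model-specific `fit`, `predict`, and `loss` callables are hypothetical placeholders to be supplied by the user.

```python
import numpy as np

def k_fold_cv(X, Y, fit, predict, loss, K=5, seed=0):
    """K-fold cross-validation estimate of R(f_hat, P) for user-supplied fit/predict/loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), K)   # K groups of roughly equal size
    cv_values = []
    for i in range(K):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        model = fit(X[train], Y[train])                  # Step 1: fit on the other K - 1 folds
        cv_values.append(np.mean(loss(predict(model, X[valid]), Y[valid])))  # Step 2: CV^{(i)}
    return np.mean(cv_values)                            # Step 3: CV = average of the CV^{(i)}

# Example usage (hypothetical helpers): ridge with lambda = 1 and squared-error loss
# cv = k_fold_cv(X, Y, fit=lambda X, Y: ridge_fit(X, Y, 1.0),
#                predict=lambda beta, X: X @ beta,
#                loss=lambda y_hat, y: (y_hat - y) ** 2, K=5)
```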
How many folds should we choose? Usually \(K\) is chosen between 5 and 10. To show that it may not always be beneficial to choose a large \(K\), consider the extreme case \(K=n\) (the size of the data set), which is called leave-one-out cross-validation (LOOCV)(1). LOOCV may not always give better estimates of \(R(\hat{f},P)\):
- Specifically, for linear regression we can calculate the LOOCV error of \(\hat{f}(x) = \hat{\boldsymbol{\beta}}^T x\) without refitting the model \(n\) times, since it has the closed-form expression \(\displaystyle CV = \frac{1}{n} \sum_{i=1}^n\left(\frac{y^{(i)} - \left(x^{(i)}\right)^T\hat{\boldsymbol{\beta}} }{1-\left(x^{(i)}\right)^T \left(X^T X\right)^{-1} x^{(i)}} \right)^2\), where \(\hat{\boldsymbol{\beta}}\) is the OLS fit on the full data set.
The variance of the cross-validation estimate is
\[
\operatorname{Var}(CV) = \frac{1}{K^2}\sum_{i=1}^{K}\operatorname{Var}\left(CV^{(i)}\right) + \frac{1}{K^2}\sum_{i\neq j}\operatorname{Cov}\left(CV^{(i)},CV^{(j)}\right),
\]
which indicates that if \(\operatorname{Cov}\left(CV^{(i)},CV^{(j)} \right)=0\) for \(i\neq j\), then larger \(K\) is better. However, if \(K\) is large, then \(\hat{f}^{(i)}\approx\hat{f}^{(j)}\) because the training sets are nearly identical, and thus \(\operatorname{Cov}\left(CV^{(i)},CV^{(j)} \right)\) might be large and positive.
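The closed-form LOOCV expression from the note above can be checked numerically; the following sketch (with arbitrary synthetic data) compares the shortcut against brute-force refitting.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 4
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.4, size=n)

beta = np.linalg.solve(X.T @ X, X.T @ Y)             # OLS fit on the full data set
# Leverages h_ii = x_i^T (X^T X)^{-1} x_i
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)
loocv_shortcut = np.mean(((Y - X @ beta) / (1 - h)) ** 2)

# Brute force: refit n times, leaving out one point each time
errs = []
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ Y[keep])
    errs.append((Y[i] - X[i] @ b_i) ** 2)
print(np.isclose(loocv_shortcut, np.mean(errs)))     # True
```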