Week 3¶
Recap on Singular Value Decomposition¶
Orthogonal Matrix¶
Let \(V\in \mathbb{R}^{p\times k}\), given by
Then \(V\) has orthonormal columns if
For such \(V\), we have \(V^T V = I_k\). Note that it is often NOT the case that \(V V^T = I_p\).
A square matrix \(V\in \mathbb{R}^{k\times k}\) with orthonormal columns is called an orthogonal matrix. Note that \(V^{-1} = V^T\).
Thin SVD¶
Consider a matrix \(A\in \mathbb{R}^{m\times n}\) (\(m\geq n\)) whose rank is \(r\). The thin SVD of \(A\) is a representation of the form
where \(U,V\) have orthonormal columns, and \(\Sigma\) is diagonal with the first \(r\) diagonal entries \(\sigma_1 \geq \dots \geq \sigma_r>0\), where \(\sigma_i\) is the length of \(A\boldsymbol{v}_i\).
The column vectors of \(U\) and \(V\) are respectively called the left and right singular vectors. The singular values of \(A\) are the diagonal elements of \(\Sigma\).
Remarkable result: Every matrix \(A\in \mathbb{R}^{m\times n}\) has an SVD.
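A small NumPy sketch (with an arbitrary random matrix; the shapes are chosen only for illustration) can be used to check these properties numerically.

```python
# Minimal sketch: thin SVD with NumPy and the orthonormality properties above.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))            # m = 6, n = 4, generically rank r = 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD: U is 6x4, Vt is 4x4
Sigma = np.diag(s)                                  # singular values, descending

print(np.allclose(A, U @ Sigma @ Vt))               # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))              # orthonormal columns: U^T U = I
print(np.allclose(U @ U.T, np.eye(6)))              # generally False: U U^T != I_m
print(np.allclose(s, np.linalg.norm(A @ Vt.T, axis=0)))  # sigma_i = ||A v_i||
```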
General SVD¶
We consider the singular vectors: the \(\boldsymbol{u}\)'s and \(\boldsymbol{v}\)'s give bases for the four fundamental subspaces:
- \(\boldsymbol{u}_1, \dots, \boldsymbol{u}_r \quad\) is an orthonormal basis for the column space
- \(\boldsymbol{u}_{r+1}, \dots, \boldsymbol{u}_m\) is an orthonormal basis for the left nullspace \(\boldsymbol{N}\left(A^T\right)\)
- \(\boldsymbol{v}_1, \dots, \boldsymbol{v}_r \quad\) is an orthonormal basis for the row space
- \(\boldsymbol{v}_{r+1}, \dots, \boldsymbol{v}_n\) is an orthonormal basis for the nullspace \(\boldsymbol{N}(A)\).
More than just orthogonality, these basis vectors diagonalize the matrix \(A\) :
Those singular values \(\sigma_1\) to \(\sigma_r\) will be positive numbers: \(\sigma_i\) is the length of \(A \boldsymbol{v}_i\). The \(\sigma\)'s go into a diagonal matrix that is otherwise zero. That matrix is \(\Sigma\).
Then the equations \(A \boldsymbol{v}_i=\sigma_i \boldsymbol{u}_i\) tell us column by column that \(\boldsymbol{A} \boldsymbol{V}_{\boldsymbol{r}}=\boldsymbol{U}_{\boldsymbol{r}} \boldsymbol{\Sigma}_{\boldsymbol{r}}\) :
Those \(\boldsymbol{v}\)'s and \(\boldsymbol{u}\)'s account for the row space and column space of \(A\). We have \(n-r\) more \(\boldsymbol{v}\)'s and \(m-r\) more \(\boldsymbol{u}\)'s, from the nullspace \(\boldsymbol{N}(A)\) and the left nullspace \(\boldsymbol{N}\left(A^T\right)\). They are automatically orthogonal to the first \(\boldsymbol{v}\)'s and \(\boldsymbol{u}\)'s. We now include all the \(\boldsymbol{v}\)'s and \(\boldsymbol{u}\)'s in \(V\) and \(U\), so these matrices become square. We still have \(\boldsymbol{A} \boldsymbol{V}=\boldsymbol{U} \boldsymbol{\Sigma}\):
The new \(\Sigma\) is \(m\times n\). It is just the \(r\times r\) matrix \(\Sigma_r\) with \(m-r\) extra zero rows and \(n-r\) new zero columns. The real change is in the shapes of \(U\) and \(V\). Those are square orthogonal matrices. So \(A V=U \Sigma\) can become \(\boldsymbol{A}=\boldsymbol{U} \boldsymbol{\Sigma} \boldsymbol{V}^{\mathbf{T}}\). This is the Singular Value Decomposition. I can multiply columns \(\boldsymbol{u}_i \sigma_i\) from \(U \Sigma\) by rows of \(V^{T}\):
Each \(\sigma_i^2\) is an eigenvalue of \(A^T A\)(1) and also \(A A^{T}\)(2). When we put the singular values in descending order(3), \(\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r>0\), the above splitting gives the \(r\) rank-one pieces of \(A\) in order of importance. Then \(\sigma_1\) is the maximum of the ratio:
- with the \(\boldsymbol{v}\)'s as orthonormal eigenvectors of \(A^TA\)
- with the \(\boldsymbol{u}\)'s as orthonormal eigenvectors of \(AA^T\)
- For a matrix \(A\in \mathbb{R}^{m\times n}\), we can define a column swapping matrix \(P_{i,j}\in \mathbb{R}^{n\times n}\), which swaps the \(i\)-th column and \(j\)-th column of \(A\) by right multiplication, i.e. \(AP_{i,j}\). Using \(P_{i,j}\), we can swap the positions of \(i\)-th singular value and \(j\)-th singular value of \(\Sigma\): \(U\Sigma V^T = \left(UP_{i,j}\right) \left(P_{i,j}^T \Sigma P_{i,j}\right) \left(V P_{i,j}\right)^T\).
The column swapping matrix \(P_{i,j}\) has the following properties: (let \(A\) have rows \(v_i\), \(i=1,\dots, m\) and columns \(w_j\), \(j=1,\dots,n\))
- \(AP_{i,j}\): \(w_i \leftrightarrow w_j\) (swaps columns \(i\) and \(j\)).
- \(P_{i,j}^T A = \left(A^TP_{i,j}\right)^T\): \(v_i \leftrightarrow v_j\) (swaps rows \(i\) and \(j\); here \(P_{i,j}\in\mathbb{R}^{m\times m}\)).
- \(P_{i,j}^T P_{i,j} = I\).
- \(P_{i,j} P_{i,j}^T = P_{i,j}^T P_{i,j} P_{i,j} P_{i,j}^T = I\).
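The following sketch (arbitrary random matrix, hypothetical helper `swap_matrix`) illustrates the identity \(U\Sigma V^T = \left(UP_{i,j}\right) \left(P_{i,j}^T \Sigma P_{i,j}\right) \left(V P_{i,j}\right)^T\) numerically.

```python
# Sketch: building P_{i,j} and checking the singular-value reordering identity.
import numpy as np

def swap_matrix(k, i, j):
    """Permutation matrix P_{i,j}: swaps positions i and j (0-based indices)."""
    P = np.eye(k)
    P[:, [i, j]] = P[:, [j, i]]
    return P

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Sigma, V = np.diag(s), Vt.T

P = swap_matrix(4, 0, 2)                          # swap 1st and 3rd singular value
print(np.allclose(P.T @ P, np.eye(4)))            # P is orthogonal
print(np.allclose(A, (U @ P) @ (P.T @ Sigma @ P) @ (V @ P).T))  # same product
print(np.diag(P.T @ Sigma @ P))                   # singular values reordered
```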
Spectral Decomposition¶
A spectral decomposition of a symmetric matrix \(A\in \mathbb{R}^{n\times n}\) is a representation of \(A\) as
where \(V = \begin{bmatrix}v_1, v_2, \dots, v_n \end{bmatrix}\) is orthogonal and \(D = {\rm diag}\{d_1,d_2,\dots,d_n\}\) is diagonal.
The columns of \(V\) are the eigenvectors of \(A\) and \(d_i\) is the associated eigenvalue:
Remarkable result (spectral theorem): Every symmetric matrix \(A\in \mathbb{R}^{n\times n}\) has a spectral decomposition.
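As a quick numerical illustration (arbitrary symmetric matrix), `np.linalg.eigh` returns exactly such a pair of orthonormal eigenvectors and eigenvalues.

```python
# Sketch: spectral decomposition of a symmetric matrix with np.linalg.eigh.
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                        # symmetrize to obtain a symmetric matrix

d, V = np.linalg.eigh(A)                 # eigenvalues (ascending) and orthonormal eigenvectors
print(np.allclose(A, V @ np.diag(d) @ V.T))      # A = V D V^T
print(np.allclose(V.T @ V, np.eye(4)))           # V is orthogonal
print(np.allclose(A @ V[:, 0], d[0] * V[:, 0]))  # A v_i = d_i v_i
```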
Recap on Multivariate Statistics¶
Notations¶
Let \(\boldsymbol{X}\) be a \(k\)-dimensional random vector (a special case of random matrices)
or a \(p\times q\) random matrix
where each \(X_i\) or \(X_{ij}\) is a random variable. \(E(\boldsymbol{X})\) is defined componentwise, i.e.
and \(E(\boldsymbol{X})\) is linear: for constant matrices \(A\in\mathbb{R}^{d\times p}\), \(B\in\mathbb{R}^{q\times q}\), \(C\in\mathbb{R}^{d\times q}\),
For a random vector \(\boldsymbol{X}\in \mathbb{R}^p\), its covariance matrix is defined as the \(p\times p\) symmetric matrix \(Cov(\boldsymbol{X})\), given by
Similar to the variance of a scalar random variable, we have
From the above, we have
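These properties can be sanity-checked by simulation. The sketch below (arbitrary covariance matrix, matrix \(A\), and vector \(b\)) compares the sample covariance of \(A\boldsymbol{X}+b\) with \(A\,Cov(\boldsymbol{X})\,A^T\), one of the standard identities referred to above.

```python
# Monte Carlo check of Cov(AX + b) = A Cov(X) A^T (illustrative values only).
import numpy as np

rng = np.random.default_rng(3)
p, n = 3, 200_000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])                      # positive definite
X = rng.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=n)  # rows are draws of X

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])                         # constant 2x3 matrix
b = np.array([5.0, -1.0])

Y = X @ A.T + b                                          # samples of AX + b
print(np.cov(Y, rowvar=False))                           # approx. A Cov(X) A^T
print(A @ np.cov(X, rowvar=False) @ A.T)
```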
Multivariate Normal Distribution¶
The multivariate normal distribution of a \(k\)-dimensional random vector \(\boldsymbol{X}=\left(X_1, \ldots, X_k\right)^{T}\) can be written in the following notation:
with \(k\)-dimensional mean vector
and \(k \times k\) covariance matrix
\(\boldsymbol{\Sigma}\) is assumed to be positive definite (i.e. non-degenerate) and therefore, \(\boldsymbol{\Sigma}^{-1}\) is also positive definite. In this case, the density of \(\boldsymbol{X}\) is given by
Fact: for a full-rank matrix \(A\in \mathbb{R}^{p\times q}\) (\(q\leq p\)) and \(b\in\mathbb{R}^q\), we have
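The density formula above can be evaluated directly and cross-checked against `scipy.stats.multivariate_normal`; the mean, covariance, and evaluation point below are arbitrary.

```python
# Sketch: evaluating the multivariate normal density and checking it with SciPy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])           # positive definite covariance
x = np.array([0.5, -1.0])

k = len(mu)
diff = x - mu
dens = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
       / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))

print(dens)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))    # should agree
```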
Linear Regression¶
The basic idea of linear regression is to assume that
where \(\boldsymbol{x} = [x_1, x_2, \dots, x_p]^T\) (for now we assume \(\boldsymbol{x}\in \mathbb{R}^p\)). More precisely, we make an assumption about \(p(y\mid\boldsymbol{x})\) as follows:
where \(\varepsilon\) is noise with \(\varepsilon \sim \mathcal{N}({0,\sigma^2})\) and \(\boldsymbol{x}\) is independent of the noise. Thus, we have
We know that (under squared error loss) the oracle predictor is
The goal is to find \(\beta_0,\dots,\beta_p\) and thus \(f^*\).
Rewrite Training Data in Matrix Form¶
Assume the training data \(\{(\boldsymbol{x}^{(i)},y^{(i)})\}_{i=1}^n\) were generated i.i.d. according to the assumption about \(p(y\mid\boldsymbol{x})\). To simplify the notation, we define
and design matrix \(\boldsymbol{X}\in \mathbb{R}^{n\times(p+1)}\) given by
Then the training data can be written as
Now we want to use training data to estimate \(\beta_0,\dots,\beta_p\) and thus \(f^*\).
Recap on Likelihood Function¶
The likelihood function \(L(\theta\mid \boldsymbol{x})\) measures how plausible each parameter value \(\theta\) is given the data set \(\boldsymbol{x} = (x_1,\dots,x_n)^T\).
More precisely, assume we have \(n\) samples \((x_1,\dots,x_n)\) observed from a distribution \(p_{\theta}(x)\) with an unknown parameter \(\theta\), and our goal is to estimate \(\theta\) using the observed data.
We view the observed samples as realizations of random variables \(X_1, X_2, \dots, X_n\), which have a joint density function \(p\left(X_1, \dots, X_n \mid \theta\right)\). Given \(X_1=x_1, \dots, X_n=x_n\), we may consider how likely it is to observe this event, which gives the likelihood function (a function of \(\theta\)) defined by:
Note that the likelihood function is NOT a probability density function. It measures the support provided by the data for each possible value of the parameter. If we compare the likelihood function at two parameter points and find that
then the sample we actually observed is more likely to have occurred if \(\theta=\theta_1\) than if \(\theta=\theta_2\). This can be interpreted as saying that \(\theta_1\) is a more plausible value for \(\theta\) than \(\theta_2\). Therefore, one approach to estimating \(\theta\) is to choose the value of \(\theta\) that gives the highest likelihood among all possible values.
With the assumption that samples are drawn i.i.d., the joint density is the product of the individual densities, i.e.
hence taking the log transforms this into a summation, which is usually easier to maximize analytically. Thus, we often write the log-likelihood rather than the likelihood.
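As a toy illustration (assumed model \(X_i \sim \mathcal{N}(\theta, 1)\) with simulated data, not an example from the notes), the log-likelihood can be evaluated on a grid and its maximizer compared with the analytic MLE, the sample mean.

```python
# Sketch: log-likelihood of i.i.d. N(theta, 1) samples and its maximizer.
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=1.0, size=50)     # observed samples, true theta = 2

def log_likelihood(theta, x):
    # sum of log N(x_i; theta, 1) densities
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

grid = np.linspace(0, 4, 401)
values = np.array([log_likelihood(t, x) for t in grid])
print(grid[values.argmax()])      # numerical maximizer on the grid
print(x.mean())                   # analytic MLE for this model
```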
Estimate the Coefficients \(\beta_i\)¶
Here we find an estimate of \(\boldsymbol{\beta}\) by maximum likelihood estimation:
Since \(\boldsymbol{X}\) is independent of the noise, we have
and thus, the (log-)likelihood function is given by the density function of the above multivariate normal distribution.
By maximizing the likelihood function, we can estimate \(\boldsymbol{\beta}\):
which becomes a least squares problem. We take the gradient with respect to \(\boldsymbol{\beta}\), set it to 0, and solve for \(\boldsymbol{\beta}\), which gives
where we assume \(\boldsymbol{X}\) is a full-rank matrix and thus \(\boldsymbol{X}^T \boldsymbol{X}\) has an inverse(1).
- Otherwise, if \(\boldsymbol{X}^T \boldsymbol{X}\) is not invertible, there exists \(v\in \mathbb{R}^{p+1}\), \(v\neq \boldsymbol{0}\), such that \(\boldsymbol{X}^T \boldsymbol{X} v = \boldsymbol{0}\); then \(\|\boldsymbol{X}v\|^2 = v^T\boldsymbol{X}^T \boldsymbol{X} v = 0\), so \(\boldsymbol{X}v = \boldsymbol{0}\) and the columns of \(\boldsymbol{X}\) are linearly dependent. The columns of \(\boldsymbol{X}\) are always linearly dependent when \(p\geq n\).
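A minimal sketch of the closed-form estimate \(\hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}\) on simulated data (dimensions and coefficients chosen arbitrarily):

```python
# Sketch: OLS/MLE estimate on simulated data.
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
beta_true = np.array([1.0, 2.0, -1.0, 0.5])          # beta_0, ..., beta_p
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p+1) design matrix
Y = X @ beta_true + rng.normal(scale=0.5, size=n)            # y = x^T beta + noise

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # closed-form OLS solution
print(beta_hat)

# np.linalg.lstsq solves the same least-squares problem and is numerically safer.
print(np.linalg.lstsq(X, Y, rcond=None)[0])
```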
Furthermore, we can find the distribution of \(\hat{\boldsymbol{\beta}}\) conditional on the training features \(\boldsymbol{X}\):
which indicates that \(\hat{\boldsymbol{\beta}}\) is unbiased;
and therefore,
Evaluate OLS Prediction Estimate¶
The above estimation is called ordinary least squares (OLS) in statistics. Now we evaluate our estimation when receiving a new data pair \((\boldsymbol{x_{*}},y_*)\). We make our prediction by
Consider the bias of \(\hat{f}\) (conditional on \(\boldsymbol{X},\boldsymbol{x_{*}}\)),
We know that \(E(y_* \mid \boldsymbol{x_{*}}) = \tilde{\boldsymbol{x}}^T \boldsymbol{\beta}\), and thus the OLS prediction is unbiased(1).
- This conclusion relies on the assumption that the data follow a linear relationship. In practice, our model is only an approximation to the true distribution, so there will generally be some bias.
Now, we consider the variance:
Roughly speaking, if \(\boldsymbol{x_{*}}\) is similar to the training data features, the variance will be small; otherwise, the variance will be large.
Interval Estimate¶
We want to make a prediction interval for \(y_*\) in the form of
We want \(y_*\) to be contained in this interval with probability at least \(1-\alpha\) (usually \(\alpha = 0.05\) or \(\alpha = 0.01\)), i.e.
Recall the assumption that \(y_*\mid \boldsymbol{\tilde{x}} \sim \mathcal{N}(\boldsymbol{\tilde{x}}^T\boldsymbol{\beta}, \sigma^2)\). By
and \(Cov\left(y_*, \boldsymbol{\tilde{x}}^T\hat{\boldsymbol{\beta}} \mid \boldsymbol{X}, \boldsymbol{x_{*}}\right)=0\)(1),
- This is because \(\hat{\boldsymbol{\beta}}\) only depends on the training data and we assume the newly received data pair is independent of the training data.
we know that \(y_* - \boldsymbol{\tilde{x}}^T\hat{\boldsymbol{\beta}}\) is still normally distributed, given by
To obtain the interval estimation, we also need to get rid of \(\sigma^2\) as it is unknown. Recall the assumption that the noise is normally distributed. We can estimate the variance \(\sigma^2\) by the scaled residuals, i.e.
Then we have
follows a t-distribution with \(n-p\) degrees of freedom. Let \(t_{n-p,\alpha/2}\) and \(t_{n-p,1-\alpha/2}\) be the \(\alpha/2\) and \(1-\alpha/2\) quantiles of this distribution. Denote \(\sqrt{\hat{\sigma}^2\left(1+\boldsymbol{\tilde{x}}^T(\boldsymbol{X}^T \boldsymbol{X})^{-1}\tilde{\boldsymbol{x}}\right)}\) by \(C\). Then, we have
that is
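Putting the pieces together, here is a sketch of the resulting prediction interval on simulated data (all values are arbitrary, and the degrees of freedom are taken as \(n\) minus the number of fitted coefficients).

```python
# Sketch: t-based prediction interval for a new observation y_*.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p, sigma = 100, 3, 0.5
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ beta_true + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat
df = n - X.shape[1]                              # residual degrees of freedom
sigma2_hat = resid @ resid / df                  # estimated noise variance

x_new = np.array([1.0, 0.2, -0.5, 1.0])          # [1, x_*^T]: new feature vector
C = np.sqrt(sigma2_hat * (1 + x_new @ np.linalg.solve(X.T @ X, x_new)))
alpha = 0.05
t_quant = stats.t.ppf(1 - alpha / 2, df)

y_pred = x_new @ beta_hat
print((y_pred - t_quant * C, y_pred + t_quant * C))   # 95% prediction interval for y_*
```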
Examples and Discrete Features¶
So far we assume \(\boldsymbol{x}\) lies in \(\mathbb{R}^p\). Here we will introduce discrete features and mixed features by giving some examples.
Example Polynomial Regression (\(\mathcal{X} = \mathbb{R}\)):
The model takes \(x\) and transforms it to a new feature
and the model becomes
i.e. we use this new feature vector in a regression. For instance, assume we have 4 data pairs for training and use a degree-2 polynomial: the training data is given by
We first map this training data to the design matrix \(\boldsymbol{X}\), given by
The OLS estimate is still \(\hat{\boldsymbol{\beta}} = \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T\boldsymbol{Y}\). This allows us to fit \(y\) with a non-linear function of \(x\).
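A sketch of this construction (the four training points below are placeholders, not the notes' example data):

```python
# Sketch: degree-2 polynomial design matrix and its OLS fit.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])          # 4 training inputs (illustrative)
y = np.array([1.0, 2.0, 0.5, 4.0])          # corresponding responses (illustrative)

X = np.column_stack([np.ones_like(x), x, x ** 2])   # columns: 1, x, x^2
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS fit of a quadratic in x
print(X)
print(beta_hat)
```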
Example Categorical Feature:
A categorical feature could take, for instance, 3 categories:
We transform categorical \(x\)'s via indicator functions.
The first way is the dummy variable method: by choosing a baseline class (say class '3'), we transform
and the regression model becomes
where we have \(E(y\mid x=3) = \beta_0\), \(E(y\mid x=1) = \beta_0+\beta_1\), and \(E(y\mid x=2) = \beta_0 + \beta_2\).
Another way is more symmetric: we have the model
where we have \(E(y\mid x=1) = \beta_0 + \beta_1\), \(E(y\mid x=2) = \beta_0+\beta_2\), and \(E(y\mid x=3) = \beta_0 + \beta_3\). Since \(\beta_0\) is redundant, it can be dropped.
Assume we have training data:
and the corresponding design matrix is
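A sketch of both encodings (the categorical labels below are placeholders, not the notes' example data):

```python
# Sketch: encoding a 3-level categorical feature.
import numpy as np

x = np.array([1, 3, 2, 2, 1])                      # categorical training inputs

# Dummy-variable (baseline) encoding: columns 1, 1{x=1}, 1{x=2}; class 3 is the baseline
X_dummy = np.column_stack([np.ones(len(x)),
                           (x == 1).astype(float),
                           (x == 2).astype(float)])

# Symmetric encoding without intercept: columns 1{x=1}, 1{x=2}, 1{x=3}
X_sym = np.column_stack([(x == k).astype(float) for k in (1, 2, 3)])

print(X_dummy)
print(X_sym)
```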
Example Mixed Features:
We can combine continuous and categorical features. For instance, if \(x\in \{1,2,3\}\times\mathbb{R}\), we can do a similar mapping:
Assume the training data is
then the design matrix is given by
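A sketch with placeholder values for the mixed-feature design matrix:

```python
# Sketch: design matrix for x = (x1, x2), x1 in {1,2,3} categorical, x2 continuous.
import numpy as np

x1 = np.array([1, 2, 3, 1])                        # categorical component
x2 = np.array([0.5, -1.0, 2.0, 0.0])               # continuous component

X = np.column_stack([
    np.ones_like(x2),                              # intercept
    (x1 == 1).astype(float),                       # 1{x1 = 1}
    (x1 == 2).astype(float),                       # 1{x1 = 2} (class 3 as baseline)
    x2,                                            # continuous feature enters directly
])
print(X)
```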
Example Interactions between Features:
We can also have interactions between features. For instance, if \(\mathcal{X}=\mathbb{R}^2\), then we can include their product in the mapping:
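A sketch with placeholder values, adding the product \(x_1 x_2\) as an extra column:

```python
# Sketch: design matrix with an interaction term x1 * x2.
import numpy as np

x1 = np.array([0.5, 1.0, -0.3, 2.0])
x2 = np.array([1.0, -1.0, 0.7, 0.2])

X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])   # columns: 1, x1, x2, x1*x2
print(X)
```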
Remark Why don't we just use lots of features? E.g. in polynomial regression, why don't we take the degree \(d\) to be large? Because of problems with overfitting! This is discussed under the bias-variance tradeoff.
Variable Selection¶
Assume we have many features, i.e. \(\boldsymbol{X}\) has many columns. Given an index set \(I = \left(i_1,i_2,\dots,i_k\right)\), where \(i_1\leq i_2\leq \dots\leq i_k\), define
Then we can obtain a submodel with the features in \(I\), given by
Because of the problem of overfitting, we may not want to use the empirical risk to evaluate the submodel (the empirical risk will always decrease when adding new features). A better idea is to add a penalty for including new features:
where \(\hat{\boldsymbol{\beta}}\) is estimated using the full model and \(\hat{\sigma}\) is given by
We want to choose a proper index set \(I\) to minimize the AIC or BIC. For simplicity, we may view \(I\) as a binary string of length \(p\), where 0 indicates that a feature is not included and 1 indicates that it is included: for instance,
is an index set of 5 features where the second and fifth feature are included. In this way, \(I\) is a point in \(\{0,1\}^p\) and we may visualize the choice of \(I\) as picking a vertex of a hypercube. Minimizing AIC is the same as searching for the optimal vertex on the cube.
If we enumerate the AIC for every \(I\), this is called best subset selection. For \(p>40\) or so, this exhaustive search becomes infeasible. Instead, we can do a greedy search over the cube.
Forward selection:
- Start with the trivial null model \(y = \beta_0 + \varepsilon\).
- Add a feature by searching over \(i\) for the model \(y = \beta_0 + \beta_i x_i + \varepsilon\) with the smallest AIC/BIC. Call this best \(i\) \(i_1\).
- Continue by finding a \(j\) for which the model \(y=\beta_0 + \beta_{i_1} x_{i_1} + \beta_j x_j + \varepsilon\) minimizes the AIC/BIC. Call this \(j\) \(i_2\).
- Keep doing this until we reach the full model.
- Choose the model in our sequence with the smallest AIC/BIC.
Backward selection: start at the full model and remove variables one at a time until reaching the null model.
Best subset selection needs to fit \(2^p\) linear models. On the other hand, forward (backward) selection needs to fit
models, which is on the order of \(O(p^2)\) model fits.
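A greedy forward-selection sketch is given below. The notes' exact AIC formula is defined above; the common Gaussian form \(n\log(\mathrm{RSS}/n) + 2k\) used here is a stand-in assumption, as are the simulated data.

```python
# Sketch: forward selection that scores candidate submodels by a Gaussian AIC.
import numpy as np

def aic(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + 2 * X.shape[1]

def forward_selection(X, y):
    """X: n x p feature matrix (no intercept column).
    Returns (selected feature indices, AIC) of the best model along the path."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected, remaining = [], list(range(p))
    path = [(list(selected), aic(intercept, y))]          # null model y = beta_0 + eps
    while remaining:
        scores = [(aic(np.column_stack([intercept, X[:, selected + [j]]]), y), j)
                  for j in remaining]
        best_score, best_j = min(scores)                  # feature with smallest AIC
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((list(selected), best_score))
    # choose the model in the sequence with the smallest AIC
    return min(path, key=lambda t: t[1])

rng = np.random.default_rng(7)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)   # only features 0 and 3 matter
print(forward_selection(X, y))
```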