Week 3

Recap on Singular Value Decomposition

Orthogonal Matrix

Let $Q \in \mathbb{R}^{n \times n}$ have columns $q_1, \dots, q_n$ that are orthonormal, i.e. $q_i^T q_j = 1$ if $i = j$ and $q_i^T q_j = 0$ otherwise. Then $Q^T Q = Q Q^T = I_n$. For such $Q$ we have $Q^{-1} = Q^T$, and $\|Qx\|_2 = \|x\|_2$ for every $x \in \mathbb{R}^n$. A square matrix $Q$ with these properties is called an orthogonal matrix.
Definition (Thin SVD)

For a matrix $A \in \mathbb{R}^{n \times p}$ with $n \ge p$, a thin singular value decomposition is a factorization
$$A = U \Sigma V^T,$$
where $U \in \mathbb{R}^{n \times p}$ has orthonormal columns, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p)$ with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$, and $V \in \mathbb{R}^{p \times p}$ is orthogonal. The column vectors of $U$ are the left singular vectors, the column vectors of $V$ are the right singular vectors, and the $\sigma_j$ are the singular values.

Remarkable result: every matrix $A \in \mathbb{R}^{n \times p}$ has a thin SVD.
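A minimal numerical sketch: NumPy's `np.linalg.svd` with `full_matrices=False` returns exactly this thin factorization; the matrix below is an arbitrary example.

```python
import numpy as np

# A minimal sketch: thin SVD of an arbitrary 5x3 example matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

# full_matrices=False gives the thin SVD: U is 5x3, s has 3 entries, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Check A = U diag(s) V^T and the orthonormality of the factors.
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True
print(np.allclose(U.T @ U, np.eye(3)))       # True: orthonormal columns of U
print(np.allclose(Vt @ Vt.T, np.eye(3)))     # True: V is orthogonal
```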
Spectral Decomposition

A spectral decomposition of a symmetric matrix $A \in \mathbb{R}^{n \times n}$ is a factorization
$$A = Q \Lambda Q^T,$$
where $Q \in \mathbb{R}^{n \times n}$ is orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$ is diagonal. The columns of $Q$ are eigenvectors of $A$, and the diagonal entries $\lambda_j$ are the corresponding eigenvalues.

Remarkable result (spectral theorem): every symmetric matrix $A \in \mathbb{R}^{n \times n}$ has such a decomposition, with all eigenvalues real.
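A minimal sketch along the same lines: `np.linalg.eigh` computes the spectral decomposition of a symmetric matrix; the example matrix is arbitrary and symmetrized by hand.

```python
import numpy as np

# A minimal sketch: spectral decomposition A = Q Lambda Q^T of a symmetric matrix.
rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                        # symmetrize an arbitrary example matrix

eigvals, Q = np.linalg.eigh(A)           # eigh is for symmetric/Hermitian matrices
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ Q.T))     # True: A = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))   # True: Q is orthogonal
```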
Recap on Multivariate Statistics

Notations

Let $X = (X_1, \dots, X_p)^T$ denote a random vector, or a $p \times 1$ column vector of random variables, where each $X_j$ is a real-valued random variable. Its mean vector is
$$\mu = E[X] = (E[X_1], \dots, E[X_p])^T,$$
and its covariance matrix is
$$\mathrm{Cov}(X) = E\big[(X - \mu)(X - \mu)^T\big].$$
For random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, similar to the covariance of random variables, we have the cross-covariance
$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])^T\big].$$
From the above, we have $\mathrm{Cov}(X) = E[XX^T] - \mu \mu^T$, and, for a fixed matrix $A$ and vector $b$, $E[AX + b] = A\mu + b$ and $\mathrm{Cov}(AX + b) = A\,\mathrm{Cov}(X)\,A^T$.
Multivariate Normal Distribution

The multivariate normal distribution of a $p$-dimensional random vector $X \sim N_p(\mu, \Sigma)$, with mean vector $\mu \in \mathbb{R}^p$ and positive definite covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$, has density
$$f(x) = (2\pi)^{-p/2} \, |\Sigma|^{-1/2} \exp\Big(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big).$$

Fact: for a full-rank matrix $A$ and a vector $b$, if $X \sim N_p(\mu, \Sigma)$ then $AX + b \sim N(A\mu + b, \, A \Sigma A^T)$.
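A minimal sketch that checks this fact by simulation, using arbitrary example values for $\mu$, $\Sigma$, $A$, and $b$:

```python
import numpy as np

# A minimal sketch: if X ~ N(mu, Sigma) and A has full rank, then
# AX + b ~ N(A mu + b, A Sigma A^T).  All numbers below are arbitrary examples.
rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [0.0, 2.0]])
b = np.array([3.0, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)  # samples of X, shape (n, 2)
Z = X @ A.T + b                                       # samples of AX + b

print(np.round(Z.mean(axis=0), 2), A @ mu + b)        # empirical vs. theoretical mean
print(np.round(np.cov(Z.T), 2))                       # empirical covariance ...
print(A @ Sigma @ A.T)                                # ... vs. A Sigma A^T
```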
Linear Regression

The basic idea of linear regression is to assume that
$$Y = X^T \beta + \epsilon,$$
where $X \in \mathbb{R}^p$ is the feature vector and $\beta \in \mathbb{R}^p$ is the unknown coefficient vector, and where the noise satisfies $\epsilon \sim N(0, \sigma^2)$ independently of $X$. We know that (under squared error loss) the oracle predictor is the conditional mean
$$f^*(x) = E[Y \mid X = x] = x^T \beta.$$
The goal is to find an estimate $\hat{\beta}$ of $\beta$ from training data, so that a new response can be predicted by $x^T \hat{\beta}$.
Rewrite Training Data in Matrix Form

Assume training data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, so that $y_i = x_i^T \beta + \epsilon_i$ with i.i.d. noise $\epsilon_i \sim N(0, \sigma^2)$. Write $y = (y_1, \dots, y_n)^T$, $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$, and design matrix
$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}.$$
Then the training data can be written as
$$y = X\beta + \epsilon.$$
Now we want to use the training data to estimate $\beta$ (and the noise variance $\sigma^2$).
Recap on Likelihood Function

The likelihood function is the joint density of the observed data, viewed as a function of the unknown parameter. More precisely, assume we have observations $x_1, \dots, x_n$ from a distribution with density $f(\cdot\,;\theta)$ indexed by a parameter $\theta$. We view the observed samples as realizations of some random variables $X_1, \dots, X_n$ and define the likelihood as
$$L(\theta) = f(x_1, \dots, x_n; \theta).$$
Note that the likelihood function is NOT a probability density function in $\theta$. It measures the support provided by the data for each possible value of the parameter. If we compare the likelihood function at two parameter points and find that
$$L(\theta_1) > L(\theta_2),$$
then the sample we actually observed is more likely to have occurred if $\theta = \theta_1$ than if $\theta = \theta_2$. With the assumption that samples are drawn i.i.d., the joint density is given by a product of marginal densities, i.e.
$$L(\theta) = \prod_{i=1}^n f(x_i; \theta);$$
hence taking the log transforms this into a summation,
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i; \theta),$$
which is usually easier to maximize analytically. Thus, we often work with the log-likelihood rather than the likelihood.
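A minimal sketch for a $N(\theta, 1)$ model on an arbitrary simulated sample, showing the i.i.d. log-likelihood as a sum of log densities and comparing it at two parameter values:

```python
import numpy as np
from scipy.stats import norm

# A minimal sketch: log-likelihood of an i.i.d. N(theta, 1) sample.
rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # simulated sample with true mean 2.0

def log_likelihood(theta, x):
    # l(theta) = sum_i log f(x_i; theta)
    return norm.logpdf(x, loc=theta, scale=1.0).sum()

# The parameter value with the larger log-likelihood is better supported by the data.
print(log_likelihood(2.0, x), log_likelihood(0.0, x))

# For this model the maximizer (MLE) is the sample mean.
print(x.mean())
```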
Estimate the Coefficients

Here we find an estimate of $\beta$ by maximum likelihood. By the model assumption $\epsilon \sim N(0, \sigma^2 I_n)$, we have
$$y \mid X \sim N(X\beta, \sigma^2 I_n),$$
and thus, the (log-)likelihood function is given by the density function of the above multivariate normal distribution:
$$\ell(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \|y - X\beta\|_2^2.$$
By maximizing the likelihood function, we can estimate $\beta$ by
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2,$$
which becomes a least squares problem. We take a gradient with respect to $\beta$ and set it to zero:
$$\nabla_\beta \|y - X\beta\|_2^2 = -2 X^T (y - X\beta) = 0 \quad\Longrightarrow\quad \hat{\beta} = (X^T X)^{-1} X^T y,$$
where we assume $X^T X$ is invertible, i.e. $X$ has full column rank. Furthermore, we can find the distribution of $\hat{\beta}$. Substituting $y = X\beta + \epsilon$ gives
$$\hat{\beta} = (X^T X)^{-1} X^T (X\beta + \epsilon) = \beta + (X^T X)^{-1} X^T \epsilon,$$
which indicates that $\hat{\beta}$ is a linear transformation of the Gaussian vector $\epsilon$, and therefore,
$$\hat{\beta} \sim N\big(\beta, \; \sigma^2 (X^T X)^{-1}\big).$$
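A minimal sketch on synthetic data (the true $\beta$ and $\sigma$ below are arbitrary choices), computing $\hat{\beta}$ from the normal equations together with its estimated covariance:

```python
import numpy as np

# A minimal sketch of the OLS estimate beta_hat = (X^T X)^{-1} X^T y on synthetic data.
rng = np.random.default_rng(4)
n, p = 200, 3
beta_true = np.array([1.0, -2.0, 0.5])   # arbitrary true coefficients
sigma = 0.3                              # arbitrary noise level

X = rng.normal(size=(n, p))                        # design matrix
y = X @ beta_true + sigma * rng.normal(size=n)     # y = X beta + eps

# Solve the normal equations (np.linalg.lstsq is the numerically preferred route).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                                    # close to beta_true

# Plug-in covariance sigma^2 (X^T X)^{-1}, with hat sigma^2 = RSS / (n - p).
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)
print(sigma2_hat * np.linalg.inv(X.T @ X))
```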
Evaluate OLS Prediction Estimate

The above estimation is called ordinary least squares (OLS) in statistics. Now we evaluate our estimate when receiving a new data pair $(x_0, y_0)$ with $y_0 = x_0^T \beta + \epsilon_0$, where we predict $y_0$ by $\hat{y}_0 = x_0^T \hat{\beta}$.

Consider the bias of the prediction. We know that $E[\hat{\beta}] = \beta$, so
$$E[x_0^T \hat{\beta}] = x_0^T \beta,$$
i.e. the prediction is unbiased. Now, we consider the variance:
$$\mathrm{Var}(x_0^T \hat{\beta}) = \sigma^2 \, x_0^T (X^T X)^{-1} x_0.$$
Roughly speaking, if $n$ is large relative to $p$ (so that $X^T X$ grows proportionally to $n$), this variance is of order $1/n$ and the prediction concentrates around $x_0^T \beta$.
Interval Estimate

We want to make a prediction interval for the new response $y_0$ at $x_0$. We hope to find an interval that contains $y_0$ with probability $1 - \alpha$ (for instance 95%). Recall the assumption that $\epsilon \sim N(0, \sigma^2 I_n)$ and $\epsilon_0 \sim N(0, \sigma^2)$ independently of the training data, and that
$$\hat{\beta} \sim N\big(\beta, \sigma^2 (X^T X)^{-1}\big);$$
we know that
$$y_0 - x_0^T \hat{\beta} \sim N\Big(0, \; \sigma^2 \big(1 + x_0^T (X^T X)^{-1} x_0\big)\Big).$$
To obtain the interval estimate, we also need to get rid of the unknown $\sigma^2$, which we replace by the estimate
$$\hat{\sigma}^2 = \frac{1}{n - p} \|y - X\hat{\beta}\|_2^2.$$
Then we have that
$$\frac{y_0 - x_0^T \hat{\beta}}{\hat{\sigma} \sqrt{1 + x_0^T (X^T X)^{-1} x_0}}$$
follows a t-distribution with $n - p$ degrees of freedom, that is, a $(1 - \alpha)$ prediction interval is
$$x_0^T \hat{\beta} \; \pm \; t_{n-p,\,1-\alpha/2} \, \hat{\sigma} \sqrt{1 + x_0^T (X^T X)^{-1} x_0}.$$
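A minimal sketch computing this 95% prediction interval on synthetic data, with an arbitrary new point $x_0$:

```python
import numpy as np
from scipy.stats import t

# A minimal sketch of the 95% prediction interval at a new point x0 (synthetic data).
rng = np.random.default_rng(5)
n, p = 100, 3
beta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, p))
y = X @ beta_true + 0.5 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - p))      # hat sigma

x0 = np.array([0.5, 1.0, -1.0])                   # arbitrary new feature vector
pred = x0 @ beta_hat
half_width = (t.ppf(0.975, df=n - p)
              * sigma_hat * np.sqrt(1.0 + x0 @ XtX_inv @ x0))
print(pred - half_width, pred + half_width)       # 95% prediction interval for y0
```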
Examples and Discrete Features

So far we assume the features are continuous and enter the model directly. In fact, linear regression is much more flexible: we can first map the raw features to transformed features and then run a linear regression on the transformed features. The following examples illustrate this, including how to handle discrete features.

Example Polynomial Regression

The model takes powers of a single feature $x \in \mathbb{R}$ as new features,
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \epsilon,$$
and the model becomes linear in the transformed feature vector $(1, x, x^2, \dots, x^d)^T$, so we can use this new feature in a regression. For instance, assume we have 4 data pairs for training and use a degree-2 polynomial in regression: training data is given by $(x_1, y_1), \dots, (x_4, y_4)$. We first map the training data to the design matrix
$$X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ 1 & x_3 & x_3^2 \\ 1 & x_4 & x_4^2 \end{pmatrix}.$$
The OLS estimate is still $\hat{\beta} = (X^T X)^{-1} X^T y$.
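A minimal sketch with made-up values for the 4 training pairs, building the $4 \times 3$ polynomial design matrix and computing the OLS fit:

```python
import numpy as np

# A minimal sketch: degree-2 polynomial regression with 4 training points
# (the x and y values are arbitrary examples).
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([2.1, 1.0, 1.9, 5.2])

# Map x_i to the row (1, x_i, x_i^2) to form the 4x3 design matrix.
X = np.column_stack([np.ones_like(x), x, x ** 2])
print(X)

# The OLS estimate has the same form as before.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```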
Example Categorical Feature

A categorical feature could be, for instance, a feature with 3 categories labelled '1', '2', '3'. We transform the categorical feature into numerical columns before running the regression; there are two common ways.

The first way is the dummy variable method: by choosing a baseline class (say class '3'), we transform $x$ into the two indicators $\mathbb{1}\{x = 1\}$ and $\mathbb{1}\{x = 2\}$, and the regression model becomes
$$y = \beta_0 + \beta_1 \mathbb{1}\{x = 1\} + \beta_2 \mathbb{1}\{x = 2\} + \epsilon,$$
where we have $\beta_0$ as the mean response of the baseline class '3', and $\beta_1, \beta_2$ as the offsets of classes '1' and '2' relative to the baseline.

Another way is more symmetric -- we have the model
$$y = \beta_1 \mathbb{1}\{x = 1\} + \beta_2 \mathbb{1}\{x = 2\} + \beta_3 \mathbb{1}\{x = 3\} + \epsilon$$
(one indicator per class and no intercept), where we have $\beta_k$ as the mean response of class $k$.

Assume we have training data $(x_1, y_1), \dots, (x_n, y_n)$ with $x_i \in \{1, 2, 3\}$; the corresponding design matrix then has one row per observation, namely $(1, \mathbb{1}\{x_i = 1\}, \mathbb{1}\{x_i = 2\})$ under the dummy coding, or $(\mathbb{1}\{x_i = 1\}, \mathbb{1}\{x_i = 2\}, \mathbb{1}\{x_i = 3\})$ under the symmetric coding.
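A minimal sketch constructing both design matrices for a made-up sample with classes in $\{1, 2, 3\}$:

```python
import numpy as np

# A minimal sketch of the two encodings for a categorical feature with classes {1, 2, 3}.
# The observed classes and responses below are arbitrary example values.
x_cat = np.array([1, 3, 2, 3, 1, 2])
y = np.array([1.2, 0.4, 2.3, 0.6, 1.1, 2.0])

# (a) Dummy-variable coding with class 3 as baseline: columns (1, 1{x=1}, 1{x=2}).
X_dummy = np.column_stack([np.ones_like(y),
                           (x_cat == 1).astype(float),
                           (x_cat == 2).astype(float)])

# (b) Symmetric coding: one indicator column per class, no intercept.
X_sym = np.column_stack([(x_cat == k).astype(float) for k in (1, 2, 3)])

print(X_dummy)
print(X_sym)
# Either design matrix can be plugged into the usual OLS formula.
```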
Example Mixed Features

We can combine continuous and categorical features. For instance, if $x = (x^{(1)}, x^{(2)})$ with $x^{(1)} \in \mathbb{R}$ continuous and $x^{(2)} \in \{1, 2, 3\}$ categorical, we can use the transformed feature vector $(1, x^{(1)}, \mathbb{1}\{x^{(2)} = 1\}, \mathbb{1}\{x^{(2)} = 2\})$. Assume the training data is $(x_1, y_1), \dots, (x_n, y_n)$; then the design matrix is given by stacking these transformed rows, one per observation, and the OLS formula applies unchanged.
Example Interactions between Features

We can also have interactions between features. For instance, if $x = (x^{(1)}, x^{(2)})$, we can add the product $x^{(1)} x^{(2)}$ as an extra column of the design matrix, giving the model
$$y = \beta_0 + \beta_1 x^{(1)} + \beta_2 x^{(2)} + \beta_3 x^{(1)} x^{(2)} + \epsilon.$$
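A minimal sketch adding such an interaction column to the design matrix, using made-up data values:

```python
import numpy as np

# A minimal sketch: adding an interaction column x1 * x2 to the design matrix.
# The data values are arbitrary examples.
x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
x2 = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
y = np.array([1.9, 1.1, 3.2, 1.8, 4.1])

# Columns: intercept, x1, x2, and the interaction x1*x2.
X = np.column_stack([np.ones_like(y), x1, x2, x1 * x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```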
Remark Why don't we just use lots of features? I.e. in polynomial regression, why don't we take the degree $d$ as large as possible? The reason is overfitting: adding features always reduces the empirical (training) risk, but it inflates the variance of $\hat{\beta}$ and can make predictions on new data worse. This motivates selecting a good subset of features, which is the topic of the next section.
Variable Selection

Assume we have many features, i.e. $x \in \mathbb{R}^p$ with $p$ large, and let $S \subseteq \{1, \dots, p\}$ be an index set of features. Then we can obtain a submodel with features in $S$ by regressing $y$ on the columns of $X$ indexed by $S$ only.

Because of the problem of overfitting, we may not want to use empirical risks to evaluate the submodel (the empirical risk will always decrease when adding new features). A better idea is to add a penalty for including new features:
$$\mathrm{AIC}(S) = -2\,\ell(\hat{\beta}_S) + 2|S|, \qquad \mathrm{BIC}(S) = -2\,\ell(\hat{\beta}_S) + |S| \log n,$$
which for Gaussian linear regression reduce (up to constants) to $n \log(\mathrm{RSS}_S / n) + 2|S|$ and $n \log(\mathrm{RSS}_S / n) + |S| \log n$, where $\mathrm{RSS}_S$ is the residual sum of squares of the submodel. Here, if we include more features, the fit term decreases while the penalty term increases.

We want to choose a proper index set $S$ that minimizes AIC (or BIC). Each subset can be encoded as a binary vector in $\{0, 1\}^p$, i.e. a vertex of the unit cube; for instance,
$$(0, 1, 0, 0, 1)$$
is an index set of 5 features where the second and fifth feature are included. In this way, minimizing AIC is the same as searching for the optimal vertex on the cube. For large $p$, visiting all $2^p$ vertices is too expensive, so greedy strategies are used instead.
Forward selection:

- Start with the trivial null model $M_0$ (intercept only).
- Add a feature by searching, over all single-feature additions to $M_0$, for the model that has the smallest AIC/BIC. Call this best model $M_1$.
- Continue by finding a feature whose addition to $M_1$ gives the model with the smallest AIC/BIC. Call this model $M_2$.
- Keep on doing this until we reach the full model.
- Choose the model in our sequence $M_0, M_1, \dots, M_p$ with the smallest AIC/BIC.
Backward selection: start at the full model and remove one variable at a time (each time choosing the removal with the smallest AIC/BIC) until reaching the null model, then choose the best model in the resulting sequence.
Best subset selection needs to fit all $2^p$ candidate models, which is of exponential order in $p$ and quickly becomes infeasible; forward and backward selection fit only on the order of $p^2$ models.
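A minimal sketch of forward selection with the Gaussian-regression AIC on synthetic data, where only the first two of five features carry signal (the data-generating numbers are arbitrary):

```python
import numpy as np

# A minimal sketch of forward selection with AIC(S) = n * log(RSS_S / n) + 2 * |S|
# (constants dropped) for Gaussian linear regression on synthetic data.
rng = np.random.default_rng(6)
n, p = 120, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=n)   # only features 0, 1 matter

def aic(subset):
    # Fit OLS (with intercept) on the columns in `subset` and return the AIC.
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = np.sum((y - Xs @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (len(subset) + 1)

selected, remaining = [], list(range(p))
path = [(tuple(selected), aic(selected))]          # start from the null model
while remaining:
    # Add the single feature whose inclusion gives the smallest AIC.
    best_j = min(remaining, key=lambda j: aic(selected + [j]))
    selected.append(best_j)
    remaining.remove(best_j)
    path.append((tuple(selected), aic(selected)))

best_model = min(path, key=lambda m: m[1])         # best model along the path
print(path)
print("chosen features:", best_model[0])           # typically includes 0 and 1
```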