
Week 3

Recap on Singular Value Decomposition

Orthogonal Matrix

Let $V \in \mathbb{R}^{p \times k}$ be given by

$$V = [v_1 \; \cdots \; v_k].$$

Then $V$ has orthonormal columns if

$$v_i^T v_j = \begin{cases} 1, & \text{if } i = j,\\ 0, & \text{if } i \neq j. \end{cases}$$

For such $V$, we have $V^T V = I_k$. Note that it is often NOT the case that $V V^T = I_p$.

A square matrix $V \in \mathbb{R}^{k \times k}$ with orthonormal columns is called an orthogonal matrix. Note that $V^{-1} = V^T$.

Definition Thin SVD

For a matrix $A \in \mathbb{R}^{p \times q}$ ($p \geq q$), the thin SVD of $A$ is a representation of the form

$$A = V D U^T, \qquad V \in \mathbb{R}^{p \times q}, \; D \in \mathbb{R}^{q \times q}, \; U \in \mathbb{R}^{q \times q},$$

where $V$ and $U$ have orthonormal columns and $D$ is diagonal.

The column vectors of $V$ and $U$ are respectively called the left and right singular vectors. The singular values of $A$ are the diagonal elements of $D$.

Remarkable result: Every matrix $A \in \mathbb{R}^{p \times q}$ has an SVD.
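
As a quick numerical check, here is a minimal NumPy sketch (the matrix $A$ below is an arbitrary example, not from the notes); note that `np.linalg.svd` returns the factors in the order $V$, $\operatorname{diag}(D)$, $U^T$ relative to the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                  # arbitrary A in R^{5x3}, p >= q

# Thin SVD: V is 5x3, d holds the singular values, Ut is U^T (3x3)
V, d, Ut = np.linalg.svd(A, full_matrices=False)
D = np.diag(d)

print(np.allclose(A, V @ D @ Ut))            # A = V D U^T
print(np.allclose(V.T @ V, np.eye(3)))       # V has orthonormal columns
print(np.allclose(Ut @ Ut.T, np.eye(3)))     # U is orthogonal
```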

Spectral Decomposition

A spectral decomposition of a symmetric matrix $A \in \mathbb{R}^{p \times p}$ is a representation of $A$ as

$$A = V D V^T, \qquad V \in \mathbb{R}^{p \times p}, \; D \in \mathbb{R}^{p \times p},$$

where $V = [v_1 \; v_2 \; \cdots \; v_p]$ is orthogonal and $D = \operatorname{diag}\{d_1, d_2, \ldots, d_p\}$ is diagonal.

The columns of $V$ are the eigenvectors of $A$ and $d_i$ is the associated eigenvalue:

$$A v_i = V D V^T v_i = d_i v_i.$$

Remarkable result (spectral theorem): Every symmetric matrix $A \in \mathbb{R}^{p \times p}$ has a spectral decomposition.
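
Similarly, a minimal sketch checking the spectral decomposition with `np.linalg.eigh` (the symmetric matrix below is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(4, 4))
A = (B + B.T) / 2                                # an arbitrary symmetric matrix

d, V = np.linalg.eigh(A)                         # eigenvalues d, orthogonal V
print(np.allclose(A, V @ np.diag(d) @ V.T))      # A = V D V^T
print(np.allclose(A @ V[:, 0], d[0] * V[:, 0]))  # A v_i = d_i v_i
```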

Recap on Multivariate Statistics

Notations

Let $X$ be a $k$-dimensional random vector (a special case of a random matrix),

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_k \end{bmatrix},$$

or a $p \times q$ random matrix

$$X = \begin{bmatrix} X_{11} & \cdots & X_{1q} \\ \vdots & \ddots & \vdots \\ X_{p1} & \cdots & X_{pq} \end{bmatrix},$$

where each $X_i$ or $X_{ij}$ is a random variable. $E(X)$ is defined componentwise, i.e.

$$E(X) = \begin{bmatrix} E(X_{11}) & \cdots & E(X_{1q}) \\ \vdots & \ddots & \vdots \\ E(X_{p1}) & \cdots & E(X_{pq}) \end{bmatrix},$$

and $E(X)$ is linear: for constant matrices $A \in \mathbb{R}^{d \times p}$, $B \in \mathbb{R}^{q \times q}$, $C \in \mathbb{R}^{d \times q}$,

$$E(AXB + C) = A\,E(X)\,B + C.$$

For a random vector $X \in \mathbb{R}^p$, its covariance matrix is defined as the $p \times p$ symmetric matrix $\operatorname{Cov}(X)$, given by

$$[\operatorname{Cov}(X)]_{ij} = \operatorname{Cov}(X_i, X_j).$$

Similar to the covariance of a random variable, we have

$$\operatorname{Cov}(X) = E\big((X - E(X))(X - E(X))^T\big) = E(XX^T) - E(X)E(X)^T.$$

From the above, for a constant matrix $A$ we have

$$\operatorname{Cov}(AX) = A \operatorname{Cov}(X) A^T.$$
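
A minimal simulation sketch of the identity $\operatorname{Cov}(AX) = A\operatorname{Cov}(X)A^T$ (the data and the matrix $A$ are arbitrary examples; the identity also holds exactly for sample covariances, which is what the check below exploits):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 10_000, 3, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # n samples of a correlated X in R^p
A = rng.normal(size=(d, p))                              # arbitrary constant matrix A

cov_X = np.cov(X, rowvar=False)                # sample estimate of Cov(X)
cov_AX = np.cov(X @ A.T, rowvar=False)         # sample estimate of Cov(AX)

print(np.allclose(cov_AX, A @ cov_X @ A.T))    # matches A Cov(X) A^T
```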

Multivariate Normal Distribution

The multivariate normal distribution of a $k$-dimensional random vector $X = (X_1, \ldots, X_k)^T$ is written as

$$X \sim \mathcal{N}(\mu, \Sigma) \quad \text{or} \quad X \sim \mathcal{N}_k(\mu, \Sigma),$$

with $k$-dimensional mean vector

$$\mu = E(X) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_k) \end{bmatrix}$$

and $k \times k$ covariance matrix

$$\Sigma_{ij} = E\big((X_i - \mu_i)(X_j - \mu_j)\big) = \operatorname{Cov}(X_i, X_j), \qquad i, j = 1, \ldots, k.$$

$\Sigma$ is assumed to be positive definite (i.e. non-degenerate) and therefore $\Sigma^{-1}$ is also positive definite. In this case, the density of $X$ is given by

$$p(z) = \frac{1}{\sqrt{(2\pi)^k \det(\Sigma)}} \exp\!\left(-\tfrac{1}{2}(z - \mu)^T \Sigma^{-1} (z - \mu)\right).$$

Fact: for a full-rank matrix $A \in \mathbb{R}^{q \times k}$ ($q \leq k$) and $b \in \mathbb{R}^q$, we have

$$AX + b \sim \mathcal{N}_q\big(A\mu + b, \; A \Sigma A^T\big).$$
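
A minimal sketch comparing the density formula above with `scipy.stats.multivariate_normal` (the values of $\mu$, $\Sigma$, and $z$ are arbitrary examples):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
z = np.array([0.5, -1.0])

# Density evaluated directly from the formula above
k = len(mu)
diff = z - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
         / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))

print(np.isclose(manual, multivariate_normal(mean=mu, cov=Sigma).pdf(z)))
```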

Linear Regression

The basic idea of linear regression is to assume that

$$y \approx \beta_0 + \sum_{i=1}^p \beta_i x_i,$$

where $x = [x_1, x_2, \ldots, x_p]^T$ (for now we assume $x \in \mathbb{R}^p$). More precisely, we make an assumption about $p(y \mid x)$ as follows:

$$y = \beta_0 + \sum_{i=1}^p \beta_i x_i + \varepsilon,$$

where $\varepsilon$ is noise with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and $x$ is independent of the noise. Thus, we have

$$y \mid x \sim \mathcal{N}\!\left(\beta_0 + \sum_{i=1}^p \beta_i x_i, \; \sigma^2\right).$$

We know that (under squared error loss) the oracle predictor is

$$E(y \mid x) = \beta_0 + \sum_{i=1}^p \beta_i x_i = f(x).$$

The goal is to find $\beta_0, \ldots, \beta_p$ and thus $f$.

Rewrite Training Data in Matrix Form

Assume the training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ were generated i.i.d. according to the assumption about $p(y \mid x)$. To simplify notation, define

$$Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon^{(1)} \\ \vdots \\ \varepsilon^{(n)} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_p \end{bmatrix},$$

and the design matrix $X \in \mathbb{R}^{n \times (p+1)}$ given by

$$X = \begin{bmatrix} 1 & (x^{(1)})^T \\ 1 & (x^{(2)})^T \\ \vdots & \vdots \\ 1 & (x^{(n)})^T \end{bmatrix}.$$

Then the training data can be written as

$$Y = X\beta + \varepsilon.$$

Now we want to use the training data to estimate $\beta_0, \ldots, \beta_p$ and thus $f$.

Recap on Likelihood Function

The likelihood function $L(\theta \mid x)$ quantifies the plausibility of $\theta$ given the data set $x = (x_1, \ldots, x_n)^T$.

More precisely, assume we have $n$ samples $(x_1, \ldots, x_n)$ observed from a distribution $p_\theta(x)$ with an unknown parameter $\theta$, and our goal is to estimate $\theta$ using the observed data.

We view the observed samples as realizations of random variables $X_1, X_2, \ldots, X_n$, which have a joint density function $p(X_1, \ldots, X_n \mid \theta)$. Given $X_1 = x_1, \ldots, X_n = x_n$, we may consider the probability of this event being observed, which is the likelihood function (a function of $\theta$) defined by

$$L(\theta) = L(\theta \mid x_1, \ldots, x_n) = p(X_1 = x_1, \ldots, X_n = x_n \mid \theta).$$

Note that the likelihood function is NOT a probability density function in $\theta$. It measures the support provided by the data for each possible value of the parameter. If we compare the likelihood function at two parameter points and find that

$$L(\theta_1 \mid x) > L(\theta_2 \mid x),$$

then the sample we actually observed is more likely to have occurred if $\theta = \theta_1$ than if $\theta = \theta_2$. This can be interpreted as $\theta_1$ being a more plausible value for $\theta$ than $\theta_2$. Therefore, one approach to estimating $\theta$ is to choose the value of $\theta$ that gives the highest likelihood among all possible values.

With the assumption that the samples are drawn i.i.d., the joint density is a product of marginal densities, i.e.

$$p(X_1, \ldots, X_n \mid \theta) = \prod_{i=1}^n p_\theta(X_i),$$

and taking the log transforms this product into a sum, which is usually easier to maximize analytically. Thus, we often work with the log-likelihood rather than the likelihood.
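
As a small illustration (not from the notes), consider i.i.d. samples from a $\mathcal{N}(\theta, 1)$ model: the log-likelihood is a sum over samples, and its maximizer over a grid is close to the sample mean, which is the exact MLE for this model.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=50)        # i.i.d. samples from N(theta, 1), theta = 2

def log_likelihood(theta, x):
    # log L(theta) = sum_i log p_theta(x_i) for the N(theta, 1) model
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

grid = np.linspace(0, 4, 401)
theta_hat = grid[np.argmax([log_likelihood(t, x) for t in grid])]
print(theta_hat, x.mean())                         # grid maximizer vs. sample mean (the MLE)
```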

Estimate the Coefficients βi

Here we find an estimate of $\beta$ by maximum likelihood estimation.

Since $X$ is independent of the noise, we have

$$Y \mid X \sim \mathcal{N}_n(X\beta, \; \sigma^2 I_n),$$

and thus the (log-)likelihood function is given by the density of the above multivariate normal distribution.

By maximizing the likelihood function, we can estimate $\beta$:

$$\hat{\beta} = \arg\max_\beta \ln p(Y \mid X) = \arg\max_\beta \left(-\frac{1}{2\sigma^2}(Y - X\beta)^T(Y - X\beta)\right) = \arg\min_\beta \|Y - X\beta\|^2,$$

which is a least squares problem. Taking the gradient with respect to $\beta$, setting it to $0$, and solving for $\beta$ gives

$$\hat{\beta} = (X^T X)^{-1} X^T Y,$$

where we assume $X$ has full column rank and thus $X^T X$ is invertible.
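
A minimal NumPy sketch of this closed-form estimate on simulated data (the true coefficients and noise level below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
beta_true = np.array([1.0, 2.0, -1.0, 0.5])                  # (beta_0, ..., beta_p)

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed form beta_hat = (X^T X)^{-1} X^T Y; solve() is preferred to an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)                                              # close to beta_true
```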

Furthermore, we can find the distribution of $\hat{\beta}$ conditional on the training features $X$:

$$E(\hat{\beta} \mid X) = E\big((X^T X)^{-1} X^T Y \mid X\big) = \beta,$$

which indicates that $\hat{\beta}$ is unbiased;

$$\operatorname{Cov}(\hat{\beta} \mid X) = \operatorname{Cov}\big((X^T X)^{-1} X^T Y \mid X\big) = (X^T X)^{-1} X^T \operatorname{Cov}(Y \mid X)\, X \big((X^T X)^{-1}\big)^T = \sigma^2 (X^T X)^{-1};$$

and therefore,

$$\hat{\beta} \mid X \sim \mathcal{N}_{p+1}\big(\beta, \; \sigma^2 (X^T X)^{-1}\big).$$

Evaluate OLS Prediction Estimate

The above estimator is called ordinary least squares (OLS) in statistics. Now we evaluate it when receiving a new data pair $(x, y)$. We make our prediction by

$$\hat{f}(x) = \tilde{x}^T \hat{\beta}, \qquad \tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}.$$

Consider the bias of $\hat{f}$ (conditional on $X, x$):

$$E(\hat{f}(x) \mid X, x) = \tilde{x}^T E(\hat{\beta} \mid X, x) = \tilde{x}^T \beta.$$

We know that $E(y \mid x) = \tilde{x}^T \beta$, and thus the OLS prediction is unbiased.

Now, we consider the variance:

$$\operatorname{Var}(\hat{f}(x) \mid X, x) = \tilde{x}^T \operatorname{Var}(\hat{\beta} \mid X, x)\, \tilde{x} = \sigma^2 \tilde{x}^T (X^T X)^{-1} \tilde{x}.$$

Roughly speaking, if $x$ is similar to the training data features, the variance will be small; otherwise, the variance will be large.

Interval Estimate

We want to make a prediction interval for $y$ of the form

$$\big[\,C_{\text{low}}(Y, X, x),\; C_{\text{high}}(Y, X, x)\,\big].$$

We want $y$ to be contained in this interval with probability at least $1 - \alpha$ (usually $\alpha = 0.05$ or $\alpha = 0.01$), i.e.

$$\Pr\big(y \in [C_{\text{low}}, C_{\text{high}}] \mid X, x\big) \geq 1 - \alpha.$$

Recall the assumption that $y \mid x \sim \mathcal{N}(\tilde{x}^T \beta, \sigma^2)$. Since

$$E(y - \tilde{x}^T \hat{\beta} \mid X, x) = 0$$

and $\operatorname{Cov}(y, \tilde{x}^T \hat{\beta} \mid X, x) = 0$ (the new observation $y$ is independent of the training data), we have

$$\operatorname{Var}(y - \tilde{x}^T \hat{\beta} \mid X, x) = \operatorname{Var}(y \mid X, x) - 2\operatorname{Cov}(y, \tilde{x}^T \hat{\beta} \mid X, x) + \operatorname{Var}(\tilde{x}^T \hat{\beta} \mid X, x) = \sigma^2 + \sigma^2 \tilde{x}^T (X^T X)^{-1} \tilde{x},$$

and $y - \tilde{x}^T \hat{\beta}$ is still normally distributed:

$$y - \tilde{x}^T \hat{\beta} \mid X, x \sim \mathcal{N}\big(0, \; \sigma^2 (1 + \tilde{x}^T (X^T X)^{-1} \tilde{x})\big).$$

To obtain the interval estimate, we also need to handle $\sigma^2$, as it is unknown. Recall the assumption that the noise is normally distributed. We can estimate $\sigma^2$ by the scaled residual sum of squares, i.e.

$$\hat{\sigma}^2 = \frac{\|Y - X\hat{\beta}\|^2}{n - p}.$$

Then

$$\frac{y - \tilde{x}^T \hat{\beta}}{\sqrt{\hat{\sigma}^2 \big(1 + \tilde{x}^T (X^T X)^{-1} \tilde{x}\big)}} \;\Bigg|\; X, x$$

follows a $t$-distribution with $n - p$ degrees of freedom. Let $t_{n-p,\,\alpha/2}$ and $t_{n-p,\,1-\alpha/2}$ be the $\alpha/2$ and $1 - \alpha/2$ quantiles of this distribution, and denote $\sqrt{\hat{\sigma}^2 (1 + \tilde{x}^T (X^T X)^{-1} \tilde{x})}$ by $C$. Then we have

$$\Pr\!\left(\frac{y - \tilde{x}^T \hat{\beta}}{C} \in \big[\,t_{n-p,\,\alpha/2},\; t_{n-p,\,1-\alpha/2}\,\big] \;\Big|\; X, x\right) \geq 1 - \alpha,$$

that is,

$$\Pr\big(y \in \big[\,C\,t_{n-p,\,\alpha/2} + \tilde{x}^T \hat{\beta},\; C\,t_{n-p,\,1-\alpha/2} + \tilde{x}^T \hat{\beta}\,\big] \mid X, x\big) \geq 1 - \alpha.$$
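A minimal sketch of this interval computation (the helper name `prediction_interval` is ours, not from the notes; the degrees of freedom follow the notes' $n - p$ convention):

```python
import numpy as np
from scipy.stats import t

def prediction_interval(X, Y, x_new, alpha=0.05):
    """(1 - alpha) prediction interval for y at a new feature vector x_new in R^p.
    X is the n x (p+1) design matrix with an intercept column. A minimal sketch."""
    n = X.shape[0]
    p = X.shape[1] - 1                       # number of features, as in the notes
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ Y)

    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)     # sigma_hat^2 with the notes' n - p convention

    x_tilde = np.concatenate(([1.0], np.atleast_1d(x_new)))
    C = np.sqrt(sigma2_hat * (1.0 + x_tilde @ np.linalg.solve(XtX, x_tilde)))
    q = t.ppf(1 - alpha / 2, df=n - p)       # by symmetry, t_{n-p, alpha/2} = -q
    center = x_tilde @ beta_hat
    return center - q * C, center + q * C
```

For example, with the simulated `X`, `Y` from the OLS sketch above, `prediction_interval(X, Y, np.zeros(3))` returns an interval centered near $\beta_0$.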

Examples and Discrete Features

So far we have assumed that $x$ lies in $\mathbb{R}^p$. Here we introduce discrete features and mixed features through some examples.

Example Polynomial Regression ($\mathcal{X} = \mathbb{R}$):

The model takes $x$ and transforms it into a new feature vector

$$x \mapsto (1, x, x^2, \ldots, x^d),$$

and the model becomes

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \varepsilon,$$

i.e. we use this new feature vector in a regression. For instance, assume we have 4 data pairs for training and use a degree-2 polynomial: the training inputs are

$$\begin{bmatrix} x^{(1)} \\ x^{(2)} \\ x^{(3)} \\ x^{(4)} \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 3 \\ 5 \end{bmatrix}.$$

We first map the training data to the design matrix $X$, given by

$$X = \begin{bmatrix} 1 & 2 & 4 \\ 1 & 1 & 1 \\ 1 & 3 & 9 \\ 1 & 5 & 25 \end{bmatrix}.$$

The OLS estimate is still $\hat{\beta} = (X^T X)^{-1} X^T Y$. This allows us to fit $y$ with a non-linear function of $x$.
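
A minimal sketch reproducing this design matrix with NumPy's `np.vander`:

```python
import numpy as np

x = np.array([2.0, 1.0, 3.0, 5.0])          # the four training inputs from the example
d = 2
X = np.vander(x, N=d + 1, increasing=True)  # columns: 1, x, x^2
print(X)
# [[ 1.  2.  4.]
#  [ 1.  1.  1.]
#  [ 1.  3.  9.]
#  [ 1.  5. 25.]]
```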

Example Categorical Feature:

A categorical feature could have, for instance, 3 categories:

$$x = \begin{cases} 1, & \text{if Age} < 20,\\ 2, & \text{if } 20 \leq \text{Age} < 40,\\ 3, & \text{if Age} \geq 40. \end{cases}$$

We transform categorical $x$'s via indicator functions.

The first way is the dummy variable method: by choosing a baseline class (say class '3'), we transform

$$x \mapsto \big(1, \, \mathbb{I}(x=1), \, \mathbb{I}(x=2)\big),$$

and the regression model becomes

$$y = \beta_0 + \beta_1 \mathbb{I}(x=1) + \beta_2 \mathbb{I}(x=2) + \varepsilon,$$

where $E(y \mid x=3) = \beta_0$, $E(y \mid x=1) = \beta_0 + \beta_1$, and $E(y \mid x=2) = \beta_0 + \beta_2$.

Another way is more symmetric: we use the model

$$y = \beta_0 + \beta_1 \mathbb{I}(x=1) + \beta_2 \mathbb{I}(x=2) + \beta_3 \mathbb{I}(x=3) + \varepsilon,$$

where $E(y \mid x=1) = \beta_0 + \beta_1$, $E(y \mid x=2) = \beta_0 + \beta_2$, and $E(y \mid x=3) = \beta_0 + \beta_3$. Since $\beta_0$ is redundant, it can be dropped.

Assume we have training data

$$\begin{bmatrix} x^{(1)} \\ x^{(2)} \\ x^{(3)} \\ x^{(4)} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 1 \end{bmatrix};$$

with the dummy variable method, the corresponding design matrix is

$$X = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \\ 1 & 1 & 0 \end{bmatrix}.$$
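
A minimal sketch building this dummy-variable design matrix with NumPy:

```python
import numpy as np

x = np.array([1, 2, 3, 1])                       # categorical training data from the example
X = np.column_stack([np.ones_like(x),
                     (x == 1).astype(int),       # I(x = 1)
                     (x == 2).astype(int)])      # I(x = 2); class 3 is the baseline
print(X)
# [[1 1 0]
#  [1 0 1]
#  [1 0 0]
#  [1 1 0]]
```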

Example Mixed Features

We can combine continuous and categorical features. For instance, if $x \in \{1, 2, 3\} \times \mathbb{R}$, we can use a similar mapping:

$$x \mapsto \big(1, \, \mathbb{I}(x_1=1), \, \mathbb{I}(x_1=2), \, x_2, \, x_2^2\big).$$

Assume the training data is

$$\begin{bmatrix} x^{(1)} \\ x^{(2)} \\ x^{(3)} \end{bmatrix} = \begin{bmatrix} 1 & 3.5 \\ 2 & 7 \\ 3 & 2.1 \end{bmatrix};$$

then the design matrix is given by

$$X = \begin{bmatrix} 1 & 1 & 0 & 3.5 & 12.25 \\ 1 & 0 & 1 & 7 & 49 \\ 1 & 0 & 0 & 2.1 & 4.41 \end{bmatrix}.$$

Example Interactions between Features

We can also include interactions between features. For instance, if $x \in \mathbb{R}^2$, we can include their product in the mapping:

$$x \mapsto \big(1, \, x_1, \, x_2, \, x_1 x_2, \, x_1^2, \, x_2^2\big).$$

Remark: Why don't we just use lots of features? E.g. in polynomial regression, why don't we take the degree $d$ to be large? Because of problems with overfitting! This is discussed under the bias-variance tradeoff.

Variable Selection

Assume we have many features, i.e. $X$ has many columns. Given an index set $I = (i_1, i_2, \ldots, i_k)$, where $i_1 \leq i_2 \leq \cdots \leq i_k$, define

$$X_I = \begin{bmatrix} x_{i_1} & \cdots & x_{i_k} \end{bmatrix}, \qquad \beta_I = \begin{bmatrix} \beta_{i_1} \\ \vdots \\ \beta_{i_k} \end{bmatrix}.$$

Then we can obtain a submodel with the features in $I$, given by

$$Y = X_I \beta_I + \varepsilon.$$

Because of the problem of overfitting, we may not want to use the empirical risk to evaluate a submodel (the empirical risk always decreases when adding new features). A better idea is to add a penalty for including new features:

$$\text{AIC}(I) = \frac{1}{n}\Big(\|Y - X_I \hat{\beta}_I\|^2 + 2k\hat{\sigma}^2\Big), \qquad \text{BIC}(I) = \frac{1}{n}\Big(\|Y - X_I \hat{\beta}_I\|^2 + \ln(n)\, k\hat{\sigma}^2\Big),$$

where $\hat{\beta}$ is estimated using the full model and $\hat{\sigma}^2$ is given by

$$\hat{\sigma}^2 = \frac{\|Y - X\hat{\beta}\|^2}{n - p}.$$
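
A minimal sketch of these criteria (the helper name `aic_bic` is ours; it assumes the design matrix contains an intercept column and uses the notes' $n - p$ convention for $\hat{\sigma}^2$):

```python
import numpy as np

def aic_bic(Y, X_full, idx):
    """AIC/BIC of the submodel that uses the columns of X_full listed in idx.
    A minimal sketch; X_full is assumed to contain an intercept column."""
    n = X_full.shape[0]
    # sigma^2 estimated from the full model (n - p, with p the number of non-intercept columns)
    beta_full = np.linalg.lstsq(X_full, Y, rcond=None)[0]
    sigma2 = np.sum((Y - X_full @ beta_full) ** 2) / (n - (X_full.shape[1] - 1))

    X_I = X_full[:, idx]
    k = X_I.shape[1]
    beta_I = np.linalg.lstsq(X_I, Y, rcond=None)[0]
    rss = np.sum((Y - X_I @ beta_I) ** 2)
    return (rss + 2 * k * sigma2) / n, (rss + np.log(n) * k * sigma2) / n
```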

We want to choose the index set $I$ that minimizes AIC or BIC. For simplicity, we may view $I$ as a string of length $p$ consisting of 0s and 1s, where 0 indicates the feature is not included and 1 indicates it is included: for instance,

$$(0, 1, 0, 0, 1)$$

is an index set over 5 features where the second and fifth features are included. In this way, $I$ is a point in $\{0, 1\}^p$ and we may visualize the choice of $I$ as picking a vertex of a hypercube (picture a cube for $p = 3$). Minimizing AIC is the same as searching for the optimal vertex of the cube.

For $p > 40$, exhaustive search becomes infeasible. If we enumerate the AIC for every $I$, this is called best subset selection. Instead, we can do a greedy search over the cube.

Forward selection:

  1. Start with the trivial null model $y = \beta_0 + \varepsilon$.

  2. Add a feature by searching over $i$ for the model $y = \beta_0 + \beta_i x_i + \varepsilon$ with the smallest AIC/BIC. Call this best $i$ to be $i_1$.

  3. Continue by finding a $j$ such that the model $y = \beta_0 + \beta_{i_1} x_{i_1} + \beta_j x_j + \varepsilon$ minimizes AIC/BIC. Call this $j$ to be $i_2$.

  4. Keep doing this until we reach the full model.

  5. Choose the model in the resulting sequence with the smallest AIC/BIC (a code sketch is given at the end of this section).

Backward selection: start from the full model and remove features one at a time until reaching the null model.

Best subset selection needs to fit $2^p$ linear models. On the other hand, forward (or backward) selection needs to fit

$$p + (p-1) + \cdots + 1 = \frac{p(p+1)}{2}$$

models, which is on the order of $O(p^2)$ computations.
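
Finally, a minimal sketch of forward selection (greedy over BIC), relying on the `aic_bic` helper sketched above; the function name is ours, not from the notes.

```python
import numpy as np

def forward_selection(Y, X):
    """Greedy forward selection minimizing BIC; X is the full design matrix
    with the intercept in column 0. Returns the selected column indices."""
    p = X.shape[1]
    selected = [0]                                   # null model: y = beta_0 + eps
    path = [(list(selected), aic_bic(Y, X, selected)[1])]
    while len(selected) < p:
        remaining = [j for j in range(p) if j not in selected]
        scores = [aic_bic(Y, X, selected + [j])[1] for j in remaining]
        best = int(np.argmin(scores))
        selected.append(remaining[best])
        path.append((list(selected), scores[best]))
    # pick the model along the greedy path with the smallest BIC
    return min(path, key=lambda item: item[1])[0]
```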