
Week 5

Classification

Let \(\mathcal{Y}\) be a finite set of classes, and for now assume \(y^{(i)} \in \{0, 1\}\). We want to use \(x\) to predict \(y\). We assume

\[ y \mid x \sim \mathrm{Bernoulli}(g(x)), \]

where \(g\) is an arbitrary function determining the probability of \(y = 1\) for a given \(x\). Since \(g\) is arbitrary, this Bernoulli model is completely general. See the figure below for an example of this setting in the plane (\(p = 2\)).

(Figure: Classification.)

We could try to just do linear regression: under squared error loss, \(E(y \mid x)\) is optimal, and \(E(y \mid x) = g(x)\). However, \(g(x) \in [0, 1]\) as it outputs a probability, but \(x^T \beta\) takes values in \(\mathbb{R}\).

Logistic Regression

Instead of modeling \(y(x) = x^T \beta\), we will assume that the probability \(g(x)\) is given by

\[ g(x) = \sigma(x^T \beta), \]

where \(\sigma\) is the logistic function given by

\[ \sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}. \]

This is called the logistic regression model.

(Figure: Logistic Function.)

For \(p = 2\), the logistic function is given by \(\sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2)\). The following figure shows an example of \(\sigma(x + y)\).

(Figure: Logistic Function with \(p=2\).)
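As a quick numerical sketch of \(\sigma\) (in Python with NumPy; the function name and the sample inputs are arbitrary choices, not part of the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# sigma maps any real number into (0, 1)
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # approx [0.0067, 0.2689, 0.5, 0.7311, 0.9933]
```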

Motivation and Interpretation

On one hand, \(\sigma\) is differentiable and easy to compute with. On the other hand, logistic regression assumes the log odds are a linear function of \(x\). To further explain this, we need to introduce the following definitions.

If \(p\) is a probability, the quantity \(p / (1 - p)\) is called the (corresponding) odds, and can take on any value between \(0\) and \(\infty\). Values of the odds close to \(0\) and \(\infty\) indicate very low and very high probabilities of \(y = 1\), respectively.

The log odds (or logit) of the probability is the logarithm of the odds, i.e.

\[ \mathrm{logit}(p) = \log \frac{p}{1 - p} \in (-\infty, +\infty). \]

Hence, in logistic regression, we assume that the log odds depend linearly on \(x\).
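To make this explicit, substituting \(g(x) = \sigma(x^T \beta)\) into the logit recovers the linear form:

\[ \mathrm{logit}(g(x)) = \log \frac{\sigma(x^T \beta)}{1 - \sigma(x^T \beta)} = \log \frac{e^{x^T \beta} / (1 + e^{x^T \beta})}{1 / (1 + e^{x^T \beta})} = x^T \beta. \]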

Another, more insightful, motivation connects the logistic function to the exponential family from the perspective of generative models.

Procedure of Logistic Regression

How to use logistic regression for prediction:

  1. Get training data \((x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})\).

  2. Use the training data to select a \(\hat{\beta}\), i.e. fit our model to the training data.

  3. Given a new \(x\) for which we want to predict the class of \(y\): compute the estimated probability \(\sigma(\hat{\beta}^T x) = \dfrac{e^{\hat{\beta}^T x}}{1 + e^{\hat{\beta}^T x}}\).

    • If \(\sigma(\hat{\beta}^T x) > 0.5\), predict that \(y\) is in class 1.

    • If \(\sigma(\hat{\beta}^T x) < 0.5\), predict that \(y\) is in class 0.

This procedure is designed to minimize the misclassification error. However, some errors are "worse" than others. For example, for a spam email filter, misclassifying a spam email as a normal one is unlikely to cause a big issue, but the opposite mistake can be much more costly. To account for this, we can raise the 0.5 threshold to a larger number (such as 0.95) when deciding whether to predict that an email is spam.
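A minimal sketch of this prediction rule in Python, assuming a fitted coefficient vector is already available; `predict_class`, `beta_hat`, and `x_new` are hypothetical names and values, and `x` is taken to include the intercept term:

```python
import numpy as np

def predict_class(x, beta_hat, threshold=0.5):
    """Predict class 1 if the estimated probability sigma(beta_hat^T x)
    exceeds the threshold, otherwise class 0."""
    prob = 1.0 / (1.0 + np.exp(-(x @ beta_hat)))  # sigma(beta_hat^T x)
    return int(prob > threshold), prob

# Hypothetical fitted coefficients and a new observation (leading 1 = intercept)
beta_hat = np.array([-1.0, 2.0, 0.5])
x_new = np.array([1.0, 0.8, -0.3])

print(predict_class(x_new, beta_hat))                   # default 0.5 cutoff
print(predict_class(x_new, beta_hat, threshold=0.95))   # stricter cutoff for the "costly" class
```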

Multinomial Regression

Now let \(y^{(i)}\) take on values from multiple classes, say \(K\) classes, and assume

\[ y \mid x \sim \mathrm{Multinomial}_K(p_1(x), p_2(x), \dots, p_K(x)). \]

For \(p\) features, we assume \(x = (1, x_1, \dots, x_p)\). Then the multinomial regression model is given by

\[ p_1(x) = \frac{\exp((\beta^{(1)})^T x)}{1 + \sum_{j=1}^{K-1} \exp((\beta^{(j)})^T x)}, \quad \dots, \quad p_{K-1}(x) = \frac{\exp((\beta^{(K-1)})^T x)}{1 + \sum_{j=1}^{K-1} \exp((\beta^{(j)})^T x)}, \quad p_K(x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp((\beta^{(j)})^T x)}, \]

where we regard the \(K\)-th class as the baseline class. It is easy to see that \(p_1, \dots, p_K\) are valid probabilities: each is nonnegative and they sum to 1. A function with the form of \(p_1, \dots, p_{K-1}\) is called a softmax function.
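A small sketch of this computation in Python, assuming the coefficient vectors \(\beta^{(1)}, \dots, \beta^{(K-1)}\) are stacked as the rows of a matrix `B` and that `x` includes the intercept; the numbers are illustrative only:

```python
import numpy as np

def multinomial_probs(x, B):
    """Class probabilities p_1(x), ..., p_K(x) with class K as the baseline.

    B has shape (K-1, p+1): row j holds beta^(j)."""
    scores = np.exp(B @ x)                 # exp((beta^(j))^T x), j = 1..K-1
    denom = 1.0 + scores.sum()             # 1 + sum_j exp((beta^(j))^T x)
    return np.append(scores, 1.0) / denom  # last entry is the baseline class K

# Hypothetical coefficients for K = 3 classes, p = 2 features (plus intercept)
B = np.array([[0.5, 1.0, -1.0],
              [-0.2, 0.3, 0.8]])
x = np.array([1.0, 0.5, 0.5])

p = multinomial_probs(x, B)
print(p, p.sum())  # probabilities are nonnegative and sum to 1
```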

Calculate the Estimator

We use maximum likelihood to estimate \(\beta\) and obtain \(\hat{\beta}\). We take \(K = 2\) as an example.

Assume \(y \mid x \sim \mathrm{Bernoulli}(\pi)\), i.e. \(\Pr(y) = \pi^y (1 - \pi)^{1 - y}\). To apply logistic regression, we assume \(y^{(i)} \mid x^{(i)} \sim \mathrm{Bernoulli}(\sigma((x^{(i)})^T \beta))\). Then the log-likelihood function is

\[ L(\beta) = \sum_{i=1}^n \ln\big(\Pr(y^{(i)} \mid x^{(i)}, \beta)\big) = \sum_{i=1}^n y^{(i)} \ln\big(\sigma((x^{(i)})^T \beta)\big) + \sum_{i=1}^n (1 - y^{(i)}) \ln\big(1 - \sigma((x^{(i)})^T \beta)\big). \]

Then the MLE is defined as

\[ \hat{\beta}_{\mathrm{MLE}} = \arg\max_{\beta} L(\beta) = \arg\min_{\beta} \big(-L(\beta)\big). \]

Unfortunately, unlike linear regression, the maximizer of \(L(\beta)\) has no closed-form expression. Therefore, we apply an iterative optimization method, such as gradient descent (GD) or the Newton-Raphson (NR) algorithm.
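As an illustration, here is a minimal gradient-ascent sketch for maximizing \(L(\beta)\) (equivalently, gradient descent on \(-L(\beta)\)), using the fact that \(\nabla L(\beta) = \sum_i \big(y^{(i)} - \sigma((x^{(i)})^T \beta)\big)\, x^{(i)}\). The function name, step size, iteration count, and synthetic data are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, step=0.1, n_iters=1000):
    """Estimate beta by gradient ascent on the log-likelihood L(beta).

    X has shape (n, p+1) with a leading column of ones; y is 0/1."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        probs = sigmoid(X @ beta)       # sigma((x^(i))^T beta) for each i
        grad = X.T @ (y - probs)        # gradient of L(beta)
        beta += step * grad / len(y)    # ascent step (scaled by n for stability)
    return beta

# Tiny synthetic example: one feature plus an intercept, true beta = (-0.5, 2.0)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])
y = (rng.uniform(size=200) < sigmoid(-0.5 + 2.0 * x1)).astype(float)

print(fit_logistic_gd(X, y))  # should be roughly [-0.5, 2.0]
```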

Brief comparison of GD and NR:

  • NR: usually takes more intelligent steps than GD.

  • NR: there is no tuning required, while in GD we have to tune the step size.

  • Problem with NR: we have to invert the Hessian matrix at each step. Inverting such a matrix has complexity \(O(p^3)\). For GD we only have to compute the gradient, which has complexity \(O(p)\).
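For comparison, a minimal sketch of the Newton-Raphson update for the same two-class problem, using the gradient \(X^T(y - \mathrm{probs})\) and the Hessian of \(-L(\beta)\), \(X^T W X\) with \(W = \mathrm{diag}\big(\sigma(1-\sigma)\big)\). The function name, step count, and synthetic data are arbitrary; solving the \(p \times p\) system each step is the \(O(p^3)\) cost noted above:

```python
import numpy as np

def newton_raphson_fit(X, y, n_steps=8):
    """Fit logistic regression by Newton-Raphson steps on L(beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - probs)                 # gradient of L(beta)
        W = probs * (1.0 - probs)
        hessian = X.T @ (W[:, None] * X)         # p x p; solving this system is the O(p^3) cost
        beta += np.linalg.solve(hessian, grad)   # Newton step
    return beta

# Same kind of synthetic data as in the gradient-descent sketch above
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1])
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(0.5 - 2.0 * x1))).astype(float)

print(newton_raphson_fit(X, y))  # a few steps already gets close to [-0.5, 2.0]
```

A handful of Newton steps typically matches what gradient descent needs hundreds of iterations to reach, which is the "more intelligent steps" trade-off described in the list above.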