Week 1¶
Requirement¶
Statistics: Multiple Regression, Bias-Variance Decomposition
Calculus: Iterated Integration
Supervised Learning¶
Suppose we have data pairs
\[
\mathcal{D}_n = \{(x_1,y_1),\ldots,(x_n,y_n)\} \overset{\text{iid}}{\sim} P,
\]
where \(P\) is some distribution on \(\mathcal{X}\times\mathcal{Y}\).
Our Goal¶
Find a function \(f:\mathcal{X}\rightarrow \mathcal{Y}\) such that \(f(x)\approx y\) when \((x,y)\sim P\), where
- \(x\) is called: predictor variable / covariates / independent var. / inputs / features
- \(y\) is called: response var. / output var. / dependent var.
Examples:¶
- \(\mathcal{Y} = \mathbb{R}\): Regression Problem
- \(\mathcal{Y}\) is a finite set: Classification Problem (e.g., image classification)
Learning Algorithm¶
To obtain an \(f\), we feed the training data \(\mathcal{D}_n\) to a learning algorithm
\[
\phi_n : (\mathcal{X}\times\mathcal{Y})^n \rightarrow \mathcal{Y}^{\mathcal{X}},
\]
where \(\mathcal{Y}^\mathcal{X}\) is the set of all functions from \(\mathcal{X}\) to \(\mathcal{Y}\). Therefore,
\[
\hat{f} = \phi_n(\mathcal{D}_n)
\]
is a random function(1) determined by the data and the learning algorithm \(\phi_n\).
- Note that a random function is not a single deterministic function. More precisely, it is a function of an arbitrary argument \(t\) (defined on the set \(T\) of its values, and taking numerical values or, more generally, values in a vector space) whose values are defined in terms of a certain experiment and may vary with the outcome of this experiment according to a given probability distribution.
Definition Loss function¶
A loss function \(L: \mathcal{Y} \times \mathcal{Y} \rightarrow [0,+\infty)\) quantifies the error of \(f\): how close \(f(x)\) is to \(y\) is gauged by \(L(f(x),y)\). Examples: the squared error loss
\[
L(\hat{y},y) = (\hat{y}-y)^2,
\]
and the \(0\text{-}1\) loss
\[
L(\hat{y},y) = I(\hat{y}\neq y),
\]
where \(I\) is the indicator function.
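As a small illustration (a sketch, not part of the original notes; the function names are made up and NumPy arrays of predictions and labels are assumed), the two losses can be written as:

```python
import numpy as np

def squared_error_loss(y_hat, y):
    """Squared error loss L(y_hat, y) = (y_hat - y)^2, elementwise."""
    return (y_hat - y) ** 2

def zero_one_loss(y_hat, y):
    """0-1 loss L(y_hat, y) = I(y_hat != y), elementwise."""
    return (y_hat != y).astype(float)

# Average loss over a few prediction/label pairs.
y_hat = np.array([1.0, 2.0, 0.0])
y = np.array([1.5, 2.0, 1.0])
print(squared_error_loss(y_hat, y).mean())  # 0.4166...
print(zero_one_loss(y_hat, y).mean())       # 0.6666...
```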
Definition Risk function¶
- Risk function (a.k.a. prediction error): \(R(f,P)=E_{(x,y)\sim P}\big[L(f(x),y)\big]\).
- Oracle prediction error (a.k.a. Bayes risk): \(R^*(P)=\inf_{f} R(f,P)\), where the infimum is over all functions \(f:\mathcal{X}\rightarrow\mathcal{Y}\).
- Oracle predictor: \(f^*\) satisfies \(R(f^*,P)=R^*(P)\).
Compute the Oracle Predictor¶
Compute \(R^*(P)\) for squared error loss. By the tower property,
\[
R(f,P)=E\big[(f(x)-y)^2\big]=E\Big[E\big[(f(x)-y)^2 \mid x\big]\Big].
\]
For fixed \(x\), we minimize over the value \(z=f(x)\); that is, it suffices to set
\[
f(\tilde{x}) = \operatorname*{argmin}_{z}\, E\big[(z-y)^2 \mid X=\tilde{x}\big].
\]
It is equivalent to minimizing
\[
E\big[(z-y)^2 \mid x\big] = E\big[(y-E(y\mid x))^2 \mid x\big] + \big(z-E(y\mid x)\big)^2.
\]
For the above equation, the first term is independent of \(z\), and
\[
\big(z-E(y\mid x)\big)^2 \geq 0, \quad \text{with equality iff } z=E(y\mid x),
\]
for fixed \(x\). Then we have
\[
\operatorname*{argmin}_{z}\, E\big[(z-y)^2 \mid X=x\big] = E(y\mid X=x).
\]
Therefore, the oracle predictor is given by \(f^*(\tilde{x})=E(y\mid X=\tilde{x})\). Similar arguments can be made for other loss functions.
Additionally, our computation shows that making assumptions about the allowable \(f\), i.e. assuming \(f\) lies in some set of functions \(\mathcal{F}\), is somewhat equivalent to making assumptions about \(P\).
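As a quick numerical sanity check (a sketch, not from the original notes, using a made-up distribution for which \(E(y\mid x)\) is known in closed form), the conditional mean gives a smaller Monte Carlo estimate of \(R(f,P)\) under squared error loss than other candidate predictors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution P: x ~ Uniform(0, 1), y = sin(2*pi*x) + Gaussian noise,
# so the oracle predictor is E(y | x) = sin(2*pi*x).
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

def risk(f):
    """Monte Carlo estimate of R(f, P) = E[(f(x) - y)^2]."""
    return np.mean((f(x) - y) ** 2)

print(risk(lambda x: np.sin(2 * np.pi * x)))  # oracle: about 0.09 (the noise variance)
print(risk(lambda x: np.zeros_like(x)))       # constant predictor: larger
print(risk(lambda x: 1.0 - 2.0 * x))          # an arbitrary linear guess: larger
```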
Making Assumptions¶
Ideally we would like an \(f\) such that \(R(f,P)\) is small for all distributions \(P\). However, this is not possible by the No Free Lunch Theorem. Roughly, this says that for any \(f\), there exists a \(P\) such that \(R(f,P)\) is large. In classification, this says that there exists a \(P\) for which \(f\) is no better than random guessing(1).
- Random guessing means we flip a coin and predict \(y\) according to whether the coin lands heads or tails.
Our solution is to make assumptions about \(P\):
Assume \(P\in\mathcal{P}\), where \(\mathcal{P}\) is some subset of probability distributions. This suggests assumptions that we can make about the function class of predictors that we use. For example, \(E(y\mid x)\) is optimal for squared error loss, so given \(\mathcal{P}\) we may want to restrict \(\mathcal{F}\) to functions that have the form \(E(y\mid x)\) for some \(P\in \mathcal{P}\).
Complexity (bias-variance) tradeoff:¶
- A more complex function class \(\mathcal{F}\) -- low bias (i.e. better able to approximate the oracle \(E(y\mid x)\)).
- A larger class \(\mathcal{F}\) -- harder to find the best \(f\) from finite data (higher variance); see the sketch after this list.
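A sketch of this tradeoff (not from the original notes; it reuses the toy distribution above, and the degrees, sample sizes, and repetition counts are arbitrary choices for illustration): fit polynomials of increasing degree by least squares on small training sets and compare their average risk.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n pairs from the toy distribution used above."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_test, y_test = sample(50_000)

for degree in [1, 3, 9, 15]:
    # Fit a polynomial of this degree on a small training set, then
    # estimate its risk on a large independent test sample; repeat and average.
    risks = []
    for _ in range(200):
        x_tr, y_tr = sample(30)
        coef = np.polyfit(x_tr, y_tr, degree)
        risks.append(np.mean((np.polyval(coef, x_test) - y_test) ** 2))
    print(degree, np.mean(risks))

# Typically, degree 1 underfits (high bias), very high degrees overfit 30 points
# (high variance), and a moderate degree gives the smallest average risk.
```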
Error Decomposition¶
For \(f\in \mathcal{F}\), since \(\mathcal{F}\) may not be large enough, \(R^*(P)\) and \(\inf_{g\in \mathcal{F}} R(g,P)\) may not be equal. We have the following decomposition of the risk function \(R(f,P)\):
\[
R(f,P) = {\color{red} \Big(R(f,P) - \inf_{g\in\mathcal{F}} R(g,P)\Big)} + {\color{green} \Big(\inf_{g\in\mathcal{F}} R(g,P) - R^*(P)\Big)} + {\color{blue} R^*(P)}.
\]
- \({\color{red} \text{Red}}\): Estimation error, which is non-negative.
- \({\color{green} \text{Green}}\): Approximation error, which is non-negative.
- \({\color{blue} \text{Blue}}\): the inherent error, which is the best we can do!
Empirical Risk Minimization (ERM)¶
The idea is to find an approximation of \(R(f,P)\) and minimize it over \(f\) (also assuming that \(f\) lies in some specified class of functions \(\mathcal{F}\)). The ERM predictor \(\hat{f}\) is defined as
\[
\hat{f} \in \operatorname*{argmin}_{f\in\mathcal{F}}\, \hat{R}_n(f), \qquad \text{where } \hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} L\big(f(x_i),y_i\big),
\]
which is called the average loss or empirical risk. Note that
- The ERM predictor \(\hat{f}\) depends on the choice of function class \(\mathcal{F}\), such as linear functions or complicated neural networks.
- Finding the \(\operatorname{argmin}\) is not always easy, and hence there are various optimization algorithms approximating the \(\operatorname{argmin}\).
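A minimal sketch of ERM (an illustration, not from the original notes; it assumes the toy distribution above and takes \(\mathcal{F}\) to be linear functions, for which the squared-error \(\operatorname{argmin}\) has a closed form via ordinary least squares):

```python
import numpy as np

rng = np.random.default_rng(2)

# Training data D_n from the toy distribution used above.
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

# Function class F: linear functions f(x) = a*x + b.
# With squared error loss, the empirical risk is quadratic in (a, b),
# so the argmin is given by ordinary least squares.
X = np.column_stack([x, np.ones_like(x)])
a, b = np.linalg.lstsq(X, y, rcond=None)[0]

f_hat = lambda x_new: a * x_new + b            # the ERM predictor over F
empirical_risk = np.mean((f_hat(x) - y) ** 2)  # the empirical risk of f_hat
print(a, b, empirical_risk)
```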
Gauge Learning Algorithms¶
Given a learning algorithm \(\phi_n\), how can we gauge the performance of \(\phi_n\)? We can look at \(R(\hat{f},P)\); that is, we view the training data \(\mathcal{D}_n\) as random, and then \(R(\hat{f},P)\)(1) is a random variable. The expected risk (a.k.a. expected prediction error) is given by
\[
E_{\mathcal{D}_n}\big[R(\hat{f},P)\big] = E_{\mathcal{D}_n}\Big[E_{(x,y)\sim P}\big[L(\hat{f}(x),y)\big]\Big].
\]
- Here \(\hat{f}\) is dependent on \(\mathcal{D}_n\), which should be written as \(\hat{f}_{\mathcal{D}_n}\), but for simplicity, we still denote it as \(\hat{f}\).
There are two sources of randomness:
- The training data \(\mathcal{D}_n\) that determines \(\hat{f}\).
- The pair \((x,y)\) where we predict \(y\) using \(\hat{f}(x)\).
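A sketch of estimating the expected prediction error by Monte Carlo (not from the original notes; it reuses the toy distribution and linear ERM from above), averaging over both sources of randomness: fresh training sets that determine \(\hat{f}\), and fresh test pairs.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    """One draw of n pairs from the toy distribution used above."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

def fit_linear(x, y):
    """ERM over linear functions f(x) = a*x + b with squared error loss."""
    a, b = np.linalg.lstsq(np.column_stack([x, np.ones_like(x)]), y, rcond=None)[0]
    return lambda x_new: a * x_new + b

# Average R(f_hat, P) over many independent training sets D_n,
# with each R(f_hat, P) itself estimated on a large test sample.
x_test, y_test = sample(100_000)
risks = []
for _ in range(500):
    f_hat = fit_linear(*sample(50))
    risks.append(np.mean((f_hat(x_test) - y_test) ** 2))
print(np.mean(risks))  # Monte Carlo estimate of the expected prediction error
```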