♟️

MLI-1

Supervised Learning

Setup

In the basic prediction/classification ML setup, we observe independent draws $(X, Y)$ from a fixed distribution $P$, with $X \in \mathbb{R}^p$ and:
  • In regression, $Y \in \mathbb{R}$.
  • In classification, $Y$ takes values in a finite set of classes $\{1, \ldots, K\}$.
In the simplest setting, we see $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$. From these, we want to build a function $\hat f$ that takes feature vectors $x$ to guesses $\hat y = \hat f(x)$.

Loss Function

The loss function, $L(\hat y, y)$, represents the cost of guessing $\hat y$ when the true value is $y$. Frequently used loss functions:
  • Squared error loss, $L(\hat y, y) = (\hat y - y)^2$, for regression.
  • 0-1 loss, $L(\hat y, y) = \mathbb{1}\{\hat y \neq y\}$, for classification.
It is hard to assess the loss $L(\hat f(X), Y)$ directly, since it is a random variable. Instead, we focus on a summary of its distribution that is easy to work with. We tend to use the expected loss
$$R(\hat f) = \mathbb{E}\big[L(\hat f(X), Y)\big].$$
For regression with squared error loss, conditioning on $X = x$ and writing $c = \hat f(x)$,
$$\mathbb{E}\big[(Y - c)^2 \mid X = x\big] = \mathbb{E}[Y^2 \mid X = x] - 2c\,\mathbb{E}[Y \mid X = x] + c^2.$$
The first term does not depend on $c$. Minimizing the other two terms w.r.t. $c$ gives
$$\hat f(x) = \mathbb{E}[Y \mid X = x],$$
which is the best possible choice of $\hat f(x)$ (based on the squared error loss). That is, the best we can do, once we have seen $X = x$, is to look at the conditional distribution of $Y$ given that $X = x$, and take its mean.
However, we cannot actually compute $\mathbb{E}[Y \mid X = x]$ because we do not know the distribution $P$ (in the setup, $P$ is unknown). Instead, we try to get a good estimate of it from the data.
Bias-Variance decomposition
To make notation simpler, write $f(x) = \mathbb{E}[Y \mid X = x]$ for the regression function, and $\hat f$ for our estimate. We estimate $\hat f$ on a separate (training) data set, so $\hat f(x)$ is random too. Then, at a fixed point $x$,
$$\mathbb{E}\big[(Y - \hat f(x))^2 \mid X = x\big] = \underbrace{\operatorname{Var}(Y \mid X = x)}_{\text{irreducible error}} + \mathbb{E}\big[(f(x) - \hat f(x))^2\big].$$
Define $\operatorname{Bias}(\hat f(x)) = \mathbb{E}[\hat f(x)] - f(x)$. Besides, $\operatorname{Var}(\hat f(x)) = \mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]$. Therefore
$$\mathbb{E}\big[(f(x) - \hat f(x))^2\big] = \operatorname{Bias}(\hat f(x))^2 + \operatorname{Var}(\hat f(x)).$$
There is typically a tradeoff between the bias and the variance:
  • If your estimator is too inflexible to approximate the true mean function, it is too biased. For example, a straight line fit to a nonlinear function.
  • If your estimator adapts too much to individual training samples and overfits, it has high variance.
Some General Improvements
  • Get more data: the variance will decrease, and the bias will not change much (imagine a misspecified linear model for a complex task; more data will not fix it).
  • Make more / better features: the bias will decrease, and the variance will not change much.
  • Use a more flexible (i.e. more complex) model: the variance will increase while the bias will decrease.
  • Use a more regularized (i.e. simpler) model: the variance will decrease while the bias will increase.

Linear Regression

For machine learning purposes, we think of regression as a very simple way to approximate $f(x) = \mathbb{E}[Y \mid X = x]$. While $f$ could be any function, here we decide to approximate it as a linear function of the predictors.
General form:
$$\hat f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p, \qquad \text{or in matrix form, } \hat y = X\beta;$$
thus, $\hat y = X\beta$ lies in the column span of $X$ (a linear subspace).
Using the least squares method, $\hat\beta$ minimizes $\|y - X\beta\|_2^2$, i.e. the Euclidean distance from $X\beta$ to $y$. Since $X\beta$ is confined to the column space of $X$, this is minimized at the point closest to $y$ in the subspace, meaning that $X\hat\beta$ is the orthogonal projection of $y$ onto the column span of $X$. So $\hat y = X\hat\beta = P_X y$.
It can be proved that the orthogonal projection operator onto the column span of $X$ is
$$P_X = X(X^TX)^{-1}X^T,$$
then
$$\hat y = X(X^TX)^{-1}X^Ty,$$
and therefore the minimizing parameters are given by
$$\hat\beta = (X^TX)^{-1}X^Ty.$$
We usually also write $H = X(X^TX)^{-1}X^T$ and call this the hat matrix. Note that $y \mapsto Hy$ is a linear transformation. Recall that $H$ is symmetric ($H^T = H$) and idempotent ($H^2 = H$).
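As a quick illustration, here is a minimal NumPy sketch of these formulas (the data and variable names are mine, not from the course; in practice one would use a solver such as np.linalg.lstsq rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                      # design matrix
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# OLS coefficients: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, so that y_hat = H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# H is symmetric and idempotent (up to floating-point error)
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
assert np.allclose(y_hat, X @ beta_hat)
```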
Goals in Modeling
In statistics and machine learning, we might fit a function to approximate a target . We could have several goals:
  • Inference: ask questions about the true relationship between $X$ and $Y$.
  • Interpretability: ask questions about $\hat f$ (what does the fitted model rely on?).
  • Prediction: guess $Y$ at new inputs using $\hat f$.
Much of statistical inference focuses on the first one. This course will primarily care about the last two.
Linear Regression as a predictor
If the true $f$ is very far from linear, this model will be highly biased.
Even when the true model is exactly linear, we can still face the curse of dimensionality.
An experiment is shown below. We generate data from true linear models of increasing dimension $p$, fit the model on training data, and compute the MSE on test data (both generated from the true linear model); the test MSE is shown on the y-axis. It can be observed that as the dimension increases, even though the true model is linear, the test MSE blows up.
The reason for this "curse of high dimensionality" is potential multicollinearity as the dimension increases. When there is multicollinearity, OLS may return erroneously large coefficients, making the model's output sensitive to minor alterations in the input data, and thus increasing the variance and the MSE.
To fix this, we can leverage the tradeoff between bias and variance.
Recall that
$$\text{expected test error} = \sigma^2 + \operatorname{Bias}^2 + \operatorname{Var}.$$
Since the true model is linear, OLS has $\operatorname{Bias} = 0$; thus we can instead try to reduce the variance, obtaining a biased model but with smaller MSE.

Ridge Regression

One such example is ridge regression. Ridge regression adds a regularization (penalty) term to limit the size of the coefficients. An experiment is shown below (10 large true coefficients, 20 small true coefficients) at a fixed value of $\lambda$.
| Model | Bias² | Var | Pred. error |
| --- | --- | --- | --- |
| Linear Reg. | 0 | 0.633 | 1.633 |
| Ridge Reg. | 0.077 | 0.403 | 1.48 |
Ridge regression is like least squares but shrinks the estimated coefficients towards zero. Given a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, the ridge regression coefficients are defined as
$$\hat\beta^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2.$$
For ridge regression, the solution for $\hat\beta$ is
$$\hat\beta^{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty;$$
the penalty term $\lambda I$ can only make the matrix being inverted better conditioned, i.e. more stable numerically.
Here $\lambda \geq 0$ is a tuning parameter, which controls the strength of the penalty term:
  • When $\lambda = 0$, ridge regression reduces to OLS.
  • When $\lambda \to \infty$, $\hat\beta^{\text{ridge}} \to 0$.
  • For $\lambda$ in between, we are balancing two ideas: fitting a linear model of $y$ on $X$, and shrinking the coefficients.
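A minimal NumPy sketch of the closed-form solution above (variable names are mine; in practice one might use a library implementation such as sklearn.linear_model.Ridge):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.

    Assumes the columns of X are centered (no intercept) and scaled.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lam = 0 recovers the OLS fit; larger lam shrinks all coefficients toward 0.
```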
(Figure: an example of the tradeoff between bias and variance for different values of $\lambda$.)
Important details: centering
When including an intercept term in the regression, we usually leave this coefficient unpenalized. Hence the ridge regression with intercept solves
$$\min_{\beta_0, \beta} \; \|y - \beta_0 \mathbb{1} - X\beta\|_2^2 + \lambda \|\beta\|_2^2.$$
If we center the columns of $X$ (and $y$), then the intercept estimate ends up just being $\hat\beta_0 = \bar y$, so we usually just assume that $y$ and the columns of $X$ have been centered and don't include an intercept.
Important details: scaling
The penalty term is unfair if the predictor variables are not on the same scale. Therefore, if we know that the variables are not measured in the same units, we typically scale the columns of $X$ (to have sample variance 1), and then we perform ridge regression.
We can standardize the features to achieve both centering and scaling:
$$\tilde x_{ij} = \frac{x_{ij} - \bar x_j}{\hat\sigma_j},$$
where $\bar x_j$ and $\hat\sigma_j$ are the sample mean and standard deviation of the $j$-th column. As for the test data, we still use the training-set $\bar x_j$ and $\hat\sigma_j$ to standardize.
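A small sketch of that convention (the names X_train / X_test and the stand-in data are mine for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # stand-in training features
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 4))    # stand-in test features

# Statistics are computed on the training data only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_std = (X_train - mu) / sigma
# ...and reused, unchanged, to transform the test data.
X_test_std = (X_test - mu) / sigma
```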
Ridge regression coefficients
We see that squishing the coefficients toward zero gives:
  • A decrease in variance
  • A decrease in MSE
  • An increase in bias.
An example is shown below. The red paths correspond to the true nonzero coefficients; the gray paths correspond to true zeros. The vertical dashed line marks the value of $\lambda$ above which ridge regression's MSE starts losing to that of linear regression.
An important thing to notice is that the gray coefficient paths are not exactly zero; they are shrunken, but still nonzero. Also, ridge regression is quite harsh on the large, important coefficients. Ridge regression therefore leaves us with a large (dense) linear model, which is not very interpretable when we have many variables.

Lasso Regression

The Lasso estimate is defined as
$$\hat\beta^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$
The squared L2 penalty of ridge regression, $\|\beta\|_2^2$, has been replaced by an L1 penalty, $\|\beta\|_1 = \sum_{j=1}^{p}|\beta_j|$. The differences between the two penalty terms are shown below: while the L2 penalty tends to make coefficients small but not exactly zero, the L1 penalty can make coefficients exactly zero.
The tuning parameter $\lambda$ controls the strength of the penalty:
  • If $\lambda = 0$: the Lasso reduces to OLS.
  • If $\lambda$ is large enough: $\hat\beta^{\text{lasso}} = 0$.
For $\lambda$ in between these two extremes, we are balancing two ideas: fitting a linear model of $y$ on $X$, and shrinking the coefficients. But the nature of the L1 penalty causes some coefficients to be shrunken to exactly zero, which leads to variable selection.
Note that, in the Lasso, we get exact zeros. Furthermore, the large coefficients are not impacted nearly as much.
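A quick sketch of this sparsity using scikit-learn (the data, penalty levels, and variable names here are illustrative assumptions, not from the notes):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]          # only 5 truly nonzero coefficients
y = X @ beta_true + rng.normal(scale=1.0, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso nonzeros:", np.sum(lasso.coef_ != 0))   # typically close to 5 (exact zeros)
print("ridge nonzeros:", np.sum(ridge.coef_ != 0))   # typically all 30 (shrunken, not zero)
```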
Advantages of Sparsity
We can actually select variables through the Lasso and obtain a sparser model. The advantages of sparsity include:
  • Interpretability: we can understand what the model relies on for prediction (understanding $\hat f$).
  • We might gain some insight into the underlying data (helping to understand the true relationship between $X$ and $Y$).
  • If we're building a predictive score, we can measure fewer things in the future (simpler to apply later).
Side note: here is another difference between the ML perspective and the statistics perspective. In statistics, we usually start with some hypothesis and then use the data to test it. In ML, however, we dive into tons of data and try to find a hypothesis. Thus, terminology like "p-values" and "confidence intervals" is not mentioned much here.
Important details
We do not penalize the intercept, if one is included. Again, the columns of $X$ are often centered and the intercept then omitted.
The penalty term is not fair if the predictor variables are not on the same scale. Hence, we often scale the columns of $X$ to have variance 1.
Note that the Lasso penalty is convex, but NOT strictly convex.
  • For a convex function, all its local minima are also global minima
  • For a strictly convex function, it has at most one global minimizer
Warning about Lasso
The Lasso can be used as a powerful baseline, especially for reducing dimension. However, the Lasso still selects a linear model: it will never do better than the best linear model using our variables could do.
The Lasso cannot capture non-linearity or interactions automatically (although we can manually construct non-linear or interaction terms).

Elastic Net and Coordinate Descent

The coefficients of the Elastic Net satisfy
$$\hat\beta^{\text{enet}} = \arg\min_{\beta} \; \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\Big(\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\Big),$$
assuming $\sum_{i=1}^{n} x_{ij} = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1$ for each $j$ (i.e. we have already done the standardization for each feature).
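Since the section title mentions coordinate descent, here is a minimal NumPy sketch of the soft-thresholding coordinate-descent update for this glmnet-style objective; the exact parameterization and the function names are my assumptions, not necessarily the course's formulation.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha=0.5, n_iter=100):
    """Coordinate descent for
        (1/(2n))||y - X b||^2 + lam * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2).

    Assumes each column of X is centered and scaled so that (1/n) sum_i x_ij^2 = 1,
    and y is centered (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove every feature's contribution except feature j's
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            # soft-threshold handles the L1 part; the denominator handles the L2 part
            beta[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
    return beta
```

Setting alpha=1 recovers a Lasso solver, alpha=0 a (scaled) ridge solver.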

K-fold Cross-validation

CV Rules

We split the training pairs $(x_i, y_i)$, $i = 1, \ldots, n$, into $K$ parts or "folds" (commonly $K = 5$ or $K = 10$).
K-fold cross-validation considers training on all but the $k$-th part, and then validating on the $k$-th part, iterating over $k = 1, \ldots, K$. The errors on all these validation sets are averaged. These averaged errors are used to select a final model (e.g. the value of $\lambda$ for ridge or the Lasso). The model is then refit on the entire data set at this value of $\lambda$.
Standard errors for cross-validation
For each tuning parameter value $\lambda$, we can estimate the standard deviation of the cross-validation error $\mathrm{CV}(\lambda)$:
  • First, average the validation errors in each fold,
    $$e_k(\lambda) = \frac{1}{n_k}\sum_{i \in \text{fold } k}\big(y_i - \hat f_\lambda^{-k}(x_i)\big)^2,$$
    where $n_k$ is the number of points in the $k$-th fold and $\hat f_\lambda^{-k}$ is fit without that fold.
  • Then, compute the sample standard deviation of $e_1(\lambda), \ldots, e_K(\lambda)$,
    $$\mathrm{SD}(\lambda) = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\big(e_k(\lambda) - \bar e(\lambda)\big)^2}, \qquad \bar e(\lambda) = \mathrm{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^{K} e_k(\lambda).$$
  • Finally, we estimate the standard deviation of $\mathrm{CV}(\lambda)$, i.e. the standard error of $\mathrm{CV}(\lambda)$, by $\mathrm{SE}(\lambda) = \mathrm{SD}(\lambda)/\sqrt{K}$.
We can plot the cross-validation error curve with standard error bars, as below, and choose the optimal $\lambda$.
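A minimal NumPy sketch of this computation for ridge over a grid of $\lambda$ values (the fold assignment, the grid, and the ridge helper are illustrative assumptions):

```python
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_ridge(X, y, lams, K=10, seed=0):
    """Return arrays (CV(lam), SE(lam)) over a grid of lambda values."""
    n = X.shape[0]
    # assign each observation to one of K roughly equal folds
    folds = np.empty(n, dtype=int)
    for k, ids in enumerate(np.array_split(np.random.default_rng(seed).permutation(n), K)):
        folds[ids] = k
    cv, se = [], []
    for lam in lams:
        fold_errs = []
        for k in range(K):
            train, val = folds != k, folds == k
            beta = ridge(X[train], y[train], lam)              # fit without fold k
            fold_errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
        fold_errs = np.array(fold_errs)
        cv.append(fold_errs.mean())                            # CV(lam)
        se.append(fold_errs.std(ddof=1) / np.sqrt(K))          # SE(lam)
    return np.array(cv), np.array(se)
```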
notion image
The one standard error rule
The one standard error rule is an alternative rule for choosing the value of the tuning parameter, as opposed to minimizing the cross-validation error.
We first find the usual minimizer $\hat\lambda = \arg\min_\lambda \mathrm{CV}(\lambda)$, and then move in the direction of increasing regularization as far as we can, while keeping the CV error within one standard error of $\mathrm{CV}(\hat\lambda)$, i.e. we choose the most regularized $\lambda$ such that
$$\mathrm{CV}(\lambda) \leq \mathrm{CV}(\hat\lambda) + \mathrm{SE}(\hat\lambda).$$
The idea is to obtain a simpler model (more regularized) while keeping all else equal (up to one standard error).
Note that simply choosing the $\lambda$ minimizing the CV error tends to over-select variables, complicating our interpretation. The one standard error rule is a step in the direction of interpretability.
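Continuing the cv_ridge sketch above, the rule can be implemented in a few lines (assuming lams is sorted from least to most regularized):

```python
import numpy as np

def one_se_rule(lams, cv, se):
    """Pick the most regularized lambda whose CV error is within one SE of the minimum."""
    best = np.argmin(cv)
    threshold = cv[best] + se[best]
    eligible = np.where(cv <= threshold)[0]
    return lams[eligible.max()]        # largest index = most regularization
```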

Choice of K

The larger the choice of $K$, the more iterations are needed, and so the more computation. Aside from computation, the choice of $K$ affects the quality of our cross-validation error estimates for model assessment. For example:
  • $K = 2$: split-sample cross-validation. Our CV error estimates are going to be biased upwards, because we are only training on half the data each time.
  • $K = n$: leave-one-out cross-validation. Our CV estimates are going to be heavily positively correlated across folds, which can lead to high variance.
Choosing $K = 5$ or $K = 10$ seems to generally be a good idea. In each iteration, we train on a fraction of about $(K-1)/K$ of the total training set, so this reduces the bias.
There is also less overlap between the training sets across iterations, so the fold errors $e_1(\lambda), \ldots, e_K(\lambda)$ in
$$\mathrm{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^{K} e_k(\lambda)$$
are not as correlated, and the error estimate has a smaller variance.
Leave-one-out shortcut
A privilege of linear models: for leave-one-out cross-validation,
$$\mathrm{CV}(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat f_\lambda^{-i}(x_i)\big)^2,$$
where $\hat f_\lambda^{-i}$ is the estimator fit to all but the $i$-th training pair $(x_i, y_i)$.
Suppose that our tuning parameter is $\lambda$ and $\hat f_\lambda$ is the ridge regression estimator. Then it turns out that
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat f_\lambda^{-i}(x_i)\big)^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat f_\lambda(x_i)}{1 - S_{ii}}\right)^2,$$
where $S = X(X^TX + \lambda I)^{-1}X^T$ is the smoother ("hat") matrix, so only a single fit on the full data is needed.
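A small NumPy check of this identity for ridge (the synthetic data and helper names are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 40, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Brute-force LOOCV: refit n times, each time leaving one pair out.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    beta_i = ridge_fit(X[mask], y[mask], lam)
    errs.append((y[i] - X[i] @ beta_i) ** 2)
loocv_brute = np.mean(errs)

# Shortcut: one fit on the full data plus the diagonal of S = X (X^T X + lam I)^{-1} X^T.
beta = ridge_fit(X, y, lam)
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
loocv_shortcut = np.mean(((y - X @ beta) / (1 - np.diag(S))) ** 2)

assert np.isclose(loocv_brute, loocv_shortcut)
```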

Nonparametric regression

Here are two overall approaches to nonlinear regression.
  • Kernel-based methods: we use a version of local averaging to approximate the function. Includes kernel smoothing and local linear regression. Closely related to $k$-nearest neighbors.
  • Spline-based methods: we fit a flexible piecewise polynomial to the data in a way that compromises between goodness-of-fit and smoothness.

K-Nearest Neighbors

Recall that we want $\hat f(x)$ to be a good approximation of $f(x) = \mathbb{E}[Y \mid X = x]$. We do not actually have to assume a strong form for the function; instead we can use
$$\hat f(x) = \frac{1}{k}\sum_{i \in N_k(x)} y_i,$$
where $N_k(x)$ is the set of the $k$ nearest neighbors of $x$ among the training points.
KNN can perform well (in low dimensions or with nice distances). However, it can be bothersome how jagged the fit is. We can switch to a smoother notion of neighborhood to get a smoother estimate,
$$\hat f(x) = \sum_{i=1}^{n} w_i(x)\, y_i,$$
where
$$w_i(x) = \frac{K_h(x, x_i)}{\sum_{j=1}^{n} K_h(x, x_j)}$$
and, for example with a Gaussian kernel,
$$K_h(x, x') = \exp\!\left(-\frac{(x - x')^2}{2h^2}\right),$$
where the bandwidth $h$ is a hyper-parameter. The idea is to give higher weights to the data closer to $x$.
If $x$ has more than one dimension, we should define a kernel that measures higher-dimensional similarity.
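A minimal 1-D NumPy sketch of both estimators (the Gaussian kernel choice, toy data, and variable names are illustrative assumptions):

```python
import numpy as np

def knn_regress(x0, x, y, k=5):
    """Average the y-values of the k training points nearest to x0."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

def kernel_regress(x0, x, y, h=0.3):
    """Weighted average of y with Gaussian kernel weights of bandwidth h."""
    w = np.exp(-((x - x0) ** 2) / (2 * h ** 2))
    return np.sum(w * y) / np.sum(w)

# toy data: noisy sine curve
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2 * np.pi, size=100))
y = np.sin(x) + rng.normal(scale=0.2, size=100)

grid = np.linspace(0, 2 * np.pi, 200)
knn_fit = np.array([knn_regress(g, x, y) for g in grid])        # jagged step-like fit
kernel_fit = np.array([kernel_regress(g, x, y) for g in grid])  # smoother fit
```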

Smoothing Splines

Instead of kernel smoothing, suppose we try to build a flexible function that
  • approximates our data well
  • is smooth
We could try to encode this as an optimization problem,
$$\hat f = \arg\min_{f} \; \sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \int f''(t)^2\, dt,$$
over twice-differentiable functions $f$.
  • As $\lambda$ gets larger: the fitted function is smoother.
  • If $\lambda \to \infty$: the function has zero second derivative, and the fit reduces to the least-squares line (OLS).
  • As $\lambda$ gets smaller: $\hat f$ gets wigglier.
  • As $\lambda \to 0$: the function simply interpolates the data (connects every pair of adjacent points).
The minimizer is always a cubic spline.
Assume we have a sequence of knots $t_1 < t_2 < \cdots < t_m$. A cubic spline fits a cubic polynomial $P_j$ between each successive pair of knots $t_j$ and $t_{j+1}$, connecting to both of them, for $j = 1, \ldots, m-1$. So there will be $m-1$ polynomials, with the first polynomial starting at $t_1$ and the last polynomial ending at $t_m$. Each pair of successive polynomials must have equal values, first derivatives, and second derivatives at their shared knot, that is,
$$P_j(t_{j+1}) = P_{j+1}(t_{j+1}), \qquad P_j'(t_{j+1}) = P_{j+1}'(t_{j+1}), \qquad P_j''(t_{j+1}) = P_{j+1}''(t_{j+1}).$$
This can only be achieved if polynomials of degree 3 (i.e. cubic polynomials) or higher are used.
The $m$-knot cubic spline can be written in the truncated power basis as
$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{j=1}^{m} \theta_j (x - t_j)_+^3,$$
where
$$(x - t_j)_+^3 = \max(x - t_j,\, 0)^3;$$
then fitting the spline is the same as OLS w.r.t. $y$ and the transformed predictors $x, x^2, x^3, (x - t_1)_+^3, \ldots, (x - t_m)_+^3$.
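A minimal NumPy sketch of fitting a cubic spline by OLS on the truncated power basis (the knot placement and toy data are illustrative assumptions; this is a regression spline rather than the penalized smoothing-spline fit itself):

```python
import numpy as np

def truncated_power_basis(x, knots):
    """Columns: 1, x, x^2, x^3, then (x - t)_+^3 for each knot t."""
    cols = [np.ones_like(x), x, x ** 2, x ** 3]
    cols += [np.maximum(x - t, 0.0) ** 3 for t in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, size=200))
y = np.sin(4 * np.pi * x) + rng.normal(scale=0.3, size=200)

knots = np.quantile(x, [0.2, 0.4, 0.6, 0.8])       # a few interior knots
B = truncated_power_basis(x, knots)
coef, *_ = np.linalg.lstsq(B, y, rcond=None)        # plain OLS on the basis
y_hat = B @ coef
```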

Additive Models

Suppose $x \in \mathbb{R}^p$. Our linear models can be represented as
$$f(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$
These models have two properties:
  • Additivity: each feature contributes additively to the prediction.
  • Linearity: the value of each feature enters linearly.
It turns out that linearity can be relaxed without sacrificing much efficiency.
Additive models relax linearity while keeping additivity:
$$\hat f(x) = \beta_0 + \sum_{j=1}^{p} \hat f_j(x_j).$$
We can choose the component functions $\hat f_j$ in many different ways, and we can even vary the choice from feature to feature. In practice, fitting a smoothing spline for each $\hat f_j$ is most common. This is what LinearGAM does in pygam.
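A minimal sketch of fitting such a model with pygam, assuming the package is installed (the data and the choice of one spline term per feature are illustrative):

```python
import numpy as np
from pygam import LinearGAM, s

rng = np.random.default_rng(6)
n = 500
X = rng.uniform(-1, 1, size=(n, 3))
# additive but nonlinear ground truth
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.2, size=n)

# one smoothing-spline term s(j) per feature
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)
y_hat = gam.predict(X)
```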

MSE for Additive Models

You need a lot more data to achieve a similar MSE with more complex models than you do with simpler models. Roughly, under standard smoothness assumptions:
  • To estimate a function with a linear model, we expect $\mathrm{MSE} \approx p/n$.
  • To estimate a 1-dimensional function nonparametrically, we expect $\mathrm{MSE} \approx n^{-4/5}$.
  • To estimate a $p$-dimensional function nonparametrically, we expect $\mathrm{MSE} \approx n^{-4/(4+p)}$.
To estimate a $p$-dimensional additive model, however, we effectively estimate each 1-dimensional piece separately, giving $\mathrm{MSE} \approx p \cdot n^{-4/5}$, which scales much better in $p$.
