
FDSI-8 Predictor Selection

Intro
The process of predictor selection in regression consists of making a decision among all of the possible subsets of predictors to find that model with the best ability to generalize to the population from which the training sample was drawn.
AIC is defined as
$$\text{AIC} = -2\log L(\hat{\theta}) + 2p$$
where $L(\hat{\theta})$ is the maximized likelihood.
In the case of linear regression, when the errors are assumed to be i.i.d. normal, plugging the MLE $\hat{\sigma}^2 = \text{RSS}/n$ into the log-likelihood shows that
$$-2\log L(\hat{\theta}) = n\log(\text{RSS}) + n(\log(2\pi/n) + 1)$$
The second term does not vary across the models under consideration, so a common definition of AIC is
$$\text{AIC} = n\log(\text{RSS}) + 2p$$
where $p$ is the number of $\beta$ parameters in the model.
While this form is theoretically justified by a normality assumption on the $\epsilon_i$, it is still typical to see it used in more general situations. However, if the errors are far from normal, the validity of this form of AIC should be questioned.
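
As a quick illustration, here is a minimal sketch (not from the notes) that computes this form of AIC for an ordinary least squares fit; the helper name `aic` and the assumption that `X` holds the candidate predictors as columns are mine:

```python
import numpy as np

def aic(X, y):
    """AIC = n*log(RSS) + 2p for an OLS fit (the normal-errors form above)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares coefficients
    rss = np.sum((y - X @ beta) ** 2)             # residual sum of squares
    return n * np.log(rss) + 2 * p
```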

Stepwise Regression

In cases where the number of predictors is small, an exhaustive search over all possible models is feasible. For example, if we have $p$ candidate predictors, $2^p$ models will be compared in the exhaustive search.
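
A brute-force sketch of this exhaustive search, scoring every non-empty subset of columns with the AIC form above (the function name `best_subset` is illustrative, not from the notes):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y):
    """Score every non-empty subset of predictor columns by AIC; keep the best."""
    n, p = X.shape
    best_aic, best_cols = np.inf, None
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            Xs = X[:, list(cols)]
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            a = n * np.log(rss) + 2 * len(cols)
            if a < best_aic:
                best_aic, best_cols = a, cols
    return best_cols, best_aic
```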
An exhaustive search is not feasible when there is a large number of predictors. A classic strategy for dealing with the large number of possible models is to use a stepwise selection procedure.
Backward selection starts with the largest model under consideration and drops one predictor at each step; forward selection starts with the smallest model (e.g., intercept only) and adds one predictor at each step.
The decision of which predictor to add or drop is based on AIC: we take the step that reduces AIC the most. Once no step reduces AIC, the process stops, and we obtain the final model.
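
A minimal sketch of the backward variant (helper names are mine): it greedily drops whichever predictor lowers AIC the most, and stops when no drop helps.

```python
import numpy as np

def aic(X, y):
    """AIC = n*log(RSS) + 2p for an OLS fit (same helper as above)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return n * np.log(np.sum((y - X @ beta) ** 2)) + 2 * p

def backward_stepwise(X, y):
    """Greedy backward selection: drop one column at a time while AIC improves."""
    cols = list(range(X.shape[1]))
    best = aic(X, y)
    while len(cols) > 1:
        scores = {j: aic(X[:, [c for c in cols if c != j]], y) for j in cols}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:   # no drop reduces AIC: stop
            break
        best = scores[j_best]
        cols.remove(j_best)
    return cols, best
```

A forward version would be symmetric: start from a minimal set of columns and add the predictor that lowers AIC the most at each step.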

Cross Validation

An alternative to AIC is leave-one-out cross-validation.
In this approach, one imagines refitting the model $n$ times, each time excluding one of the $n$ observations.
At iteration $i$, observation $i$ is excluded. The fitted model is then used to predict the response for this observation, denoted $\hat{y}_{(-i)}$.
Finally, we calculate the quantity
$$\text{PRESS} = \sum_{i=1}^n \left(y_i - \hat{y}_{(-i)}\right)^2$$
where PRESS stands for Prediction Error Sum of Squares.
Note that for each $i$, $(y_i - \hat{y}_i)^2 \le (y_i - \hat{y}_{(-i)})^2$, because when observation $i$ is added into the training set, the model works to β€œget closer” to $y_i$. Therefore
$$\text{RSS} \le \text{PRESS}$$
that is, RSS β€œunderestimates” the prediction error.
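
A direct (if wasteful) sketch of PRESS computed exactly as in the definition, refitting $n$ times; the shortcut described next avoids all of this refitting. The name `press_naive` is mine:

```python
import numpy as np

def press_naive(X, y):
    """PRESS computed literally: refit with observation i held out, n times."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                    # drop observation i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2          # held-out squared error
    return total
```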
To calculate PRESS, we do not need to fit $n$ models. Let $\bold{Y}$ be the vector of the $n$ responses, and let $\hat{\bold{Y}}$ be the vector of the $n$ fitted values (from the full model).
If it is the case that $\hat{\bold{Y}} = \bold{H}\bold{Y}$ (i.e. there exists such an $\bold{H}$, not depending on $\bold{Y}$), then
$$y_i - \hat{y}_{(-i)} = \frac{\hat{\epsilon}_i}{1 - h_{ii}}$$
where $\hat{\epsilon}_i = y_i - \hat{y}_i$ is the $i$-th residual and $h_{ii}$ is the $i$-th diagonal element of $\bold{H}$.
In the case of linear regression, there exists such $\bold{H} = \bold{X}(\bold{X}^T\bold{X})^{-1}\bold{X}^T$.
Another alternative to AIC is the LASSO, which selects predictors by minimizing a penalized loss:
$$\text{Loss function} = \underbrace{\sum_{i=1}^n (y_i - \hat{y}_i)^2}_{\text{RSS}} + \lambda\sum_{j=1}^p |\beta_j|$$
where $\lambda \ge 0$ is a tuning parameter; larger values of $\lambda$ force more coefficients to exactly zero.
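
For illustration, a short sketch using scikit-learn's `Lasso` (not mentioned in the notes; its `alpha` argument plays the role of $\lambda$, and the data here are simulated):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # 10 candidate predictors
beta = np.array([3.0, -2.0] + [0.0] * 8)    # only the first two matter
y = X @ beta + rng.normal(size=100)

fit = Lasso(alpha=0.5).fit(X, y)            # alpha plays the role of lambda
print(fit.coef_)                            # many coefficients are exactly zero
```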
PRESS in Python
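
A minimal sketch of the leverage shortcut above, assuming `X` already includes any intercept column; it should agree with the naive $n$-refits version sketched earlier:

```python
import numpy as np

def press(X, y):
    """PRESS via the shortcut: sum of (residual / (1 - h_ii))^2."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
    resid = y - H @ y                      # ordinary residuals
    h = np.diag(H)                         # leverages h_ii
    return np.sum((resid / (1 - h)) ** 2)

# Example on simulated data: one pass, no refitting
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)
print(press(X, y))
```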

AIC vs. PRESS

AIC and PRESS will often give similar results for the model choice.
AIC:
  • Pros:
    • AIC is more stable: as an estimate, it is subject to less variance than PRESS
    • In general, AIC will be easier to calculate than PRESS
    • AIC requires that there be a likelihood function, but not that the observations be i.i.d.
  • Cons:
    • The theory behind AIC relies on the distributional assumption for the errors being correct
PRESS:
  • Pros:
    • The quantity being estimated by PRESS is a very natural measure of prediction performance.
  • Cons:
    • Less stable: as an estimate of expected prediction error, PRESS is subject to large variance, and it may not perform well for small sample sizes.
    • Although PRESS is simple to calculate in the case of linear regression, it may be difficult to compute for other models.
    • Cross-validation assumes that the observations are i.i.d.; for dependent data such as time series, naive cross-validation will fail.