
FDSI-8 Predictor Selection

Intro
The process of predictor selection in regression consists of choosing, among all possible subsets of predictors, the model with the best ability to generalize to the population from which the training sample was drawn.
AIC is defined as

$$\mathrm{AIC} = -2\log \hat{L} + 2p,$$

where $\hat{L}$ is the maximized likelihood.
In the case of linear regression, when the errors are assumed to be i.i.d. normal, we can show that

$$\mathrm{AIC} = n\log\!\left(\frac{\mathrm{RSS}}{n}\right) + n\bigl(\log(2\pi) + 1\bigr) + 2p.$$

The second term does not vary across the models under consideration, so a common definition of AIC is

$$\mathrm{AIC} = n\log\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p,$$

where $p$ is the number of parameters in the model.
While this form is theoretically justified by a normality assumption on the errors $\varepsilon_i$, it is still typical to see it used in more general situations. However, if the errors are far from normal, the validity of this form for AIC should be questioned.
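To make this concrete, here is a minimal sketch (hypothetical simulated data, not from the notes) that fits a model with statsmodels and computes the common form $n\log(\mathrm{RSS}/n) + 2p$ directly from the residuals; it agrees with the AIC reported by statsmodels up to the constant $n(\log(2\pi) + 1)$, which statsmodels keeps.

import numpy as np
import statsmodels.api as sm

# Hypothetical simulated data: n observations, two candidate predictors.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)

X_with_ones = sm.add_constant(X)
model = sm.OLS(y, X_with_ones).fit()

rss = model.ssr                   # residual sum of squares
p = X_with_ones.shape[1]          # number of regression coefficients (incl. intercept)
aic_common = n * np.log(rss / n) + 2 * p

# statsmodels keeps the constant n*(log(2*pi) + 1), which is the same for every
# model fit to these n observations, so the two printed values agree.
print(aic_common, model.aic - n * (np.log(2 * np.pi) + 1))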

Stepwise Regression

In cases where the number of predictors is small, an exhaustive search over all possible models is possible. For example, if we have $p$ candidate predictors, $2^p$ models will be compared in the exhaustive search.
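As a small illustration (hypothetical simulated data and variable names, not from the notes), an exhaustive best-subset search scored by AIC might look like this:

import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: p = 4 candidate predictors, so 2^4 = 16 possible models.
rng = np.random.default_rng(1)
n = 100
Xfull = pd.DataFrame(rng.normal(size=(n, 4)), columns=['x1', 'x2', 'x3', 'x4'])
Y = 1.0 + 2.0 * Xfull['x1'] - Xfull['x2'] + rng.normal(size=n)

# Score every non-empty subset of predictors by AIC and keep the best one.
best_aic, best_subset = float('inf'), None
for k in range(1, len(Xfull.columns) + 1):
    for subset in itertools.combinations(Xfull.columns, k):
        X_design = sm.add_constant(Xfull[list(subset)])
        aic = sm.OLS(Y, X_design).fit().aic
        if aic < best_aic:
            best_aic, best_subset = aic, subset

print(best_subset, best_aic)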
An exhaustive search is not feasible when there is a large number of predictors. A classic strategy for dealing with the large number of possible models is to use a stepwise selection procedure.
A forward selection procedure starts with the smallest model under consideration (often the intercept-only model) and adds one predictor at each step; a backward selection procedure starts with the largest model and drops one predictor at each step.
The decision of which predictor to add or drop is based on AIC: we take the step that reduces AIC the most. Once no step reduces AIC, the process stops, and we obtain the final model. A forward-selection implementation in Python is shown below.
import itertools
import statsmodels.api as sm

# Forward selection by AIC. Xfull is a DataFrame of candidate predictors
# and Y is the response (both assumed to be defined already).
features_chosen = []
remaining_features = list(Xfull.columns)
best_AIC = float('inf')

for i in range(len(remaining_features)):
    add_feature = False
    # Try adding each remaining predictor to the current model.
    for combo in itertools.combinations(remaining_features, 1):
        X_with_ones = sm.add_constant(Xfull[list(combo) + features_chosen])
        model = sm.OLS(Y, X_with_ones).fit()
        this_AIC = model.aic
        if this_AIC < best_AIC:
            add_feature = True
            best_AIC = this_AIC
            best_feature = combo[0]
    if not add_feature:
        break  # no remaining predictor lowers AIC, so stop
    features_chosen.append(best_feature)
    remaining_features.remove(best_feature)
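With data like the hypothetical Xfull and Y from the exhaustive-search sketch above, this loop would usually pick out x1 and x2 first and then stop; because AIC penalizes extra parameters only mildly, it may occasionally add a spurious predictor as well.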

Cross Validation

An alternative to AIC is leave-one-out cross-validation.
In this approach, one imagines refitting the model $n$ times, each time excluding one of the $n$ observations.
At iteration $i$, observation $i$ is excluded. The fitted model is then used to predict the response for this observation, called $\hat{y}_{(i)}$.
Finally, we calculate the quantity

$$\mathrm{PRESS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_{(i)}\right)^2,$$

where PRESS stands for Prediction Error Sum of Squares.
Note that for each $i$, $\left(y_i - \hat{y}_i\right)^2 \le \left(y_i - \hat{y}_{(i)}\right)^2$, where $\hat{y}_i$ is the fitted value from the model fit to all $n$ observations, because when observation $i$ is added into the training set, the model works to β€œget closer” to $y_i$. Therefore

$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \le \mathrm{PRESS},$$

that is, RSS β€œunderestimates” the prediction error.
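A brute-force check of this inequality, on hypothetical simulated data (not from the notes), simply refits the model $n$ times; the shortcut formula below makes this loop unnecessary for linear regression.

import numpy as np
import statsmodels.api as sm

# Hypothetical simulated data.
rng = np.random.default_rng(2)
n = 50
X = sm.add_constant(rng.normal(size=(n, 3)))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

# Naive leave-one-out: refit the model n times, each time predicting the held-out point.
press = 0.0
for i in range(n):
    keep = np.arange(n) != i
    fit_i = sm.OLS(y[keep], X[keep]).fit()
    press += (y[i] - fit_i.predict(X[i:i + 1])[0]) ** 2

rss = sm.OLS(y, X).fit().ssr
print(rss, press)  # RSS <= PRESS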
To calculate PRESS, we do not need to fit $n$ models. Let $\mathbf{y}$ be a vector consisting of the $n$ responses, and let $\hat{\mathbf{y}}$ be a vector consisting of the $n$ fitted values (from the full model).
If it is the case that $\hat{\mathbf{y}} = H\mathbf{y}$ (i.e. there exists such a matrix $H$), then

$$\mathrm{PRESS} = \sum_{i=1}^{n} \left(\frac{y_i - \hat{y}_i}{1 - h_{ii}}\right)^2,$$

where $h_{ii}$ is the $i$-th diagonal element of $H$.
In the case of linear regression, there exists such an $H$: the hat matrix $H = X\left(X^\top X\right)^{-1} X^\top$.
Another alternative to AIC is the LASSO, which selects predictors by shrinking some coefficients exactly to zero.
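A minimal sketch of LASSO-based selection (hypothetical simulated data; using scikit-learn's LassoCV, which is not mentioned in the notes, to choose the penalty by cross-validation):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical data: five candidate predictors, only the first two are relevant.
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

# Standardize predictors so the penalty treats them symmetrically.
X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_std, y)

# Predictors with nonzero coefficients are the ones "selected" by the LASSO.
selected = np.flatnonzero(lasso.coef_)
print(selected)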
PRESS in Python
import statsmodels.stats.outliers_influence as outliers_influence

# Model is assumed to be a fitted OLS results object, e.g. sm.OLS(Y, X).fit().
levs = outliers_influence.OLSInfluence(Model).hat_matrix_diag  # leverages h_ii
PRESS = ((Model.resid / (1 - levs)) ** 2).sum()

AIC vs. PRESS

AIC and PRESS will often give similar results for the model choice.
AIC:
  • Pros:
    • AIC is more stable
    • In general, AIC will be eaiser to calculate than PRESS
    • AIC requires there be a likelihood function, but not that the observations be i.i.d.
  • Cons:
    • The theory behind AIC is based on the assumption that the distributional assumption for the errors is correct
PRESS:
  • Pros:
    • The quantity being estimated by PRESS is a very natural measure of prediction performance.
  • Cons:
    • Not stable, the estimation of expected prediction is subject to large variance. PRESS may not perform well for small sample sizes.
    • Although PRESS is simple to calculate in the case of linear regression, but may be difficult for other models
    • The cross-validation assumes that observations are i.i.d. However, for example, cross-validation with time series data will fail.
