
MLI-3

Overview: Tree-based methods

Tree-based methods for predicting a response from a feature vector divide the feature space into rectangles, and then fit a very simple model in each rectangle. This works both when the response $Y$ is discrete and when it is continuous, i.e., both for classification and regression.
[Figure] Comparison between linear regression (left column) and a decision tree (right column). Green and yellow areas stand for different classes.

The CART Algorithm

The CART algorithm chooses the splits in a top-down fashion: it first chooses the variable and split at the root, then the splits at the second level, and so on.
At each stage we choose the split that achieves the biggest drop in misclassification error (a greedy strategy). In terms of tree depth, the strategy is to grow a large tree and then prune it at the end.
Recall that in a region $R_m$ containing $N_m$ points, the proportion of points in class $k$ is
$$\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} 1\{y_i = k\}.$$
The CART algorithm begins by considering a split on variable $j$ at split point $s$, which defines the regions
$$R_1(j, s) = \{x : x_j \le s\}, \qquad R_2(j, s) = \{x : x_j > s\}.$$
We then greedily choose $(j, s)$ (which feature and what threshold) by minimizing the misclassification error
$$\min_{j,\, s}\ \Big[\ \#\{i : x_i \in R_1(j,s),\ y_i \ne c_1\} + \#\{i : x_i \in R_2(j,s),\ y_i \ne c_2\}\ \Big],$$
where $c_1$ is the most common class in $R_1(j,s)$ and $c_2$ is the most common class in $R_2(j,s)$.
Then, repeat this within each of the newly defined regions $R_1, R_2$: again consider all variables and split points for each region, greedily choosing the biggest improvement in misclassification error.
Continuing on in this manner, we get a big tree $T_0$ whose leaves define regions $R_1, \dots, R_m$. We then prune this tree, meaning that we collapse some of its leaves into their parent nodes.
For any tree $T$, let $|T|$ denote its number of leaves. Define the cost-complexity criterion
$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} 1\{y_i \ne \hat{y}_{R_m}\} + \alpha |T|,$$
where $\hat{y}_{R_m}$ is the most common class in leaf $R_m$. We seek the subtree of $T_0$ that minimizes $C_\alpha(T)$. It turns out that this can be done by pruning the weakest leaf one at a time. Note that $\alpha$ is a tuning parameter, and a larger $\alpha$ yields a smaller tree. CART picks $\alpha$ by 5- or 10-fold cross-validation.
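To make the greedy search concrete, here is a minimal sketch of a single CART split chosen by misclassification error (no recursion and no pruning). The data, function names, and the exhaustive threshold scan are illustrative assumptions, not code from the notes.

```python
# A minimal sketch of one greedy CART split by misclassification error.
import numpy as np

def misclassification(y):
    """Fraction of points not in the most common class."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    return 1.0 - counts.max() / len(y)

def best_split(X, y):
    """Return (feature j, threshold s, error) minimizing weighted misclassification."""
    n, p = X.shape
    best = (None, None, np.inf)
    for j in range(p):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            err = (len(left) * misclassification(left)
                   + len(right) * misclassification(right)) / n
            if err < best[2]:
                best = (j, s, err)
    return best

X = np.array([[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))   # (0, 2.0, 0.0): split on the first feature at 2.0
```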
Other impurity measures
Denote by $\hat{p}_{mk}$ the fraction of observations in region $R_m$ belonging to class $k$.
  • Gini index: $\sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$
  • Entropy: $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$
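A quick numerical check of these two impurity measures, assuming a made-up vector p_hat of within-region class fractions:

```python
# Gini index and entropy for one region's class proportions.
import numpy as np

def gini(p_hat):
    return np.sum(p_hat * (1 - p_hat))

def entropy(p_hat):
    p = p_hat[p_hat > 0]          # drop zero proportions to avoid log(0)
    return -np.sum(p * np.log(p))

p_hat = np.array([0.7, 0.2, 0.1])
print(gini(p_hat), entropy(p_hat))   # 0.46, about 0.80
```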

Regression Trees

For prediction of a continuous outcome instead of a class label, everything essentially follows as before, but now we just fit a constant inside each rectangle.
The estimated regression function has the form
$$\hat{f}(x) = \sum_{m=1}^{M} c_m \cdot 1\{x \in R_m\},$$
so that $\hat{f}(x) = c_m$ when $x \in R_m$. The quantities $c_m$ are no longer predicted classes; instead they are real numbers, the average response within each region:
$$c_m = \operatorname{average}(y_i : x_i \in R_m).$$
The main difference in building the tree is that we use squared error loss instead of misclassification error (or Gini index or entropy) to decide which region to split.
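A small sketch of this prediction rule with made-up split points and data: the fitted value in each region is just the average response there.

```python
# Regression-tree prediction: a constant (the average response) per region.
import numpy as np

x = np.array([0.5, 1.2, 2.8, 3.1, 4.7])
y = np.array([1.0, 1.4, 3.0, 3.2, 5.1])

# Suppose the tree split at x = 2.0 and x = 4.0, giving three regions.
splits = [2.0, 4.0]
region = np.digitize(x, splits)                           # region index per point
c = np.array([y[region == m].mean() for m in range(3)])   # c_m = average response

def predict(x_new):
    return c[np.digitize(x_new, splits)]

print(c)                               # [1.2 3.1 5.1]
print(predict(np.array([1.0, 3.5])))   # [1.2 3.1]
```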
Nice properties of trees
  • The ability to tune bias and variance
  • Interpretability
  • Automatic handling of variable transformations
  • Automatic detection of interactions
However, trees tend to suffer from high variance for a given amount of flexibility because they are quite unstable: a small change in the observed data can lead to a dramatically different sequence of splits, and hence a different prediction. This instability comes from the hierarchical nature of the greedy search: once a split is made, it is permanent and can never be "unmade" further down in the tree.

Bootstrap

Basic Idea

The bootstrap is a fundamental resampling tool in statistics. The basic idea underlying the bootstrap is that we can estimate the true distribution $F$ by the so-called empirical distribution $\hat{F}_n$. Given the training data $(x_i, y_i)$, $i = 1, \dots, n$, the empirical distribution is simply
$$\hat{F}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{(x_i, y_i)}.$$
This is just a discrete probability distribution, putting equal weight on each of the observed training points.
With the bootstrap, we approximate the true distribution by a discrete distribution over the original sample points. We can repeat this process multiple times (in parallel) to obtain more data sets.
A bootstrap sample of size $m$ from the training data is
$$(x_{i_1}, y_{i_1}), \dots, (x_{i_m}, y_{i_m}),$$
where the indices $i_1, \dots, i_m$ are drawn uniformly at random from $\{1, \dots, n\}$ with replacement. This corresponds exactly to $m$ independent draws from $\hat{F}_n$, and hence it approximates what we would see if we could sample more data from the true $F$. When $m = n$, it is like sampling an entirely new training set.
Note that not all of the training points are represented in a bootstrap sample, and some are represented more than once. When $m = n$, about 36.8% of points are left out for large $n$, because
$$\Pr(\text{point } i \text{ is never drawn}) = \left(1 - \frac{1}{n}\right)^n \to e^{-1} \approx 0.368.$$
These left-out (out-of-bag) points can be used as a validation set.
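A minimal numerical check of the leave-out fraction, drawing indices with replacement on made-up data:

```python
# Bootstrap resampling: draw n indices with replacement and measure the
# fraction of original points left out (about 1/e for large n).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
idx = rng.integers(0, n, size=n)          # n uniform draws with replacement
left_out = 1 - len(np.unique(idx)) / n
print(left_out)                           # roughly 0.368
```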

Bagging

Bootstrapping gives us an approach to variance reduction: we can draw many bootstrap data sets, fit a new tree on each one, and "average" their predictions. The combination will be more stable than any single tree.
Given training data $(x_i, y_i)$, $i = 1, \dots, n$, bagging averages the predictions of $B$ predictors (here, trees), each fit on one of $B$ bootstrap samples.
For $b = 1, \dots, B$, we draw a bootstrap sample $(x_i^{*b}, y_i^{*b})$, $i = 1, \dots, n$, and fit a classification tree $\hat{f}^b$ on each sampled data set.
To classify an input $x$, we simply take the most commonly predicted class:
$$\hat{f}^{\mathrm{bag}}(x) = \arg\max_{k}\ \#\{b : \hat{f}^b(x) = k\}.$$
This is just choosing the class with the most votes.
Remark: the variance will decrease and the bias will stay the same when we average many things together.
Note that the trees trained on different bootstrapped data sets are not independent.
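Below is a hand-rolled bagging sketch following these definitions: each tree is fit on its own bootstrap sample and the final class is a majority vote. The synthetic data, $B = 25$, and the use of scikit-learn trees are illustrative assumptions, not part of the notes.

```python
# Bagging by hand: one tree per bootstrap sample, then a majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

B = 25
trees = []
for b in range(B):
    idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the B trees, shown on the first 50 training points
# just to demonstrate the mechanics.
votes = np.stack([t.predict(X[:50]) for t in trees])      # shape (B, 50)
y_hat = np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])
print((y_hat == y[:50]).mean())
```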
Disadvantages of Bagging
  • Loss of interpretability: the final bagged classifier is not a tree, so we forfeit the interpretability of a single classification tree.
  • Computational complexity: we are essentially multiplying the work of growing a single tree by $B$.

Random Forests: Improved Bagging

Randomness of Random Forests

Random forests have two dimensions of randomness:
  • bootstrap (row level)
  • variable (column level)
Random forests are a simple modification of bagged trees. To improve bagging, we want to make our trees more independent while using the same data.
Imagine bagging when you have a few strong variables. The same variables can dominate the splits across the trees, highly correlating their predictions.
To make the trees more independent, we only allow a small subset of variables to be considered at each split. At each split, $m$ of the $p$ variables are selected at random (commonly $m \approx \sqrt{p}$ for classification), and the split is chosen from among those $m$ variables. This makes the trees more varied and more independent, at the cost of making each individual greedy optimization worse. However, the result is a model with much better predictions and a nicer balance among correlated variables.
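For illustration, scikit-learn's random forest exposes this knob as max_features; the data set and settings below are assumptions for the sketch. The out-of-bag points double as a built-in validation set.

```python
# Random forest = bagging + a random subset of features at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # consider about sqrt(p) variables per split
    oob_score=True,        # evaluate on out-of-bag points
    random_state=0,
).fit(X, y)
print(rf.oob_score_)       # out-of-bag accuracy estimate
```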

Variable Importance

Random forests don't explicitly do variable selection like the Lasso, but their individual trees tend to rely more on informative variables when they search for the next split.
There are two popular ways to measure variable importance:
  • For each variable, measure the amount that RSS (or Gini index) decreases due to splits in that variable. Average this over all trees in the forest.
  • Randomly permute each variable (one at a time) and see how much the model performance decreases.
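A hand-rolled sketch of both importance measures on an illustrative random forest (synthetic data): scikit-learn's feature_importances_ gives the impurity-decrease version, and the loop implements the permutation version.

```python
# Two variable-importance measures for a random forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# 1) Impurity-decrease importance, averaged over the trees.
print(np.argsort(rf.feature_importances_)[::-1][:3])

# 2) Permutation importance: shuffle one column at a time on held-out data
#    and record the drop in accuracy.
rng = np.random.default_rng(0)
baseline = rf.score(X_te, y_te)
perm_importance = []
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    perm_importance.append(baseline - rf.score(X_perm, y_te))
print(np.argsort(perm_importance)[::-1][:3])
```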

Boosting

Like Bagging, Boosting is an approach to combining many weak estimators into a powerful estimator. In Bagging, we want our trees to be as independent as possible, so that their average has low variance.
In Boosting, we will build a sequence of simple prediction functions (usually trees). Each function will try to do well on observations that the previous functions did poorly on. This makes the individual functions very dependent. However, it can have great performance.

AdaBoost

Initialize: weights $w_i = 1/n$ for $i = 1, \dots, n$.
For $m = 1, \dots, M$:
  • Fit a classification tree $\hat{G}_m$ to the training data with weights $w_i$.
  • Compute the weighted misclassification error:
    $$\mathrm{err}_m = \frac{\sum_{i=1}^{n} w_i \, 1\{y_i \ne \hat{G}_m(x_i)\}}{\sum_{i=1}^{n} w_i}$$
  • Define
    $$\alpha_m = \log\!\left(\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}\right)$$
  • Update the observation weights:
    $$w_i \leftarrow w_i \cdot \exp\!\big(\alpha_m \, 1\{y_i \ne \hat{G}_m(x_i)\}\big)$$
Result:
$$\hat{G}(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m \hat{G}_m(x)\right)$$
As for the observation weight update:
  • If $y_i$ was guessed wrong, its weight is multiplied by $e^{\alpha_m} > 1$.
  • If $y_i$ was guessed right, its weight stays the same.
  • If $\alpha_m$ is big, the classifier did its job (its error is small), and the weights of the misclassified samples increase by a larger factor.
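A from-scratch sketch of these update rules, assuming labels recoded to {-1, +1}, depth-1 trees (stumps) as weak learners, and synthetic data; none of these choices come from the notes themselves.

```python
# AdaBoost sketch following the weight-update rules above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=5, random_state=1)
y = 2 * y - 1                                     # recode labels to {-1, +1}

n, M = len(X), 50
w = np.full(n, 1 / n)                             # initialize observation weights
stumps, alphas = [], []
for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)     # weighted misclassification error
    err = np.clip(err, 1e-10, 1 - 1e-10)          # guard against err = 0 or 1
    alpha = np.log((1 - err) / err)
    w = w * np.exp(alpha * (pred != y))           # up-weight the mistakes only
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: sign of the alpha-weighted vote.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print((np.sign(F) == y).mean())                   # training accuracy
```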
Intuition of AdaBoost
AdaBoost builds a sequence of trees, each of which tries to do well on the points that were missed by the earlier trees. The observation weights $w_i$ focus later classifiers on the difficult points.
The classifier weights $\alpha_m$ capture how well each component classifier does its weighted task, and they are used for both
  • Adjusting the observation weights
  • Weighting the components in the final classifier
While the 0-1 loss is natural, it is hard to optimize. We therefore trade the 0-1 loss for the exponential loss $L(y, f(x)) = \exp(-y f(x))$, which is easier to optimize and has similar behavior. The exponential loss is continuous, differentiable, convex, and still monotonically decreasing in the margin $y f(x)$. If we can make the exponential loss small, the 0-1 loss will also be small.
Then we need to solve
$$\min_{\{\alpha_m, G_m\}_{m=1}^{M}}\ \sum_{i=1}^{n} \exp\!\left(-y_i \sum_{m=1}^{M} \alpha_m G_m(x_i)\right).$$
We can use a greedy algorithm to optimize one pair $(\alpha_m, G_m)$ at a time while holding the previously fitted functions fixed, never coming back.
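A tiny numerical comparison of the two losses as a function of the margin $y f(x)$, with made-up margin values:

```python
# Exponential loss as a smooth surrogate for 0-1 loss on the margin y*f(x):
# both are small when the margin is large and positive.
import numpy as np

margin = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])    # y * f(x)
zero_one = (margin <= 0).astype(float)            # count margin <= 0 as an error
exp_loss = np.exp(-margin)
print(zero_one)   # [1. 1. 1. 0. 0.]
print(exp_loss)   # [7.39 1.65 1.   0.61 0.14]
```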
Boosting for Regression Trees
Procedure:
  • Set $\hat{f}(x) = 0$ and residuals $r_i = y_i$ for all $i$ in the training data.
  • For $b = 1, \dots, B$:
    • Fit a regression tree $\hat{f}^b$ with $d$ splits to the training data $(x_i, r_i)$, $i = 1, \dots, n$.
    • Update $\hat{f}$ by adding in a shrunken version of the new tree:
      $$\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$$
    • Update the residuals:
      $$r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$$
  • Output the boosted model:
    $$\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$$
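A minimal sketch of this procedure with scikit-learn regression trees on synthetic data; the values of $B$, the shrinkage $\lambda$, and the depth $d$ below are arbitrary illustrative choices.

```python
# Boosting for regression: shallow trees fit to residuals, added with shrinkage.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)

B, lam, d = 200, 0.1, 2                  # number of trees, shrinkage, tree depth
r = y.copy()                             # residuals start at y (f_hat = 0)
trees = []
for b in range(B):
    tree = DecisionTreeRegressor(max_depth=d).fit(X, r)
    trees.append(tree)
    r = r - lam * tree.predict(X)        # update residuals

def f_hat(X_new):
    return lam * sum(t.predict(X_new) for t in trees)

print(np.mean((f_hat(X) - y) ** 2))      # training MSE shrinks as B grows
```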
Remark
In Bagging (and Random Forests), we use deep trees to stay unbiased, and then rely on averaging to reduce variance. In Boosting, we tend to use very simple estimators, like very shallow trees. These have low variance but high bias; since the sequence of trees can adjust for previous errors, the bias can be corrected.
Besides, shallower trees also lead to computational speedups in evaluation, because we don't have to evaluate 500 deep trees. We can think of the depth of our trees as allowing interactions: two splits let two variables interact. We often find that we want somewhere between 2 and 10 splits per tree.
Boosting is effectively fitting an additive expansion in a set of "basis functions" defined by the weak learners. Basis function expansions typically look like
$$f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m),$$
where
  • $\beta_m$, for $m = 1, \dots, M$, are the expansion coefficients
  • $b(x; \gamma_m)$ are simple functions of $x$, parameterized by $\gamma_m$.

Extreme Gradient Boosting (XGBoost)

Features:
  • More regularization: includes additional regularization hyperparameters on tree complexity
  • Early stopping: implements early stopping checks to stop model building when additional trees offer no improvement
  • Custom loss functions: users are allowed to define and optimize custom objective and evaluation criteria
  • Continue with existing model: fit an initial model, save it, then later resume training from that model
  • Computationally efficient
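A hedged usage sketch touching a few of these features (shrinkage, shallow trees, L2 regularization, early stopping). Where early_stopping_rounds goes has moved between xgboost versions (constructor vs. fit), so treat the exact parameter placement, and the data, as assumptions.

```python
# Illustrative XGBoost regression with early stopping on a validation set.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=1000,          # upper bound; early stopping decides the real number
    learning_rate=0.05,         # shrinkage, as in boosting above
    max_depth=3,                # shallow trees
    reg_lambda=1.0,             # L2 regularization on leaf weights
    early_stopping_rounds=20,   # in recent versions; older versions take this in fit()
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print(model.best_iteration)     # number of trees actually kept
```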