
04 Supervised Learning and Classification


1. Bayesian discriminant rule / Bayes classifier

Same setup as with mixture models:
  • $K$ groups, with $K$ prespecified
  • each group $k$ has its own distribution $F_k$ with its own parameters $\theta_k$
  • the density of class $k$ is $f_k(x) = f(x \mid \theta_k)$
  • the prior probability of group $k$ is $\pi_k$, with $\sum_{k=1}^{K} \pi_k = 1$
  • the marginal density is the mixture $f(x) = \sum_{k=1}^{K} \pi_k f_k(x)$
The posterior probability of group $k$ is then
$$\Pr(k \mid x) = \frac{\pi_k f_k(x)}{f(x)} = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)},$$
which provides a "soft" classification.
The discriminant function is the logarithm of the posterior probability:
$$d_k(x) = \log \Pr(k \mid x) = \log \pi_k + \log f_k(x) - \log f(x).$$
Dropping constant terms which do not depend on $k$, we get
$$d_k(x) = \log \pi_k + \log f_k(x).$$
The hard classification is then
$$\hat{y}(x) = \arg\max_{k} \, d_k(x).$$
The discriminant functions can also be mapped back to the probabilistic class assignment by using the softmax function:
$$\Pr(k \mid x) = \frac{\exp\big(d_k(x) - d_{\max}(x)\big)}{\sum_{l=1}^{K} \exp\big(d_l(x) - d_{\max}(x)\big)}, \qquad d_{\max}(x) = \max_{k} d_k(x).$$
Subtracting $d_{\max}(x)$ avoids numerical overflow problems.
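A minimal NumPy sketch of this mapping from discriminant scores to soft and hard classifications (the function names are just illustrative):

```python
import numpy as np

def soft_classification(d):
    """Map discriminant scores d_1, ..., d_K to posterior probabilities
    via the softmax; subtracting the maximum avoids overflow in exp()."""
    d = np.asarray(d, dtype=float)
    e = np.exp(d - d.max())
    return e / e.sum()

def hard_classification(d):
    """Hard assignment: the class with the largest discriminant score."""
    return int(np.argmax(d))

# example: three classes with discriminant scores on the log scale
scores = np.array([1000.2, 1001.7, 998.9])   # naive exp() would overflow here
print(soft_classification(scores))            # posterior class probabilities
print(hard_classification(scores))            # index of the predicted class
```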

2. Quadratic discriminant analysis (QDA)

QDA is a special case of the Bayes classifier in which all class densities are multivariate normal, with class-specific means and covariances:
$$f_k(x) = N(x \mid \mu_k, \Sigma_k).$$
Then the discriminant function of QDA is:
$$d_k(x) = \log \pi_k - \frac{1}{2} \log \det \Sigma_k - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k).$$
Note that:
  • terms that do not depend on $k$ are dropped, such as $-\frac{d}{2}\log(2\pi)$
  • the squared Mahalanobis distance between $x$ and $\mu_k$ appears: $(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)$
  • the QDA discriminant function is quadratic in $x$. This implies that the decision boundaries for QDA classification are quadratic.
Besides, we can multiply the discriminant function by $-2$ to get rid of the factor $-\frac{1}{2}$, but we then need to find the minimum of the discriminant function rather than the maximum:
$$d_k^{\star}(x) = (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \det \Sigma_k - 2 \log \pi_k, \qquad \hat{y}(x) = \arg\min_{k} d_k^{\star}(x).$$
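A minimal NumPy sketch of the QDA discriminant function, assuming the class parameters $\pi_k$, $\mu_k$ and $\Sigma_k$ are already available (the name qda_discriminant is illustrative):

```python
import numpy as np

def qda_discriminant(x, pi, mu, Sigma):
    """QDA discriminant d_k(x) = log pi_k - 1/2 log det Sigma_k
    - 1/2 (x - mu_k)^T Sigma_k^{-1} (x - mu_k), for all classes k."""
    scores = []
    for pi_k, mu_k, Sigma_k in zip(pi, mu, Sigma):
        diff = x - mu_k
        maha2 = diff @ np.linalg.solve(Sigma_k, diff)   # squared Mahalanobis distance
        sign, logdet = np.linalg.slogdet(Sigma_k)
        scores.append(np.log(pi_k) - 0.5 * logdet - 0.5 * maha2)
    return np.array(scores)

# example with two classes in two dimensions
pi = [0.6, 0.4]
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
x = np.array([1.5, 0.5])
d = qda_discriminant(x, pi, mu, Sigma)
print(d, d.argmax())   # discriminant scores and the predicted class
```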

3. Linear discriminant analysis (LDA)

LDA is a special case of QDA, with the assumption of a common overall covariance across all groups: $\Sigma_1 = \dots = \Sigma_K = \Sigma$. This leads to a simplified discriminant function (the $\log \det \Sigma$ term no longer depends on $k$ and is dropped):
$$d_k(x) = \log \pi_k - \frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k).$$
The function can be further simplified by noting that the quadratic term $x^\top \Sigma^{-1} x$ does not depend on $k$ and hence can be dropped:
$$d_k(x) = \log \pi_k + \mu_k^\top \Sigma^{-1} x - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k.$$
Thus, the LDA discriminant function is linear in $x$, and hence the resulting decision boundaries are linear as well.
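The linear form makes the per-class weights explicit; a corresponding sketch, assuming a single shared covariance $\Sigma$ (names again illustrative):

```python
import numpy as np

def lda_discriminant(x, pi, mu, Sigma):
    """Linear LDA discriminant d_k(x) = log pi_k + mu_k^T Sigma^{-1} x
    - 1/2 mu_k^T Sigma^{-1} mu_k, with a single shared covariance Sigma."""
    Sigma_inv = np.linalg.inv(Sigma)
    scores = []
    for pi_k, mu_k in zip(pi, mu):
        b_k = Sigma_inv @ mu_k                    # linear weights for class k
        a_k = np.log(pi_k) - 0.5 * mu_k @ b_k     # class-specific offset
        scores.append(a_k + b_k @ x)              # linear in x
    return np.array(scores)
```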

4. Diagonal discriminant analysis (DDA)

DDA simplifies the model even further by additionally requiring that the common covariance is diagonal, containing only the variances (which means all correlations among the predictors are zero): $\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_d^2)$.
This simplifies the inversion of $\Sigma$, as
$$\Sigma^{-1} = \operatorname{diag}\!\left(\frac{1}{\sigma_1^2}, \dots, \frac{1}{\sigma_d^2}\right).$$
And thus the discriminant function is
$$d_k(x) = \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \frac{(x_j - \mu_{kj})^2}{\sigma_j^2}.$$
DDA is also a linear classifier and has linear decision boundaries (the term $\sum_j x_j^2 / \sigma_j^2$ does not depend on $k$ and can be dropped).
The Bayes classifier (using any distribution) assuming uncorrelated predictors is also known as the naive Bayes classifier. Hence, DDA is a naive Bayes classifier assuming an underlying Gaussian distribution.
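A sketch of the DDA discriminant, where only a shared vector of variances is needed:

```python
import numpy as np

def dda_discriminant(x, pi, mu, var):
    """DDA discriminant d_k(x) = log pi_k - 1/2 * sum_j (x_j - mu_kj)^2 / sigma_j^2,
    using a shared vector of variances `var` (diagonal covariance)."""
    scores = [np.log(pi_k) - 0.5 * np.sum((x - mu_k) ** 2 / var)
              for pi_k, mu_k in zip(pi, mu)]
    return np.array(scores)
```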

5. Training

Number of model parameters
  • For QDA, LDA and DDA, we need to learn the class probabilities $\pi_1, \dots, \pi_K$ (with $\sum_k \pi_k = 1$, i.e. $K-1$ free parameters) and the mean vectors $\mu_1, \dots, \mu_K$
  • For QDA, we additionally require the $K$ covariance matrices $\Sigma_1, \dots, \Sigma_K$
  • For LDA, we need the single common covariance matrix $\Sigma$
  • For DDA, we estimate the $d$ variances $\sigma_1^2, \dots, \sigma_d^2$
QDA: $(K-1) + Kd + K\,\frac{d(d+1)}{2}$ parameters
LDA: $(K-1) + Kd + \frac{d(d+1)}{2}$ parameters
DDA: $(K-1) + Kd + d$ parameters
The number of parameters needed by these models is compared in the figure below.
[Figure: number of model parameters for QDA, LDA and DDA]
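A small sketch to compute these parameter counts for given $K$ and $d$ (the example values below are arbitrary):

```python
def n_params_qda(K, d):
    # K-1 class probabilities, K mean vectors, K full covariance matrices
    return (K - 1) + K * d + K * d * (d + 1) // 2

def n_params_lda(K, d):
    # K-1 class probabilities, K mean vectors, one shared covariance matrix
    return (K - 1) + K * d + d * (d + 1) // 2

def n_params_dda(K, d):
    # K-1 class probabilities, K mean vectors, d variances
    return (K - 1) + K * d + d

for d in (2, 10, 100):
    print(d, n_params_qda(4, d), n_params_lda(4, d), n_params_dda(4, d))
```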
Large sample size
If the sample size of the training data set is sufficiently large compared to the model dimension, we can use maximum likelihood (ML) to estimate the model parameters. From the figure above, we see that QDA and LDA need a large sample size for ML estimation to be feasible, whereas for DDA a relatively small sample size is already sufficient.
Assume we have $n$ samples $x_1, \dots, x_n$ with labels $y_1, \dots, y_n$. Let $G_k$ be the set of all indices of training samples belonging to group $k$, and let $n_k = |G_k|$ be the sample size in group $k$.
The ML estimates of the class probabilities are the frequencies
$$\hat{\pi}_k = \frac{n_k}{n},$$
the ML estimates of the group means are
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in G_k} x_i,$$
the ML estimate of the global mean is
$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^{n} x_i = \sum_{k=1}^{K} \hat{\pi}_k \hat{\mu}_k.$$
The ML estimates of the covariances for QDA are
$$\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i \in G_k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top.$$
The ML estimate of $\Sigma$ for LDA is the pooled covariance
$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in G_k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top.$$
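A minimal sketch of these ML estimates from a data matrix X and label vector y (assuming NumPy arrays; the helper name is illustrative):

```python
import numpy as np

def ml_estimates(X, y):
    """ML estimates from data X (n x d) and integer labels y.
    Returns class frequencies, group means, per-group covariances (QDA)
    and the pooled covariance (LDA)."""
    classes = np.unique(y)
    n, d = X.shape
    pi, mu, Sigma_k = [], [], []
    Sigma_pooled = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k]
        nk = len(Xk)
        mk = Xk.mean(axis=0)
        Ck = (Xk - mk).T @ (Xk - mk) / nk      # ML (not unbiased) covariance
        pi.append(nk / n)
        mu.append(mk)
        Sigma_k.append(Ck)
        Sigma_pooled += (Xk - mk).T @ (Xk - mk)
    return np.array(pi), np.array(mu), np.array(Sigma_k), Sigma_pooled / n
```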
Small sample size
Modern statistics has developed many different (but related) methods for use in high-dimensional, small-sample settings:
  • regularised estimators
  • shrinkage estimators
  • penalised maximum likelihood estimators
  • Bayesian estimators
  • Empirical Bayes estimators
  • KL / entropy-based estimators
Prediction error
A measure of prediction error compares the predicted label $\hat{y}$ with the true label $y$ for validation data.
For continuous outcomes, the squared loss is often used:
$$L(y, \hat{y}) = (y - \hat{y})^2.$$
For binary outcomes, the 0/1 loss is often used:
$$L(y, \hat{y}) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{if } \hat{y} \neq y \end{cases}$$
The empirical mean prediction error is
$$\widehat{\mathrm{Err}} = \frac{1}{m} \sum_{i=1}^{m} L(y_i, \hat{y}_i),$$
where $m$ is the sample size of the validation data set.
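A short sketch of the 0/1 loss and the empirical mean prediction error:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0/1 loss per sample: 0 if the labels agree, 1 otherwise."""
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(float)

def mean_prediction_error(y_true, y_pred, loss=zero_one_loss):
    """Empirical mean prediction error: average loss over the validation set."""
    return loss(y_true, y_pred).mean()

print(mean_prediction_error([0, 1, 1, 0], [0, 1, 0, 0]))   # 0.25
```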
Prediction error using cross-validation
Quite often we do not have separate validation data available to evaluate a classifier. In this case, we can use cross-validation. An outline of cross-validation (see the sketch after this list):
  1. Split the samples in the training data into a number of parts (folds).
  2. Use each fold in turn as validation data and the remaining folds as training data.
  3. Average over the resulting individual estimates of prediction error to get an overall aggregated prediction error, along with an associated error estimate.
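A minimal sketch of this procedure, assuming NumPy arrays and where fit and predict stand in for whatever classifier (e.g. LDA or DDA) is being evaluated:

```python
import numpy as np

def cross_validated_error(X, y, fit, predict, n_folds=5, seed=0):
    """Estimate prediction error by n_folds-fold cross-validation.
    `fit(X, y)` returns a fitted model, `predict(model, X)` returns labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train], y[train])
        y_hat = predict(model, X[val])
        errors.append(np.mean(y_hat != y[val]))   # 0/1 loss on the held-out fold
    errors = np.array(errors)
    # aggregated error and its standard error across folds
    return errors.mean(), errors.std(ddof=1) / np.sqrt(n_folds)
```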
