1. Bayesian discriminant rule / Bayes classifier
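Same setup as with mixture models:
- $K$ groups, with $K$ prespecified
- each group has its own distribution $F_k$ with its own parameters $\theta_k$
- the density of each class is $f_k(x) = f(x \mid k)$
- the prior probability of group $k$ is $\text{Pr}(k) = \pi_k$, with $\sum_{k=1}^K \pi_k = 1$
- the marginal density is the mixture $f(x) = \sum_{k=1}^K \pi_k f_k(x)$
The posterior probability of group $k$ is then
$$\text{Pr}(k \mid x) = \frac{\pi_k f_k(x)}{f(x)}$$
which provides a "soft" classification.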
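The discriminant function is the logarithm of the posterior probability:
$$d_k(x) = \log \text{Pr}(k \mid x) = \log \pi_k + \log f_k(x) - \log f(x)$$
Dropping the term $-\log f(x)$, which does not depend on $k$, we get
$$d_k(x) = \log \pi_k + \log f_k(x)$$
The hard classification is then
$$\hat{y}(x) = \underset{k}{\arg\max} \; d_k(x)$$
The discriminant functions can also be mapped back to the probabilistic class assignment by using the softmax function:
$$\text{Pr}(k \mid x) = \frac{\exp\left(d_k(x) - d_{\max}(x)\right)}{\sum_{l=1}^K \exp\left(d_l(x) - d_{\max}(x)\right)}$$
Subtracting $d_{\max}(x) = \max_k d_k(x)$ leaves the result unchanged and avoids numerical overflow problems.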
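The softmax mapping and the hard classification can be sketched in a few lines of Python. This is only an illustration: the function names and the numbers for the priors and class densities are invented for the example and are not part of the notes.

```python
import math

def posterior_from_scores(scores):
    """Map discriminant scores d_k(x) to posterior probabilities (stable softmax)."""
    d_max = max(scores)  # subtracting the maximum keeps every exponent <= 0
    weights = [math.exp(d - d_max) for d in scores]
    total = sum(weights)
    return [w / total for w in weights]

def hard_classification(scores):
    """Return the (0-based) group index with the largest discriminant score."""
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical example: two groups, with class densities already evaluated at some x.
priors = [0.3, 0.7]
densities = [0.20, 0.05]
scores = [math.log(p) + math.log(f) for p, f in zip(priors, densities)]

probs = posterior_from_scores(scores)   # soft classification
label = hard_classification(scores)     # hard classification
```

Because the scores only omit the common term $-\log f(x)$, the softmax output coincides with $\pi_k f_k(x) / f(x)$. Scores as large as 1000 would overflow a naive `math.exp`, but are handled here because only differences from the maximum are exponentiated.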
2. Quadratic discriminant analysis (QDA)
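QDA is a special case of the Bayes classifier when all densities are multivariate normal with group-specific means and covariances, $f_k(x) = N(x \mid \mu_k, \Sigma_k)$.
Then the discriminant function of QDA is:
$$d_k^{QDA}(x) = -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \log \det(\Sigma_k) + \log \pi_k$$
Note that:
- terms that do not depend on $k$ are dropped, such as $-\frac{d}{2} \log(2\pi)$, where $d$ is the dimension of $x$
- the squared Mahalanobis distance between $x$ and $\mu_k$ appears: $(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$
- the QDA discriminant function is quadratic in $x$. This implies that the decision boundaries for QDA classification are quadratic.
Alternatively, we can multiply the discriminant function by $-2$ to get rid of the factor $-\frac{1}{2}$, but we then need to find the minimum of the discriminant function rather than the maximum:
$$-2\, d_k^{QDA}(x) = (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \det(\Sigma_k) - 2 \log \pi_k$$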
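A minimal NumPy sketch of the QDA discriminant function follows. The function name and all parameter values are assumptions made up for the illustration; in practice the means, covariances and priors would be estimated from training data.

```python
import numpy as np

def qda_discriminant(x, mean, cov, prior):
    """-1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log det(Sigma) + log(pi)."""
    diff = x - mean
    maha_sq = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)            # log-determinant without overflow
    return -0.5 * maha_sq - 0.5 * logdet + np.log(prior)

# Hypothetical two-group example in d = 2 dimensions.
x = np.array([1.0, 0.0])
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.eye(2), 4.0 * np.eye(2)]
priors = [0.5, 0.5]

scores = np.array([qda_discriminant(x, m, S, p)
                   for m, S, p in zip(means, covs, priors)])
predicted = int(np.argmax(scores))   # maximise d_k; equivalently minimise -2 d_k
```

Solving the linear system instead of forming $\Sigma_k^{-1}$ explicitly is a design choice for numerical stability, not something the notes prescribe.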
3. Linear discriminant analysis (LDA)
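LDA is a special case of QDA, with the assumption of a common overall covariance across all groups: $\Sigma_k = \Sigma$. The term $-\frac{1}{2} \log \det(\Sigma)$ then no longer depends on $k$ and can be dropped, which leads to a simplified discriminant function:
$$d_k^{LDA}(x) = -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log \pi_k$$
The function can be further simplified by noting that the quadratic term $x^T \Sigma^{-1} x$ does not depend on $k$ and hence can be dropped:
$$d_k^{LDA}(x) = \mu_k^T \Sigma^{-1} x - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$$
Thus, the LDA discriminant function is linear in $x$, and hence the resulting decision boundaries are linear as well.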
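To make the linearity concrete, the sketch below writes the LDA scores as $b_k^T x + a_k$ with $b_k = \Sigma^{-1} \mu_k$ and $a_k = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$. The names `lda_coefficients`, `B` and `a`, as well as the example parameters, are hypothetical.

```python
import numpy as np

def lda_coefficients(means, cov, priors):
    """Return (B, a) such that the vector of LDA scores is B @ x + a.

    means: array of shape (K, d), one mean vector per row
    cov:   common covariance matrix of shape (d, d)
    """
    B = np.linalg.solve(cov, means.T).T                 # row k holds Sigma^{-1} mu_k
    a = -0.5 * np.sum(means * B, axis=1) + np.log(priors)
    return B, a

# Hypothetical two-group example with correlated predictors.
means = np.array([[0.0, 0.0], [2.0, 0.0]])
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
priors = np.array([0.5, 0.5])
x = np.array([1.5, 0.2])

B, a = lda_coefficients(means, cov, priors)
scores = B @ x + a
predicted = int(np.argmax(scores))
```

The coefficients are computed once and reused for every new $x$, so prediction costs only one matrix-vector product.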
4. Diagonal discriminant analysis (DDA)
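DDA simplifies the model even further: in addition to the common covariance of LDA, it requires the covariance to be diagonal, containing only the variances (which means all correlations among the predictors are zero):
$$\Sigma = V = \text{diag}(\sigma_1^2, \ldots, \sigma_d^2)$$
This simplifies the inversion of $\Sigma$, since
$$\Sigma^{-1} = V^{-1} = \text{diag}(\sigma_1^{-2}, \ldots, \sigma_d^{-2})$$
And thus the discriminant function is
$$d_k^{DDA}(x) = \mu_k^T V^{-1} x - \frac{1}{2} \mu_k^T V^{-1} \mu_k + \log \pi_k = \sum_{j=1}^d \frac{\mu_{k,j} x_j - \mu_{k,j}^2 / 2}{\sigma_j^2} + \log \pi_k$$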
DDA is also a linear classifier and has linear decision boundaries.
The Bayes classifier (using any distribution) that assumes uncorrelated predictors is also known as the naive Bayes classifier. Hence, DDA is a naive Bayes classifier that assumes underlying Gaussian distributions.
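Because the covariance is diagonal, the DDA score reduces to a sum over predictors and needs no matrix algebra. The following plain-Python sketch illustrates this; the function name and the example values are assumptions.

```python
import math

def dda_discriminant(x, mean, variances, prior):
    """sum_j (mu_j * x_j - mu_j^2 / 2) / sigma_j^2 + log(pi) for one group."""
    linear = sum(m * xj / v for m, xj, v in zip(mean, x, variances))
    offset = sum(m * m / v for m, v in zip(mean, variances))
    return linear - 0.5 * offset + math.log(prior)

# Hypothetical example: two groups, two predictors, shared variances.
x = [1.0, 2.0]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [1.0, 4.0]
priors = [0.5, 0.5]

scores = [dda_discriminant(x, m, variances, p) for m, p in zip(means, priors)]
predicted = max(range(len(scores)), key=lambda k: scores[k])
```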
5. Training
Number of model parameters
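- For QDA, LDA and DDA, we need to learn $\pi_1, \ldots, \pi_K$ with $\sum_{k=1}^K \pi_k = 1$ and the mean vectors $\mu_1, \ldots, \mu_K$
- For QDA, we additionally require $\Sigma_1, \ldots, \Sigma_K$
- For LDA, we need $\Sigma$
- For DDA, we estimate $\sigma_1^2, \ldots, \sigma_d^2$
The total number of parameters is therefore:
QDA: $(K - 1) + K d + K \frac{d(d+1)}{2}$
LDA: $(K - 1) + K d + \frac{d(d+1)}{2}$
DDA: $(K - 1) + K d + d$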
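The counts can be tabulated with a short helper. The function name and the values of $K$ and $d$ used in the example are arbitrary choices for illustration.

```python
def n_params(K, d):
    """Number of free parameters of QDA, LDA and DDA for K groups and d predictors."""
    shared = (K - 1) + K * d          # class probabilities (sum to 1) and mean vectors
    cov_full = d * (d + 1) // 2       # free entries of one symmetric d x d covariance
    return {
        "QDA": shared + K * cov_full,  # one full covariance per group
        "LDA": shared + cov_full,      # one common full covariance
        "DDA": shared + d,             # one common diagonal covariance
    }

# Arbitrary illustration: K = 2 groups, d = 10 predictors.
counts = n_params(2, 10)
```

The covariance terms of QDA and LDA grow quadratically in $d$, while all terms of DDA grow linearly.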
For a given number of groups $K$, the number of parameters needed by these models is shown below.
Large sample size
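If the sample size of the training data set is sufficiently large compared to the model dimensions, we can use maximum likelihood (ML) to estimate the model parameters. From the figure above we see that QDA and LDA need a larger sample size before ML can be used, whereas for DDA a relatively small sample size is sufficient.
Assume we have $n$ samples $x_1, \ldots, x_n$ with labels $y_1, \ldots, y_n$. $G_k = \{ i : y_i = k \}$ is the set of all indices of training samples belonging to group $k$. $n_k$ is the sample size in group $k$, so that $\sum_{k=1}^K n_k = n$.
The ML estimates of the class probabilities are the frequencies
$$\hat{\pi}_k = \frac{n_k}{n}$$
the ML estimates of the group means are
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i \in G_k} x_i$$
the ML estimate of the global mean is
$$\hat{\mu}_0 = \frac{1}{n} \sum_{i=1}^n x_i = \sum_{k=1}^K \hat{\pi}_k \hat{\mu}_k$$
The ML estimates of the covariances $\Sigma_k$ for QDA are
$$\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i \in G_k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
The ML estimate of $\Sigma$ for LDA is the pooled covariance
$$\hat{\Sigma} = \frac{1}{n} \sum_{k=1}^K \sum_{i \in G_k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T = \sum_{k=1}^K \hat{\pi}_k \hat{\Sigma}_k$$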
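These estimators translate directly into NumPy. The sketch below assumes that labels are coded as integers $0, \ldots, K-1$ and that every group is non-empty; the function name and the toy data are invented for illustration.

```python
import numpy as np

def ml_estimates(X, y, K):
    """ML estimates for QDA/LDA from data X (n x d) and integer labels y in 0..K-1."""
    n, d = X.shape
    priors = np.zeros(K)
    means = np.zeros((K, d))
    covs = np.zeros((K, d, d))
    for k in range(K):
        Xk = X[y == k]                  # samples with index in G_k
        n_k = Xk.shape[0]
        priors[k] = n_k / n
        means[k] = Xk.mean(axis=0)
        centered = Xk - means[k]
        covs[k] = centered.T @ centered / n_k   # ML estimate: divide by n_k, not n_k - 1
    pooled = np.einsum("k,kij->ij", priors, covs)   # sum_k pi_k * Sigma_k (LDA)
    global_mean = X.mean(axis=0)
    return priors, means, covs, pooled, global_mean

# Toy data: five samples in two dimensions, two groups.
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
y = np.array([0, 0, 0, 1, 1])
priors, means, covs, pooled, global_mean = ml_estimates(X, y, K=2)
```

For DDA, one option consistent with the model is to keep only the diagonal of the pooled estimate, `np.diag(pooled)`. With so few samples the covariance estimates are singular or nearly so, which is the situation the small-sample estimators address.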
Small sample size
Modern statistics has developed many different (but related) methods for use in high-dimensional, small-sample settings:
- regularised estimators
- shrinkage estimators
- penalised maximum likelihood estimators
- Bayesian estimators
- Empirical Bayes estimators
- KL / entropy-based estimators
Prediction error
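A measure of prediction error compares the predicted label $\hat{y}$ with the true label $y$ for validation data.
For continuous outcomes, the squared loss is often used:
$$\text{err}(\hat{y}, y) = (\hat{y} - y)^2$$
For binary outcomes, the 0/1 loss is often used:
$$\text{err}(\hat{y}, y) = \begin{cases} 0 & \text{if } \hat{y} = y \\ 1 & \text{otherwise} \end{cases}$$
The empirical mean prediction error is
$$\widehat{PE} = \frac{1}{m} \sum_{i=1}^m \text{err}(\hat{y}_i, y_i)$$
where $m$ is the sample size of the validation data set.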
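Both loss functions and the empirical mean prediction error fit in a few lines of Python. The function names and the example labels are hypothetical.

```python
def squared_loss(y_pred, y_true):
    """Squared loss for continuous outcomes."""
    return (y_pred - y_true) ** 2

def zero_one_loss(y_pred, y_true):
    """0/1 loss: 0 for a correct label, 1 for a wrong one."""
    return 0 if y_pred == y_true else 1

def mean_prediction_error(y_pred, y_true, loss):
    """Average loss over the validation samples."""
    if len(y_pred) != len(y_true):
        raise ValueError("predictions and true labels must have the same length")
    return sum(loss(p, t) for p, t in zip(y_pred, y_true)) / len(y_true)

# Hypothetical validation data: one of four labels is predicted incorrectly.
misclassification_rate = mean_prediction_error([0, 1, 1, 0], [0, 1, 0, 0], zero_one_loss)
mean_squared_error = mean_prediction_error([1.0, 2.5], [1.5, 2.0], squared_loss)
```

With the 0/1 loss the mean prediction error is simply the misclassification rate.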
Prediction error using Cross-validation
Quite often we do not have separate validation data available to evaluate a classifier. In this case, we can use cross-validation. An outline of cross-validation is:
- split the samples in the training data into a number (say $V$) of parts ("folds").
- use each of the $V$ folds in turn as validation data and the other $V - 1$ folds as training data.
- average over the resulting $V$ individual estimates of prediction error to get an overall aggregated prediction error, along with an estimate of its variability (e.g. a standard error).
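The outline above can be sketched as follows, using only the standard library and the 0/1 loss. Everything named here is an assumption for illustration: `cross_validate`, the one-dimensional nearest-mean classifier used as a stand-in model, and the toy data. The standard error is computed as the standard deviation of the fold errors divided by the square root of the number of folds, which is one common convention rather than the only one.

```python
import math
import random
import statistics

def cross_validate(xs, ys, n_folds, fit, predict, seed=0):
    """Return (mean misclassification rate, standard error) over n_folds folds.

    Assumes 2 <= n_folds <= len(xs), so that no fold is empty.
    """
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]   # disjoint parts
    fold_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        mistakes = sum(predict(model, xs[i]) != ys[i] for i in fold)
        fold_errors.append(mistakes / len(fold))
    mean_error = statistics.mean(fold_errors)
    std_error = statistics.stdev(fold_errors) / math.sqrt(n_folds)
    return mean_error, std_error

def fit_nearest_mean(xs, ys):
    """Stand-in classifier: store the mean of each group (1-D data)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: statistics.mean(vals) for label, vals in groups.items()}

def predict_nearest_mean(model, x):
    """Assign x to the group whose mean is closest."""
    return min(model, key=lambda label: abs(x - model[label]))

# Toy data: two well-separated groups of six samples each.
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0]
ys = [0] * 6 + [1] * 6
mean_error, std_error = cross_validate(xs, ys, 3, fit_nearest_mean, predict_nearest_mean)
```

Any of the classifiers above (QDA, LDA, DDA) could be plugged in through the `fit` and `predict` arguments in place of the stand-in model.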