Generative Models
The expected misclassification error,
$$P(Y \neq \hat{Y}(X)),$$
is minimized by the ideal rule
$$\hat{Y}(x) = \arg\max_k P(Y = k \mid X = x),$$
if we knew $P(Y = k \mid X = x)$ for all $k$. This is called the Bayes Classifier, and the error rate it achieves is the Bayes Rate.
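As a quick worked example (with made-up numbers): if at some $x$ we knew $P(Y = 1 \mid X = x) = 0.7$ and $P(Y = 0 \mid X = x) = 0.3$, the Bayes classifier predicts class 1 and errs with probability $0.3$ at that $x$; no rule can do better there, and averaging this minimal error over the distribution of $X$ gives the Bayes Rate.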
To build a classifier, we take one of two routes:
- Discriminative models: estimate $P(Y = k \mid X = x)$ directly
- Generative models: estimate both $\pi_k = P(Y = k)$ and $f_k(x) = P(X = x \mid Y = k)$ for each class $k$, then combine them via Bayes' rule: $P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_l \pi_l f_l(x)}$
Generative models can mimic the process that generated the data. For example, with the MNIST digits, it might be easier to describe what a 4 looks like than to describe how all ten digits differ from one another.
Linear Discriminant Analysis
LDA models the data within each class as being normally distributed:
- Each class has its own mean $\mu_k \in \mathbb{R}^p$, where $p$ is the feature dimension
- All classes have the same covariance matrix $\Sigma$

thus
$$X \mid Y = k \sim N(\mu_k, \Sigma), \qquad f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^\top \Sigma^{-1} (x - \mu_k)\right).$$
Decision Rule of LDA
$$\hat{Y}(x) = \arg\max_k \delta_k(x), \qquad \delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$
The $\delta_k$ are the discriminant functions, which are affine functions of $x$ (a linear combination of the features plus a constant).
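To see where these come from (a standard derivation from the model above): maximizing the posterior $P(Y = k \mid X = x) \propto \pi_k f_k(x)$ is the same as maximizing its logarithm,
$$\log(\pi_k f_k(x)) = \log \pi_k - \frac{1}{2}(x - \mu_k)^\top \Sigma^{-1}(x - \mu_k) + c = \delta_k(x) - \frac{1}{2} x^\top \Sigma^{-1} x + c',$$
where $c, c'$ collect terms that do not depend on $k$. Because $\Sigma$ is shared across classes, the quadratic term $-\frac{1}{2} x^\top \Sigma^{-1} x$ is common to every class and drops out of the comparison, leaving the affine $\delta_k(x)$.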
LDA Estimation & Inference
Based on the training data $\{(x_i, y_i)\}_{i=1}^n$, compute
- $\hat{\pi}_k = n_k / n$: the proportion of observations in class $k$ ($n_k$ is the number of points in class $k$)
- $\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i$: the centroid of class $k$
- $\hat{\Sigma} = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top$: the pooled sample covariance matrix

⇒ the estimated discriminant functions are:
$$\hat{\delta}_k(x) = x^\top \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2}\hat{\mu}_k^\top \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k$$
⇒ inference:
for a new $x$, predict $\hat{y} = \arg\max_k \hat{\delta}_k(x)$.
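As a concrete illustration, here is a minimal numpy sketch of this estimate-then-classify procedure (function names are my own invention; a production implementation is e.g. scikit-learn's `LinearDiscriminantAnalysis`):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class proportions, centroids, and the pooled covariance."""
    classes = np.unique(y)
    n, p = X.shape
    pis = np.array([np.mean(y == k) for k in classes])         # pi_hat_k
    mus = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_hat_k, (K, p)
    # Pooled covariance: within-class scatter summed over classes, / (n - K)
    Sigma = sum(
        (X[y == k] - mus[i]).T @ (X[y == k] - mus[i])
        for i, k in enumerate(classes)
    ) / (n - len(classes))
    return classes, pis, mus, Sigma

def lda_predict(X, classes, pis, mus, Sigma):
    """Pick the class with the largest estimated discriminant delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k, per class
    deltas = (X @ Sigma_inv @ mus.T
              - 0.5 * np.sum(mus @ Sigma_inv * mus, axis=1)
              + np.log(pis))
    return classes[np.argmax(deltas, axis=1)]
```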
Transformed LDA
The process above requires computing $\hat{\Sigma}^{-1}$. If $\hat{\Sigma}$ is not diagonal, the computational cost is high. Consider the factorization (eigendecomposition) of $\hat{\Sigma}$,
$$\hat{\Sigma} = U D U^\top,$$
where $U$ has orthonormal columns and rows, and $D = \mathrm{diag}(d_1, \dots, d_p)$ with $d_j > 0$ for each $j$. $\hat{\Sigma}^{-1}$ is then easily obtained as $U D^{-1} U^\top$.
The decision rule of LDA is equivalent to
$$\hat{y} = \arg\min_k \left[ \frac{1}{2} \|\tilde{x} - \tilde{\mu}_k\|^2 - \log \hat{\pi}_k \right],$$
where
$$\tilde{x} = D^{-1/2} U^\top x$$
and where
$$\tilde{\mu}_k = D^{-1/2} U^\top \hat{\mu}_k.$$
This transformation on $x$ is basically sphering the data points: the within-class covariance of the transformed data is the identity, so Mahalanobis distance under $\hat{\Sigma}$ becomes ordinary Euclidean distance. And thus the decision rule is: assign $x$ to the class with the nearest sphered centroid, after the $\log \hat{\pi}_k$ adjustment.
Further dimension reduction:
since we are comparing the distances of data points to the $K$ different centroids $\tilde{\mu}_1, \dots, \tilde{\mu}_K$, consider the affine subspace spanned by these centroids (thus the subspace has at most $K - 1$ dimensions). The component of $\tilde{x} - \tilde{\mu}_k$ orthogonal to this subspace is the same for every $k$, so we only need to compare the distances of the projected point on this subspace to these centroids.
Transformed LDA Estimation:
- Compute the sample estimates $\hat{\pi}_k$, $\hat{\mu}_k$, $\hat{\Sigma}$ as before
- Make two transformations (see the code sketch after the inference rule below)
    - (to avoid a complex matrix inverse) sphere the data points based on factoring $\hat{\Sigma} = U D U^\top$, i.e. $x \mapsto D^{-1/2} U^\top x$
    - (to reduce dimension) project the data points down to the affine subspace spanned by the sphered centroids
- The two linear transformations can be summarized into a single linear transformation $A \in \mathbb{R}^{(K-1) \times p}$, i.e. $x \mapsto \tilde{x} = Ax$
Transformed LDA Inference:
- Given any point $x$, transform it to $\tilde{x} = Ax$. Classify according to the rule
$$\hat{y} = \arg\min_k \left[ \frac{1}{2} \|\tilde{x} - \tilde{\mu}_k\|^2 - \log \hat{\pi}_k \right],$$
where $\tilde{\mu}_k = A \hat{\mu}_k$.

This way of describing LDA makes it similar to the Nearest Centroid method, but adjusted for class proportions (the $\log \hat{\pi}_k$ term).
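A minimal numpy sketch of building $A$ and classifying with it, reusing the estimates from `lda_fit` above (the QR-based construction of the centroid subspace is one reasonable choice, not the only one):

```python
import numpy as np

def sphere_and_project(mus, Sigma):
    """Combine sphering and projection into a single linear map A."""
    d, U = np.linalg.eigh(Sigma)          # Sigma = U diag(d) U^T
    W = U / np.sqrt(d)                    # column j scaled by 1/sqrt(d_j): W = U D^{-1/2}
    mus_tilde = mus @ W                   # sphered centroids, (K, p)
    # Orthonormal basis of the (at most K-1)-dim span of the centered centroids
    M = mus_tilde - mus_tilde.mean(axis=0)
    Q, _ = np.linalg.qr(M.T)              # (p, K); columns span the centroid subspace
    Q = Q[:, :len(mus) - 1]
    return Q.T @ W.T                      # A: maps x to its (K-1)-dim coordinates

def transformed_predict(X, A, mus, pis):
    """Nearest projected centroid, adjusted by log class proportions."""
    Xt, Mt = X @ A.T, mus @ A.T           # transformed points and centroids
    dist2 = ((Xt[:, None, :] - Mt[None, :, :]) ** 2).sum(axis=2)   # (n, K)
    return np.argmin(0.5 * dist2 - np.log(pis), axis=1)  # indices into classes
```

Usage, continuing the example: `A = sphere_and_project(mus, Sigma)`, then `classes[transformed_predict(X_new, A, mus, pis)]`.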
LDA vs. Logistic Regression
The decision rules of the two models are both linear, but their assumptions are different:
- LDA: works well if the groups are in "clumps", so that the Gaussian distribution is reasonable. It also assumes the shapes of the classes are similar (shared $\Sigma$).
- Logistic: only requires that the log-odds $\log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)}$ is linear in $x$, which is strictly weaker.

LDA will do better when its assumptions are reasonable, but worse otherwise. Logistic regression focuses more on the points near the decision boundary.
Quadratic Discriminant Analysis
QDA does not assume all the shapes are similar, i.e. different classes have different covariance matrices $\Sigma_k$. The discriminant functions are thus
$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2}(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k,$$
which are quadratic functions of $x$.
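A minimal numpy sketch of QDA in the same style as the LDA code above (illustrative, not a reference implementation):

```python
import numpy as np

def qda_fit(X, y):
    """Per class: prior, mean, and its own covariance matrix."""
    classes = np.unique(y)
    pis = np.array([np.mean(y == k) for k in classes])
    mus = [X[y == k].mean(axis=0) for k in classes]
    Sigmas = [np.cov(X[y == k], rowvar=False) for k in classes]
    return classes, pis, mus, Sigmas

def qda_predict(X, classes, pis, mus, Sigmas):
    """Pick the class with the largest quadratic discriminant delta_k(x)."""
    deltas = np.empty((X.shape[0], len(classes)))
    for i, (mu, S, pi) in enumerate(zip(mus, Sigmas, pis)):
        diff = X - mu                                        # (n, p)
        S_inv = np.linalg.inv(S)
        quad = np.einsum('ij,jk,ik->i', diff, S_inv, diff)   # (x-mu)^T S^-1 (x-mu)
        _, logdet = np.linalg.slogdet(S)
        deltas[:, i] = -0.5 * logdet - 0.5 * quad + np.log(pi)
    return classes[np.argmax(deltas, axis=1)]
```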
QDA might have better performance but has more parameters than LDA. Suppose we have $p$ variables and $K$ classes:
- fitting QDA requires estimating $Kp + K \frac{p(p+1)}{2}$ parameters ($K$ mean vectors and $K$ covariance matrices)
- fitting LDA requires estimating $Kp + \frac{p(p+1)}{2}$ parameters ($K$ mean vectors and one shared covariance matrix)
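For a concrete sense of the gap (illustrative numbers): with $p = 50$ and $K = 10$, QDA estimates $10 \cdot 50 + 10 \cdot \frac{50 \cdot 51}{2} = 13{,}250$ parameters, while LDA estimates only $10 \cdot 50 + \frac{50 \cdot 51}{2} = 1{,}775$.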
As the number of parameters increases, the variance of the estimator increases while the bias hopefully decreases.
When we choose LDA instead of QDA, we are choosing a more biased model in order to reduce variance.
Naive Bayes
When the variable dimension is very high, e.g. $n = 2000$ observations each with $p$ features for large $p$, it will be incredibly hard to estimate the joint density $f_k(x)$ well for any complicated model.
Naive Bayes assumes that, conditional on $Y = k$, all of the features of $X$ are independent, which is a very strong assumption made to reduce the number of parameters (and hence the variance). Thus
$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j).$$
To estimate the $f_k$, we can just estimate $K \times p$ univariate distributions $f_{kj}$.
Therefore, fitting Naive Bayes requires estimating roughly $K \cdot p \cdot m$ parameters (where $m$ depends on the specific models used for the univariate distributions).
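As an illustration, a minimal Gaussian Naive Bayes sketch in the same numpy style as above (this assumes one univariate Gaussian per feature per class, so $m = 2$; the univariate models could just as well be histograms or kernel density estimates):

```python
import numpy as np

def gnb_fit(X, y):
    """Per class: prior, plus a mean and variance for each feature separately."""
    classes = np.unique(y)
    pis = np.array([np.mean(y == k) for k in classes])
    mus = np.array([X[y == k].mean(axis=0) for k in classes])          # (K, p)
    vars_ = np.array([X[y == k].var(axis=0) for k in classes]) + 1e-9  # small ridge
    return classes, pis, mus, vars_

def gnb_predict(X, classes, pis, mus, vars_):
    """log f_k(x) = sum_j log N(x_j; mu_kj, var_kj), by the independence assumption."""
    log_lik = -0.5 * (
        np.log(2 * np.pi * vars_)[None, :, :]
        + (X[:, None, :] - mus[None, :, :]) ** 2 / vars_[None, :, :]
    ).sum(axis=2)                                                      # (n, K)
    return classes[np.argmax(log_lik + np.log(pis), axis=1)]
```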
Pros & Cons
Naive Bayes scales well to problems with very large $p$. We only need enough data to estimate each of the $K \times p$ univariate marginal distributions well.
It also allows a flexible choice of model for each of the univariate distributions.
However, Naive Bayes cannot capture interactions between the features within each class. (LDA and QDA are able to incorporate these feature interactions, at the cost of needing to estimate them).