โ™Ÿ๏ธ

MLI-2

Classification

Classification refers to problems where we observe features $X$ and try to guess a class label $Y$ using a classification function $\hat f(X)$, which is usually estimated from training data. We will mostly focus on two-class problems ($K = 2$), and denote the classes as $Y = 0$ or $Y = 1$.

Confusion Matrix

The confusion matrix tabulates guessed categories against true categories.

| | True $Y = 1$ | True $Y = 0$ |
| --- | --- | --- |
| Guess $\hat Y = 1$ | True Positive (TP) | False Positive (FP) |
| Guess $\hat Y = 0$ | False Negative (FN) | True Negative (TN) |

  • Misclassification rate: $(FP + FN) / n$
  • Accuracy: $(TP + TN) / n$ (both are computed in the sketch below)
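As a minimal sketch (plain Python, hypothetical 0/1 label lists), the four cells and the two rates can be computed directly:

```python
def confusion_counts(y_true, y_pred):
    """Tally the four confusion-matrix cells for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0]   # hypothetical true labels
y_pred = [1, 0, 0, 1, 1, 0]   # hypothetical guesses
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
n = len(y_true)
print("misclassification rate:", (fp + fn) / n)  # (FP + FN) / n = 2/6
print("accuracy:", (tp + tn) / n)                # (TP + TN) / n = 4/6
```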

Loss for classification

0-1 Loss
$L(y, \hat y) = \mathbf{1}\{y \neq \hat y\}$: the loss is $1$ if $\hat y \neq y$ and $0$ otherwise.
The 0-1 loss is unrealistic for many scenarios, since the cost of an error is usually asymmetric.
Statistical Decision Theory
Ideally, we would choose our prediction $\hat f(X)$ to minimize the expected loss $\mathbb{E}[L(Y, \hat f(X))]$ (the expected prediction error; for 0-1 loss, the expected misclassification error).
Denoting the prediction at $x$ by $\hat y$, we can condition on $X$ and minimize the inside of the expectation over $\hat y$ for each possible value of $x$, pointwise:
$$\hat f(x) = \arg\min_{\hat y} \mathbb{E}[L(Y, \hat y) \mid X = x] = \arg\max_k P(Y = k \mid X = x).$$
The result implies that, to minimize the loss given the feature measurements $X = x$, the prediction should be the most likely class. This is called the Bayes classifier, and it achieves the best possible expected misclassification error (if we could compute it).
Recall Bayes' theorem:
$$P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\,\pi_k}{P(X = x)}.$$
Let $\pi_k = P(Y = k)$ be the prior probability of class $k$. Since the Bayes classifier compares the above quantity across $k$ for a fixed $x$, the denominator is always the same, hence
$$\hat f(x) = \arg\max_k P(X = x \mid Y = k)\,\pi_k.$$
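To make this concrete, here is a minimal sketch in which the class-conditional densities are known one-dimensional Gaussians (a hypothetical setup, so $P(X = x \mid Y = k)\,\pi_k$ can be evaluated exactly):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical setup: X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1), priors 0.7 / 0.3.
priors = {0: 0.7, 1: 0.3}
params = {0: (0.0, 1.0), 1: (2.0, 1.0)}

def bayes_classifier(x):
    """Pick the class maximizing P(X = x | Y = k) * pi_k."""
    return max(priors, key=lambda k: normal_pdf(x, *params[k]) * priors[k])

print(bayes_classifier(0.5))  # -> 0
print(bayes_classifier(2.5))  # -> 1
```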

Building a classifier

Following the above results, to build a classifier we can
  • try to estimate $P(Y = k \mid X = x)$ directly ⇒ Discriminative Models
    • Logistic regression is a parametric version of this
    • KNN is a non-parametric version of this
  • try to estimate the distribution $P(X = x \mid Y = k)$ (together with the priors $\pi_k$) for each class ⇒ Generative Models
    • Linear discriminant analysis (LDA) and Naive Bayes are classic examples of this

Logistic Regression

Modelling

In logistic regression, we model
$$P(Y = 1 \mid X = x) = \frac{e^{\beta_0 + \beta^\top x}}{1 + e^{\beta_0 + \beta^\top x}}$$
for some unknown $\beta_0, \beta$, which we will estimate directly. Since $P(Y = 1 \mid X = x) \in (0, 1)$, we have
$$\log \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} = \beta_0 + \beta^\top x,$$
i.e. the log-odds are linear in $x$.
The function
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
can be used to transform real numbers into $(0, 1)$.
(figure: the sigmoid function $\sigma$, an S-shaped curve from $\mathbb{R}$ onto $(0, 1)$)
Classification by logistic regression
Suppose we fit a logistic regression, estimating $\hat\beta_0, \hat\beta$; then we classify
$$\hat Y = 1 \iff \hat P(Y = 1 \mid X = x) > \tfrac{1}{2} \iff \hat\beta_0 + \hat\beta^\top x > 0.$$
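A minimal sketch of this decision rule (plain Python; the coefficients below are hypothetical stand-ins for fitted values):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, beta0, beta):
    """Classify as 1 iff the estimated P(Y=1 | X=x) exceeds 1/2,
    i.e. iff the linear score beta0 + beta . x is positive."""
    score = beta0 + sum(b * xi for b, xi in zip(beta, x))
    p_hat = sigmoid(score)
    return (1 if p_hat > 0.5 else 0), p_hat

# Hypothetical fitted coefficients:
print(predict([1.2, -0.5], beta0=-1.0, beta=[2.0, 0.5]))  # (1, ~0.76)
```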

Estimating coefficients

We can use maximum likelihood to estimate the coefficients $\beta_0, \beta$.
Suppose we are given an i.i.d. sample $(x_1, y_1), \dots, (x_n, y_n)$. Here $y_i \in \{0, 1\}$ denotes the class of the $i$-th observation. Writing $p(x) = P(Y = 1 \mid X = x)$, the likelihood is
$$L(\beta_0, \beta) = \prod_{i=1}^n p(x_i)^{y_i}\,\bigl(1 - p(x_i)\bigr)^{1 - y_i},$$
and the log likelihood is
$$\ell(\beta_0, \beta) = \sum_{i=1}^n \bigl[\, y_i \log p(x_i) + (1 - y_i)\log\bigl(1 - p(x_i)\bigr) \,\bigr].$$
The coefficients are estimated by maximizing the likelihood:
$$(\hat\beta_0, \hat\beta) = \arg\max_{\beta_0, \beta}\, \ell(\beta_0, \beta).$$
Remark
  • If there are many variables, we would expect the variance of our estimated $\hat\beta$ to grow. To reduce the variance, we can add a regularization term to the objective function (see the comment in the sketch below).
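An illustrative sketch of the maximization by gradient ascent on the log likelihood (one feature, hypothetical data; production libraries typically use Newton/IRLS-type solvers instead):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D training sample (x_i, y_i), not linearly separable.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0,   0,   1,   0,   1,   1]

b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(5000):
    # Gradient of the log likelihood: sum_i (y_i - p(x_i)) * (1, x_i).
    g0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
    g1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
    # For an L2 regularization term, subtract lam * b1 from g1 here.
    b0, b1 = b0 + lr * g0, b1 + lr * g1

print(b0, b1)  # fitted intercept and slope; decision boundary at x = -b0 / b1
```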

Multinomial Logistic Regression

We can generalize logistic regression to $K$ classes, leveraging the same ideas. We now have coefficient vectors $\beta_1, \dots, \beta_K$ (with intercepts $\beta_{10}, \dots, \beta_{K0}$), and define
$$P(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_k^\top x}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_l^\top x}}.$$
Since the probabilities sum to 1, we only need $K - 1$ equations.
These probabilities are given by the softmax function.
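A minimal softmax sketch (the max-subtraction is a standard numerical-stability trick, not part of the definition):

```python
import math

def softmax(scores):
    """Map K real scores to K probabilities summing to 1.
    Subtracting the max first avoids overflow in exp."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical per-class linear scores beta_k0 + beta_k . x for K = 3 classes:
print(softmax([2.0, 1.0, 0.1]))  # ~[0.66, 0.24, 0.10]
```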

Imbalanced Data

Suppose that an event ($Y = 1$) only happens 5% of the time in the general population. We know that we should classify $\hat Y = 1$ when $P(Y = 1 \mid X = x) > 1/2$ to minimize the expected misclassification rate. By Bayes' theorem,
$$P(Y = 1 \mid X = x) = \frac{P(X = x \mid Y = 1) \times 0.05}{P(X = x)},$$
therefore, to classify as $\hat Y = 1$, we would need to see an $x$ that is 10 times more likely among the $Y = 1$ cases than in the general population. This poses a significant challenge for our classifier. Even if there is useful information in the data, it is probably not enough to flip the classifications (i.e. the classifier will never be confident enough to predict $\hat Y = 1$).
Imbalanced data leads to two potential problems:
  • Some methods will build poor estimates of $P(Y = 1 \mid X = x)$. This is usually fixed with weighting or resampling methods (a weighting sketch follows this list).
  • Misclassification rates can be nearly impossible to improve, even when we estimate $P(Y = 1 \mid X = x)$ correctly. This cannot be fixed by weighting or resampling; however, we can modify the metric.
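One hedged illustration of the weighting idea: reweight each observation's contribution to the log likelihood inversely to its class frequency (the "balanced" heuristic; libraries such as scikit-learn expose this kind of option as class_weight):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_log_likelihood(xs, ys, b0, b1):
    """Log likelihood with each class reweighted inversely to its
    frequency, so the rare class is not swamped by the common one."""
    n, n1 = len(ys), sum(ys)
    w1, w0 = n / (2 * n1), n / (2 * (n - n1))
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        ll += w1 * math.log(p) if y == 1 else w0 * math.log(1 - p)
    return ll
```

Note that maximizing this instead of the unweighted likelihood deliberately distorts the fitted probabilities toward the rare class, so the output may need recalibration if probability estimates themselves matter.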

ROC & Other Metrics

ROC

Sensitivity (Recall, Hit rate)
  • Fraction of points with $Y = 1$ that we find: $\mathrm{Sensitivity} = \frac{TP}{TP + FN}$
  • We want high sensitivity when False Negatives are more costly than False Positives, e.g. fraud detection.
Specificity (True Negative Rate, TNR)
  • Fraction of points with $Y = 0$ that we correctly avoid flagging: $\mathrm{Specificity} = \frac{TN}{TN + FP}$
  • We want high specificity when False Positives are more costly than False Negatives, e.g. criminal trials.
We would like both of them to be high, but we face a tradeoff as we adjust our threshold. For example, we can flag more cases as fraud to improve sensitivity, but this will decrease specificity.
Therefore, to trade off (asymmetric) costs and error rates, we can simply change the threshold on $\hat P(Y = 1 \mid X = x)$. It is convenient to look at the sensitivity and specificity of our classifier as the threshold changes. This corresponds to the ROC (Receiver Operating Characteristic) curve.
(figure: example ROC curves)
A generally higher curve is better, though there is still a tradeoff between sensitivity and specificity. The Area Under the Curve (AUC) is one way to measure this; a higher area tends to be more attractive.
AUC has a nice interpretation: if we randomly pick one observation from each class and order them by $\hat P(Y = 1 \mid X = x)$, the AUC is the probability that our ordering is correct (see the sketch below).
In reality, though, we are usually interested in performance at a particular cutoff, i.e. at one point on the curve.
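A minimal sketch of both ideas (plain Python, hypothetical labels and scores): sweeping the threshold to trace the ROC curve, and computing AUC directly from its pairwise-ordering interpretation:

```python
def roc_points(y_true, scores):
    """Sweep the threshold over every observed score and record
    (1 - specificity, sensitivity) pairs, i.e. points on the ROC curve."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    pts = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        pts.append((fp / neg, tp / pos))
    return pts

def auc_by_pairs(y_true, scores):
    """AUC via its probabilistic interpretation: the fraction of
    (positive, negative) pairs ordered correctly (ties count 1/2)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

y = [0, 0, 1, 0, 1, 1]
s = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
print(auc_by_pairs(y, s))  # 0.666...
```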

Other Metrics

Positive Predictive Value (PPV)
Suppose you take a test for a disease. The PPV captures the probability that you actually have the disease given that your test is positive: $\mathrm{PPV} = \frac{TP}{TP + FP}$.
Negative Predictive Value (NPV)
Suppose you take a test for a disease. The NPV is the probability that you are not sick, given that the test comes up negative: $\mathrm{NPV} = \frac{TN}{TN + FN}$.
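For example (hypothetical numbers), with 5% prevalence and a test with 90% sensitivity and 90% specificity, Bayes' theorem gives
$$\mathrm{PPV} = \frac{0.9 \times 0.05}{0.9 \times 0.05 + 0.1 \times 0.95} = \frac{0.045}{0.14} \approx 0.32,$$
so even a positive result from a fairly accurate test leaves only about a one-in-three chance of disease when the base rate is low.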
Some other metrics
(figure: a summary table of additional metrics)

Calibration Plot

We wonder whether our $\hat P(Y = 1 \mid X = x)$ is a good estimate of $P(Y = 1 \mid X = x)$. To check how well calibrated our $\hat P$ is, we can make a calibration plot:
  • Bin the data according to the predicted probability $\hat P(Y = 1 \mid X = x)$, e.g. into $[0, 0.1), [0.1, 0.2), \dots, [0.9, 1]$.
  • For each bin, calculate the proportion of observations with class $Y = 1$.
  • Plot these true fractions against the midpoints of the bins' predicted probabilities (as in the sketch below).
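A minimal sketch of these steps (plain Python; plotting the returned points against the 45-degree line is left to any plotting library):

```python
def calibration_points(y_true, p_hat, n_bins=10):
    """Bin predictions into equal-width probability bins and return
    (bin midpoint, observed fraction of Y = 1) for each non-empty bin."""
    pts = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # The last bin is closed on the right so that p_hat == 1.0 is included.
        ys = [y for y, p in zip(y_true, p_hat)
              if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if ys:
            pts.append(((lo + hi) / 2, sum(ys) / len(ys)))
    return pts
```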
(figure: calibration plot example)
For the example plot above, the random forest (rf) appears better calibrated, because its curve is closer to the 45-degree line. The qda model is less well calibrated, underestimating $P(Y = 1 \mid X = x)$ when it reports low probabilities and overestimating it when it reports high probabilities.
