
L3. Regression and Classification


0. Motivation

Given a set of data points $\{x_i\}_{i=1}^N$ and the corresponding labels $\{y_i\}_{i=1}^N$, for a new data point $x$, predict its label $y$. Our goal is to find a mapping $f: \mathcal{X} \to \mathcal{Y}$:
  • If $\mathcal{Y}$ is a continuous set, this is called regression
[Figure: fitting a continuous curve to data points]
  • If $\mathcal{Y}$ is a discrete set, this is called classification
[Figure: separating data points into discrete classes]
Note that even for classification, where $\mathcal{Y}$ is a discrete set, we can still fit a continuous function and use it for classification. This is exactly what logistic regression and softmax regression do.

1. Logistic Regression

1.1 Linear regression & classification

Linear regression
$f(x) = w^\top x + b$ is linear in $x$,
where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$. The intercept term $b$ can be absorbed by appending a constant 1 to $x$, giving a new vector $\tilde{x} = [x; 1]$, and then $f(x) = \tilde{w}^\top \tilde{x}$.
In linear regression, we choose the cost function as the mean squared error (MSE)
$$J(w) = \frac{1}{N}\sum_{i=1}^N \big(y_i - w^\top x_i\big)^2.$$
Then we can find the optimal $w$ (and $b$) by minimizing the cost function, $w^* = \arg\min_w J(w)$.
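As a concrete illustration, here is a minimal NumPy sketch (the toy data and variable names are my own, not the lecture's) that absorbs the intercept into the weight vector and solves the least-squares problem directly:

```python
import numpy as np

# Toy 1-D regression data (assumed for illustration): y ≈ 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 2 * x[:, 0] + 1 + 0.1 * rng.standard_normal(100)

# Absorb the intercept: append a constant 1 to every input
X = np.hstack([x, np.ones((x.shape[0], 1))])   # shape (N, d+1)

# Minimize the MSE: closed-form least-squares solution for w
w, *_ = np.linalg.lstsq(X, y, rcond=None)      # w = [slope, intercept]
print("learned w:", w)
print("training MSE:", np.mean((X @ w - y) ** 2))
```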
Linear classification
In the feature space, a linear classifier corresponds to a hyperplane.
[Figure: a linear classifier corresponds to a separating hyperplane in feature space]
There are two typical linear classifiers: the Perceptron and the Support Vector Machine (SVM). Note that while the (single-layer) Perceptron can only handle linearly separable data, the SVM (with soft margins or kernels) can also handle data that are not linearly separable.
We can actually use linear regression to do classification. Take binary classification for example: the data are labelled with $y \in \{0, 1\}$. We can regress $y$ on $x$, which yields a linear regression model $f(x) = w^\top x$.
In implementation, we can set a threshold (0.5, for example). When the mapped value $f(x)$ of a new data point is larger than the threshold, we label it as 1; otherwise we label it as 0. That is
$$\hat{y} = \begin{cases} 1 & \text{if } f(x) > 0.5, \\ 0 & \text{otherwise.} \end{cases}$$
[Figure: a linear fit to 0/1-labelled data, thresholded at 0.5]
Intuitively, from the figure above we can observe that a linear model is not the best choice for fitting this dataset, so nonlinear functions are introduced.

1.2 Logistic regression

Actually, $f$ can be any kind of non-linear function (a polynomial, $\tanh$, etc.), as long as its curve is suitable for fitting the classification dataset. One commonly used non-linear function is the logistic sigmoid
$$f(x) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}.$$
In implementation, we again threshold at 0.5: predict class 1 if $\sigma(w^\top x) > 0.5$ (equivalently $w^\top x > 0$), and class 0 otherwise.
[Figure: the logistic sigmoid curve fitted to 0/1-labelled data]
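A minimal sketch of the sigmoid and this thresholding rule (the function names are hypothetical, not from the notes):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid; clipping avoids overflow in exp for large |z|."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def predict_label(w, x, threshold=0.5):
    """Label a point as 1 if sigma(w^T x) exceeds the threshold, else 0."""
    return int(sigmoid(w @ x) > threshold)
```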

1.3 Training and testing

Suppose $f(x; w) = \sigma(w^\top x)$ is a nonlinear function of $x$,
where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
Recall that the training process of supervised learning tries to estimate a conditional probability $p(y \mid x)$. Intuitively, we choose the parameter $w$ by maximum likelihood estimation.
Normal distribution assumption & MSE function
Assume that the label $y$ follows a normal distribution with mean $f(x; w)$, that is
$$p(y \mid x; w) = \mathcal{N}\big(y \mid f(x; w), \sigma_0^2\big).$$
Note that no particular value of the variance $\sigma_0^2$ is assumed; it is not needed.
Given a dataset $\{(x_i, y_i)\}_{i=1}^N$, where the samples are i.i.d., view $x$ and $y$ as random variables. The conditional data likelihood function is
$$L(w) = \prod_{i=1}^N p(y_i \mid x_i; w) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\Big(-\frac{(y_i - f(x_i; w))^2}{2\sigma_0^2}\Big).$$
Thus maximizing $L(w)$ is equivalent to minimizing
$$J(w) = \frac{1}{N}\sum_{i=1}^N \big(y_i - f(x_i; w)\big)^2,$$
where $J(w)$ is the MSE function.
Bernoulli distribution assumption & Cross-entropy function
It’s reasonable to assume the labels follow a normal distribution for regression, but that assumption makes little sense for classification. A more suitable assumption is the Bernoulli distribution.
For 2-class problems, one 0/1 unit is enough to represent a label. We try to learn a conditional probability
$$p(y = 1 \mid x; w) = f(x; w) = \sigma(w^\top x), \qquad p(y = 0 \mid x; w) = 1 - f(x; w).$$
Note that the output range $(0, 1)$ of the logistic sigmoid makes it a suitable non-linear function for our model.
[Figure: the sigmoid squashes $w^\top x$ into the interval $(0, 1)$]
Written compactly,
$$p(y \mid x; w) = f(x; w)^{y}\,\big(1 - f(x; w)\big)^{1 - y}$$
is a Bernoulli distribution.
Our goal is to search for a value of $w$ so that the probability $f(x; w) = p(y = 1 \mid x; w)$ is
  • large when $y = 1$
  • small when $y = 0$ (so that $1 - f(x; w)$ is large)
Given a dataset $\{(x_i, y_i)\}_{i=1}^N$, where the samples are i.i.d. and $y_i \in \{0, 1\}$, view $y$ as a Bernoulli variable. The conditional likelihood function is then
$$L(w) = \prod_{i=1}^N f(x_i; w)^{y_i}\,\big(1 - f(x_i; w)\big)^{1 - y_i}.$$
Maximizing the likelihood is equivalent to minimizing
$$J(w) = -\frac{1}{N}\sum_{i=1}^N \Big[y_i \log f(x_i; w) + (1 - y_i)\log\big(1 - f(x_i; w)\big)\Big],$$
where $J(w)$ is the cross-entropy (CE) function.
Note that for the sigmoid, $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$. Thus the gradient simplifies to
$$\nabla_w J(w) = \frac{1}{N}\sum_{i=1}^N \big(f(x_i; w) - y_i\big)\, x_i.$$
Training and Testing
Some regularization term can be incorporated into the cost function, e.g. $J(w) + \lambda \lVert w \rVert^2$.
Training: learn $w$ by minimizing the cost function, e.g. with gradient descent $w \leftarrow w - \eta \nabla_w J(w)$.
Testing (implementation): for a new input $x$, if $f(x; w) > 0.5$ we predict class 1, and class 0 otherwise.
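Putting the pieces together, a minimal gradient-descent sketch of logistic regression with an optional L2 term (all names and hyperparameters here are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_logistic(X, y, lr=0.1, lam=0.0, n_iters=1000):
    """X: (N, d) with a bias column already appended; y: (N,) with entries in {0, 1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)                       # f(x_i; w) for every sample
        grad = X.T @ (p - y) / N + 2 * lam * w   # CE gradient plus L2 term
        w -= lr * grad
    return w

def predict(X, w, threshold=0.5):
    return (sigmoid(X @ w) > threshold).astype(int)
```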
Summary
[Figure: summary of logistic regression]

2. Softmax Regression

2.1 Linear regression for vectors

If $y \in \mathbb{R}^K$ is a continuous vector, then use a linear function to regress $y$ on $x$:
$$\hat{y} = W^\top x + b,$$
where $W \in \mathbb{R}^{d \times K}$ and $b \in \mathbb{R}^K$ (the bias can again be absorbed into $W$).
[Figure: network diagram of linear regression with a vector-valued output]
Choose the mean squared error (MSE) as the cost function
$$J(W) = \frac{1}{N}\sum_{i=1}^N \lVert y_i - W^\top x_i \rVert^2.$$
Then we can find the optimal $W$ (and $b$) by setting the gradient to zero and solving the resulting linear system (the normal equations).
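For example, a sketch using `np.linalg.lstsq`, which solves the least-squares problem for all output columns at once (toy data assumed for illustration):

```python
import numpy as np

# Toy data assumed for illustration: N samples, d inputs, K outputs
N, d, K = 200, 3, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))
W_true = rng.standard_normal((d, K))
Y = X @ W_true + 0.05 * rng.standard_normal((N, K))

# Append a bias column, then solve min_W ||Y - X_aug W||^2 (the normal equations)
X_aug = np.hstack([X, np.ones((N, 1))])            # (N, d+1)
W_hat, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)  # (d+1, K), one column per output
print(W_hat.shape)
```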

2.2 Non-linear regression for vectors & multi-class classification

Non-linear regression for vectors
The mapping can also be nonlinear: for the $k$-th output,
$$\hat{y}_k = f(z_k), \qquad z_k = w_k^\top x, \qquad k = 1, \dots, K,$$
where $w_k$ is the $k$-th column of $W$. Here $f$ can be any kind of nonlinear function ($\tanh$, etc.); one commonly used function is the logistic sigmoid $f(z) = \frac{1}{1 + e^{-z}}$.
We can choose the (per-sample) MSE as the cost function
$$E = \frac{1}{2}\sum_{k=1}^K (\hat{y}_k - y_k)^2.$$
With $z_k = w_k^\top x$ and $\hat{y}_k = f(z_k)$,
$$\delta_k = \frac{\partial E}{\partial z_k} = (\hat{y}_k - y_k)\, f'(z_k)$$
is called the local sensitivity or local gradient, and $\frac{\partial E}{\partial w_k} = \delta_k\, x$.
Matrix differentiation rules: with $d$ the number of inputs and $K$ the number of outputs, $W \in \mathbb{R}^{d \times K}$, $z = W^\top x \in \mathbb{R}^K$, and
$$\frac{\partial E}{\partial W} = x\,\delta^\top \in \mathbb{R}^{d \times K}, \qquad \delta = \frac{\partial E}{\partial z}.$$
In vector form, the output is
$$\hat{y} = f(z), \qquad z = W^\top x,$$
where $f$ is applied elementwise. The error function is
$$E = \frac{1}{2}\lVert \hat{y} - y \rVert^2.$$
The gradients are
$$\delta = (\hat{y} - y) \odot f'(z), \qquad \frac{\partial E}{\partial W} = x\,\delta^\top.$$
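A per-sample sketch of these formulas with the sigmoid as activation (variable names are mine, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient(W, x, y):
    """Per-sample error E = 0.5 * ||f(W^T x) - y||^2 and its gradient w.r.t. W.

    W: (d, K), x: (d,), y: (K,). Returns (E, dE/dW) with dE/dW of shape (d, K).
    """
    z = W.T @ x                                  # pre-activations, shape (K,)
    y_hat = sigmoid(z)                           # outputs
    E = 0.5 * np.sum((y_hat - y) ** 2)
    delta = (y_hat - y) * y_hat * (1 - y_hat)    # local sensitivity: (y_hat - y) * f'(z)
    dW = np.outer(x, delta)                      # dE/dW = x delta^T
    return E, dW
```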
Multi-class classification
For classification, given $x \in \mathbb{R}^d$, the goal is to find a mapping from $x$ to $y$, i.e.
$$f: \mathbb{R}^d \to \mathcal{Y},$$
where $\mathcal{Y}$ is a discrete set. $y$ can be a (discrete) scalar or a vector, as shown below (an example with 5 classes).
[Figure: scalar labels $1, \dots, 5$ versus the corresponding one-hot vectors]
The scalar representation is seldom used because it attributes an unfavorable artificial ordering to the classes: for example, the difference between class 1 and class 3 is 2, while the difference between class 1 and class 5 is 4, even though the classes have no such ordering. The vector representation (one-hot encoding) avoids this issue and also has the property $y_k \in \{0, 1\}$, $\sum_{k=1}^K y_k = 1$, so $y$ can be read as a probability distribution over the classes.
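One-hot encoding is a few lines with NumPy indexing (a sketch; labels are assumed to be 0-indexed integers):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels in {0, ..., K-1} to one-hot row vectors."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

print(one_hot([0, 2, 4], 5))
# [[1. 0. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 1.]]
```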

2.3 Training and testing

Normal distribution assumption & MSE function
Assume the label $y$ follows a normal distribution with mean $f(x; W)$, i.e.
$$p(y \mid x; W) = \mathcal{N}\big(y \mid f(x; W), \sigma_0^2 I\big).$$
Again, no particular value of the variance is assumed; it is not needed.
Given a dataset $\{(x_i, y_i)\}_{i=1}^N$, view $x$ and $y$ as random variables with i.i.d. samples. The conditional data likelihood function is
$$L(W) = \prod_{i=1}^N \mathcal{N}\big(y_i \mid f(x_i; W), \sigma_0^2 I\big),$$
and maximizing it is again equivalent to minimizing the MSE.
Multinoulli distribution assumption & Cross-entropy function
For regression, where $y$ is continuous, the normal distribution assumption is natural, while for classification, where $y$ is discrete, it is strange. A better assumption is the multinoulli (categorical) distribution
$$p(y = k \mid x) = \mu_k, \qquad k = 1, \dots, K,$$
where $\mu_k \ge 0$ and $\sum_{k=1}^K \mu_k = 1$. Using the one-hot representation $y = (y_1, \dots, y_K)$ with $y_k \in \{0, 1\}$ and $\sum_k y_k = 1$, then
$$p(y \mid x) = \prod_{k=1}^K \mu_k^{y_k}.$$
An example is shown below.
[Figure: a categorical distribution over 5 classes and the corresponding one-hot label]
For a $K$-class problem ($K > 2$), we are trying to learn a multinoulli distribution $p(y = k \mid x; W) = \mu_k(x)$ for $k = 1, \dots, K$. Let $\mu_k$ take the following (softmax) form
$$\mu_k(x) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}.$$
Clearly, $\mu_k(x) > 0$ and $\sum_{k=1}^K \mu_k(x) = 1$.
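A numerically stable softmax sketch; subtracting the per-row maximum before exponentiating does not change the result, by the same shift-invariance discussed in Section 3.2:

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; shifting by the max avoids overflow in exp."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

mu = softmax(np.array([2.0, 1.0, 0.1]))
print(mu, mu.sum())   # all components positive, summing to 1
```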
Our goal is to search for a value of $W$ so that the probability $\mu_k(x)$ is
  • large when $x$ belongs to the $k$-th class
  • small when $x$ belongs to other classes
where $W = [w_1, \dots, w_K] \in \mathbb{R}^{d \times K}$ collects the per-class weight vectors.
Given a dataset $\{(x_i, y_i)\}_{i=1}^N$ with one-hot labels, view $x$ and $y$ as random variables with i.i.d. samples. The conditional data likelihood function is
$$L(W) = \prod_{i=1}^N \prod_{k=1}^K \mu_k(x_i)^{y_{ik}}.$$
Maximizing this likelihood function is equivalent to minimizing
$$J(W) = -\frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K y_{ik} \log \mu_k(x_i),$$
where $J(W)$ is the cross-entropy (CE) function.
Calculate the gradient:
Recall the derivative of a two-step composition ($w_j \to z_j \to \mu_k \to E$), where the per-sample error is $E = -\sum_k y_k \log \mu_k$ and $z_j = w_j^\top x$:
$$\frac{\partial E}{\partial z_j} = \sum_{k=1}^K \frac{\partial E}{\partial \mu_k}\,\frac{\partial \mu_k}{\partial z_j},
\qquad \frac{\partial E}{\partial \mu_k} = -\frac{y_k}{\mu_k}.$$
Therefore we need $\frac{\partial \mu_k}{\partial z_j}$, where $\mu_k = \frac{e^{z_k}}{\sum_l e^{z_l}}$.
As for $\frac{\partial \mu_k}{\partial z_j}$:
  • If $j \neq k$, $z_j$ appears only in the denominator, so $\frac{\partial \mu_k}{\partial z_j} = -\mu_k \mu_j$
  • If $j = k$, $z_j$ appears in both numerator and denominator, so $\frac{\partial \mu_k}{\partial z_k} = \mu_k (1 - \mu_k)$
Therefore
$$\frac{\partial \mu_k}{\partial z_j} = \mu_k (\delta_{kj} - \mu_j),$$
where $\delta_{kj}$ is the Kronecker delta. Then
$$\frac{\partial E}{\partial z_j} = \sum_{k} \Big(-\frac{y_k}{\mu_k}\Big)\,\mu_k (\delta_{kj} - \mu_j) = \mu_j \sum_k y_k - y_j = \mu_j - y_j,$$
where $\frac{\partial E}{\partial z_j}$ is the local gradient or local sensitivity (we used $\sum_k y_k = 1$). Hence $\frac{\partial E}{\partial w_j} = (\mu_j - y_j)\, x$.
The average over the $N$ samples is
$$\frac{\partial J}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N \big(\mu_j(x_i) - y_{ij}\big)\, x_i.$$
Vector-matrix form
The output is the softmax function, $\hat{y} = \mathrm{softmax}(z)$ where $z = W^\top x$.
The gradient of the cross-entropy error function is
$$\frac{\partial J}{\partial W} = \frac{1}{N}\sum_{i=1}^N x_i\,(\hat{y}_i - y_i)^\top.$$
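In code, the forward pass and this gradient fit in a few lines (a sketch with my own names; labels are one-hot rows as above):

```python
import numpy as np

def ce_loss_and_grad(W, X, Y):
    """Cross-entropy loss and gradient for softmax regression.

    W: (d, K), X: (N, d), Y: (N, K) one-hot. Returns (J, dJ/dW).
    """
    Z = X @ W                                             # (N, K) class scores
    Z = Z - Z.max(axis=1, keepdims=True)                  # stabilize the softmax
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # mu_k(x_i) for every sample
    N = X.shape[0]
    J = -np.sum(Y * np.log(P + 1e-12)) / N                # average cross-entropy
    dW = X.T @ (P - Y) / N                                # (d, K): average of x_i (mu_i - y_i)^T
    return J, dW
```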
Training and Testing
As before, some regularization term can be incorporated into the cost function, e.g. $J(W) + \lambda \lVert W \rVert_F^2$.
Training: minimize the cost function by gradient descent,
$$W \leftarrow W - \eta\,\frac{\partial J}{\partial W},$$
where $\eta$ is the learning rate.
Testing: for a new input $x$, predict the class with the maximum probability, $\hat{k} = \arg\max_k \mu_k(x)$.
Summary
[Figure: summary of softmax regression]

3. Other Issues

3.1 Stochastic gradient descent (SGD)

[Figure: illustration of stochastic gradient descent]
  • Minimizing the cost function over the entire training set is computationally expensive
  • Decompose the training set into minibatches and optimize the cost function defined over individual minibatches
    • The batch size typically ranges from 1 to a few hundred
  • At every iteration, update $W$ on a minibatch $\mathcal{B}$ as follows (see the sketch below):
$$W \leftarrow W - \eta \cdot \frac{1}{|\mathcal{B}|}\sum_{i \in \mathcal{B}} x_i\,\big(\hat{y}_i - y_i\big)^\top$$
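A minibatch SGD sketch that reuses the `ce_loss_and_grad` function sketched in Section 2.3 (batch size and learning rate are illustrative assumptions):

```python
import numpy as np

def sgd_train(X, Y, lr=0.1, batch_size=64, epochs=20, seed=0):
    """Minibatch SGD for softmax regression; X: (N, d), Y: (N, K) one-hot."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    K = Y.shape[1]
    W = np.zeros((d, K))
    for _ in range(epochs):
        order = rng.permutation(N)                       # reshuffle every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            _, dW = ce_loss_and_grad(W, X[idx], Y[idx])  # gradient on the minibatch
            W -= lr * dW                                 # parameter update
    return W
```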

3.2 Softmax is over-parameterized

The hypothesis is
$$p(y = k \mid x; W) = \frac{e^{w_k^\top x}}{\sum_{j=1}^K e^{w_j^\top x}}.$$
If we subtract the same vector $\psi$ from every column, the new parameters $\tilde{w}_k = w_k - \psi$ result in the same prediction:
$$\frac{e^{(w_k - \psi)^\top x}}{\sum_{j} e^{(w_j - \psi)^\top x}} = \frac{e^{w_k^\top x}\, e^{-\psi^\top x}}{\sum_{j} e^{w_j^\top x}\, e^{-\psi^\top x}} = \frac{e^{w_k^\top x}}{\sum_{j} e^{w_j^\top x}}.$$
Therefore, minimizing the cross-entropy function has an infinite number of solutions, because
$$J(w_1 - \psi, \dots, w_K - \psi) = J(w_1, \dots, w_K)$$
for any $\psi \in \mathbb{R}^d$. In practice one can fix one of the weight vectors (e.g. $w_K = 0$, which recovers logistic regression when $K = 2$) or add a regularization term to make the solution unique.
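A quick numerical check of this shift-invariance (a standalone sketch):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, K = 4, 3
W = rng.standard_normal((d, K))
psi = rng.standard_normal(d)
x = rng.standard_normal(d)

p1 = softmax(W.T @ x)
p2 = softmax((W - psi[:, None]).T @ x)   # subtract the same psi from every w_k
print(np.allclose(p1, p2))               # True: identical predictions
```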

Summary

Perspective I
[Figure: summary diagram (Perspective I)]
Perspective II
Nonlinear regression (linear regression as a special case):
  • Output: $\hat{y} = f(W^\top x)$, where $f$ can be any activation function
  • MSE: $J(W) = \frac{1}{2N}\sum_{i=1}^N \lVert y_i - \hat{y}_i \rVert^2$
  • Gradient: $\frac{\partial J}{\partial W} = \frac{1}{N}\sum_{i=1}^N x_i\,\big[(\hat{y}_i - y_i)\odot f'(z_i)\big]^\top$
Softmax regression (logistic regression as a special case):
  • Output: $\hat{y} = \mathrm{softmax}(W^\top x)$, where $\mathrm{softmax}$ is the softmax function
  • Cross-entropy error: $J(W) = -\frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K y_{ik}\log \hat{y}_{ik}$
  • Gradient: $\frac{\partial J}{\partial W} = \frac{1}{N}\sum_{i=1}^N x_i\,(\hat{y}_i - y_i)^\top$

Appendix: Implementation of Softmax Classification

💡 DL & Finance HW2: Report
