TOC
0. Motivation
1. Logistic Regression
1.1 Linear regression & classification
1.2 Logistic regression
1.3 Training and testing
2. Softmax Regression
2.1 Linear regression for vectors
2.2 Non-linear reg. for vec & Multilabel cls.
2.3 Training and testing
3. Other Issues
3.1 Stochastic gradient descent (SGD)
3.2 Softmax is over-parameterized
Summary
Appendix: Implementation of Softmax cls.
0. Motivation
Given a set of data points $\{x^{(i)}\}_{i=1}^{N}$ and the corresponding labels $\{y^{(i)}\}_{i=1}^{N}$, for a new data point $x$, predict the label $y$. Our goal is to find a mapping $f: \mathcal{X} \to \mathcal{Y}$:
- If $\mathcal{Y}$ is a continuous set, this is called regression
- If $\mathcal{Y}$ is a discrete set, this is called classification
Note that, even for classification where $\mathcal{Y}$ is a discrete set, we can still fit a continuous function and use it for classification. This is actually what Logistic regression and Softmax regression do.
1. Logistic Regression
1.1 Linear regression & classification
Linear regression
The mapping
$$f(x) = w^T x + b$$
is linear, where $w \in \mathbb{R}^{d}$ and $b \in \mathbb{R}$. The intercept term $b$ can be absorbed into a new vector $\tilde{w} = (w^T, b)^T$ by appending a constant 1 to $x$, and then $f(x) = \tilde{w}^T \tilde{x}$.
In linear regression, we choose the cost function as the mean squared error (MSE)
$$J(w) = \frac{1}{2N}\sum_{i=1}^{N}\left(w^T x^{(i)} - y^{(i)}\right)^2.$$
Then we can find the optimal $w$ and $b$ by minimizing the cost function.
Linear classification
In the feature space, a linear classifier corresponds to a hyperplane.
There are two typical linear classifiers: Perceptron and Support vector machine (SVM). Note that while Perceptron (single-layer) can only handle linearly separable data, SVM can handle non-linearly separable data.
We can actually use linear regression to do classification. Take binary classification for example, where the data are labelled with $y \in \{0, 1\}$. We can regress $y$ on $x$, which yields a linear regression model $f(x) = w^T x$.
In implementation, we can set a threshold (0.5, for example). When the predicted value $f(x)$ of a new data point is larger than the threshold, we label it 1; otherwise, we label it 0. That is, $\hat{y} = \mathbb{1}[f(x) > 0.5]$.
Intuitively, a linear function is not the best choice for fitting such a 0/1-labelled dataset. Thus nonlinear functions are introduced.
1.2 Logistic regression
Actually, $f$ can be any kind of non-linear function, as long as we think its curve is suitable to fit the classification dataset. One commonly used non-linear function is the logistic sigmoid function
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad f(x) = \sigma(w^T x).$$
In implementation, we again threshold the output at 0.5: predict class 1 when $\sigma(w^T x) > 0.5$ (equivalently, when $w^T x > 0$) and class 0 otherwise, as in the sketch below.
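To make this concrete, here is a minimal numpy sketch of the sigmoid and the thresholding rule (the function names are mine, not from the original post):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, x, threshold=0.5):
    """Label a sample 1 if sigma(w^T x) exceeds the threshold, else 0.

    Assumes the bias has been absorbed into w (x is padded with a constant 1).
    """
    return int(sigmoid(w @ x) > threshold)
```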
1.3 Training and testing
Suppose $f(x; w)$ is a nonlinear function of $x$, where $w$ is the parameter vector to be learned.
Recall that the training process of supervised learning tries to estimate a conditional probability $p(y \mid x)$. Intuitively, we choose the parameter $w$ according to maximum likelihood estimation.
Normal distribution assumption & MSE function
Assume that the label $y$ follows a normal distribution with mean $f(x; w)$, that is,
$$p(y \mid x; w) = \mathcal{N}\!\left(y \mid f(x; w),\, \sigma^2\right).$$
Note that there is no assumption about the specific value of the variance of the labels; it is not needed.
Given a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where the samples are i.i.d., view $x$ and $y$ as random variables. The conditional data likelihood function is
$$L(w) = \prod_{i=1}^{N} p\!\left(y^{(i)} \mid x^{(i)}; w\right).$$
Thus maximizing $L(w)$ is equivalent to minimizing
$$J_{\text{MSE}}(w) = \frac{1}{2N}\sum_{i=1}^{N}\left(y^{(i)} - f(x^{(i)}; w)\right)^2,$$
where $J_{\text{MSE}}$ is the MSE function.
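Spelling out the step (with $\sigma^2$ the unspecified label variance), the negative log-likelihood is
$$-\log L(w) = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y^{(i)} - f(x^{(i)}; w)\right)^2 + \frac{N}{2}\log\!\left(2\pi\sigma^2\right),$$
so minimizing it over $w$ is exactly minimizing the squared-error term; the variance only rescales and shifts the objective, which is why its value does not matter.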
Bernoulli distribution assumption & Cross-entropy function
It is reasonable to assume the labels follow a normal distribution for regression. However, it makes little sense to assume that for classification. A more suitable assumption is the Bernoulli distribution.
For 2-class problems, one 0-1 unit is enough for representing a label. We try to learn a conditional probability $p(y = 1 \mid x)$.
Note that the $(0, 1)$ output range of the logistic sigmoid function makes it a suitable non-linear function for our model, since its output can be interpreted as a probability.
Let $f(x) = \sigma(w^T x)$ model $p(y = 1 \mid x; w)$. Then
$$p(y \mid x; w) = f(x)^{y}\left(1 - f(x)\right)^{1 - y}, \qquad y \in \{0, 1\},$$
is a Bernoulli distribution.
Our goal is to search for a value of $w$ so that the probability $f(x) = p(y = 1 \mid x; w)$ is
- large when $y = 1$
- small when $y = 0$ (so that $1 - f(x) = p(y = 0 \mid x; w)$ is large)
Given a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where the samples are i.i.d. and $y^{(i)} \in \{0, 1\}$, view $y$ as a Bernoulli variable. The conditional likelihood function is then
$$L(w) = \prod_{i=1}^{N} f\!\left(x^{(i)}\right)^{y^{(i)}}\left(1 - f\!\left(x^{(i)}\right)\right)^{1 - y^{(i)}}.$$
Maximizing the likelihood is equivalent to minimizing
$$J_{\text{CE}}(w) = -\frac{1}{N}\sum_{i=1}^{N}\left[y^{(i)} \log f\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right)\log\!\left(1 - f\!\left(x^{(i)}\right)\right)\right],$$
where $J_{\text{CE}}$ is the Cross-entropy (CE) function.
Note that for the logistic sigmoid, $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$.
Thus the gradient takes the simple form
$$\nabla_w J_{\text{CE}} = \frac{1}{N}\sum_{i=1}^{N}\left(f\!\left(x^{(i)}\right) - y^{(i)}\right) x^{(i)}.$$
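The formulas above translate directly into a few lines of numpy; this is only a sketch under my own naming (X holds one sample per row, with the bias absorbed via a column of ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_ce_and_grad(w, X, y, eps=1e-12):
    """Cross-entropy J_CE(w) and its gradient for logistic regression.

    X: (N, d) design matrix, y: (N,) array of 0/1 labels.
    The gradient is (1/N) * X^T (f - y), matching the formula above.
    """
    f = sigmoid(X @ w)                 # predicted p(y = 1 | x) for each sample
    J = -np.mean(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))
    grad = X.T @ (f - y) / len(y)
    return J, grad
```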
Training and Testing
Some regularization term can be incorporated into the cost function, e.g. $J(w) + \lambda \|w\|^2$.
Training: learn $w$ to minimize the cost function, e.g. by gradient descent $w \leftarrow w - \eta\,\nabla_w J(w)$.
Testing (implementation): for a new input $x$, if $f(x) = \sigma(w^T x) > 0.5$ then we predict the input as class 1, and class 0 otherwise. A minimal end-to-end sketch follows.
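Here is a small sketch of training and testing, assuming full-batch gradient descent and an optional L2 penalty (hyperparameter values are placeholders, not recommendations):

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500, lam=0.0):
    """Gradient descent on the (optionally L2-regularized) cross-entropy."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        f = sigmoid(X @ w)
        grad = X.T @ (f - y) / len(y) + 2 * lam * w
        w -= lr * grad
    return w

def predict_logistic(w, X):
    """Threshold at 0.5, i.e. predict 1 exactly when w^T x > 0."""
    return (X @ w > 0).astype(int)
```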
Summary
2. Softmax Regression
2.1 Linear regression for vectors
If the target $y \in \mathbb{R}^{K}$ is a continuous vector, then use a linear function to regress for $y$:
$$f(x) = W^T x,$$
where $W = [w_1, \dots, w_K] \in \mathbb{R}^{d \times K}$ (the bias terms can again be absorbed into $W$ by appending a constant 1 to $x$).
Choose the mean squared error (MSE) as the cost function
$$J(W) = \frac{1}{2N}\sum_{i=1}^{N}\left\|W^T x^{(i)} - y^{(i)}\right\|^2.$$
Then we can find the optimal $W$ by solving the linear system $\nabla_W J = 0$, i.e. the normal equations $\left(\sum_i x^{(i)} x^{(i)T}\right) W = \sum_i x^{(i)} y^{(i)T}$.
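In code, the normal equations are usually handled by a least-squares routine rather than an explicit inverse; a small numpy sketch (my naming):

```python
import numpy as np

def fit_linear(X, Y):
    """Solve min_W ||X W - Y||_F^2, i.e. the normal equations (X^T X) W = X^T Y.

    X: (N, d) inputs (append a column of ones for the bias),
    Y: (N, K) continuous target vectors.  Returns W of shape (d, K).
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W
```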
2.2 Non-linear reg. for vec & Multilabel cls.
Non-linear regression for vectors
The mapping can also be nonlinear, for example
$$f(x) = h(W^T x),$$
where $W \in \mathbb{R}^{d \times K}$ and $h(\cdot)$ is applied element-wise. $h$ can be any kind of nonlinear function. One commonly used function is the logistic sigmoid
$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$
We can choose the MSE as the cost function
$$J(W) = \frac{1}{2}\,\|z - y\|^2, \qquad z = h(a), \quad a = W^T x.$$
Denote the pre-activation by $a = W^T x$ and the output by $z = h(a)$; then
$$\delta = \frac{\partial J}{\partial a} = (z - y) \odot h'(a)$$
is called the local sensitivity or local gradient, and $\frac{\partial J}{\partial W} = x\,\delta^{T}$.
Matrix Differentiation Rules: for $a = W^T x$ with $W \in \mathbb{R}^{d \times K}$,
$$\frac{\partial J}{\partial W} = x\left(\frac{\partial J}{\partial a}\right)^{T} \in \mathbb{R}^{d \times K},$$
where $d$ is the number of inputs and $K$ is the number of outputs.
To summarize: the output is
$$z = h(a), \qquad a = W^T x,$$
where $z \in \mathbb{R}^{K}$. The error function is
$$J = \frac{1}{2}\,\|z - y\|^2.$$
The gradients are
$$\frac{\partial J}{\partial W} = x\,\delta^{T}, \qquad \delta = (z - y) \odot h'(a),$$
averaged over the training samples; see the sketch below.
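The forward and backward pass above, written out for a single sample with a sigmoid nonlinearity (a sketch under my own conventions: x of shape (d, 1), y of shape (K, 1)):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_backward(W, x, y):
    """One sample of nonlinear vector regression with MSE.

    Forward:  a = W^T x,  z = h(a)      (h = sigmoid here)
    Backward: delta = (z - y) * h'(a),  dJ/dW = x delta^T
    """
    a = W.T @ x                      # pre-activation, shape (K, 1)
    z = sigmoid(a)                   # output
    J = 0.5 * np.sum((z - y) ** 2)   # per-sample MSE
    delta = (z - y) * z * (1 - z)    # local sensitivity, using sigma' = sigma(1 - sigma)
    dW = x @ delta.T                 # gradient w.r.t. W, shape (d, K)
    return J, dW
```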
Multilabel classification
For classification, given $x$, the goal is to find a mapping from $x$ to $y$, i.e.
$$f: \mathcal{X} \to \mathcal{Y},$$
where $\mathcal{Y}$ is a discrete set. The label $y$ can be a (discrete) scalar or a vector, as shown below (example when there are 5 classes):
- scalar: $y \in \{1, 2, 3, 4, 5\}$
- vector (one-hot): $y \in \{(1,0,0,0,0)^T, (0,1,0,0,0)^T, \dots, (0,0,0,0,1)^T\}$
The scalar representation is seldom used because it attributes an unfavorable property to the labels: for example, the difference between class 1 and class 3 is 2 while the difference between class 1 and class 5 is 4, even though no such ordering of the classes is intended. The vector representation (one-hot encoding) avoids this issue and also has the property that any two distinct classes are equally far apart; a small encoding helper is sketched below.
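A one-hot encoder is a one-liner in numpy; here the classes are indexed 0..K-1 rather than 1..K (a convention of this sketch, not of the post):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Map integer labels in {0, ..., K-1} to one-hot row vectors."""
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

# one_hot([0, 2], 5) -> [[1, 0, 0, 0, 0],
#                        [0, 0, 1, 0, 0]]
```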
2.3 Training and testing
Normal distribution assumption & MSE function
Assume the label vector $y$ follows a normal distribution with mean $f(x; W)$, i.e.
$$p(y \mid x; W) = \mathcal{N}\!\left(y \mid f(x; W),\, \sigma^2 I\right).$$
Again, note that there is no assumption about the specific value of the variance of the labels, which is not needed.
Given a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, view $x$ and $y$ as random variables, with i.i.d. samples. Then the conditional data likelihood function is
$$L(W) = \prod_{i=1}^{N} p\!\left(y^{(i)} \mid x^{(i)}; W\right),$$
and maximizing it is again equivalent to minimizing the MSE.
Multinoulli distribution assumption & Cross-entropy function
For regression, where $y$ is continuous, the normal distribution assumption is natural, while for classification, where $y$ is discrete, it is strange. A better assumption is the multinoulli distribution (categorical probability distribution)
$$p(y = k) = \mu_k, \qquad \mu_k \ge 0, \quad \sum_{k=1}^{K}\mu_k = 1.$$
Using the one-hot representation $y \in \{0, 1\}^{K}$ with $\sum_k y_k = 1$, then
$$p(y) = \prod_{k=1}^{K} \mu_k^{\,y_k}.$$
For example, with $K = 3$ and $y = (0, 1, 0)^{T}$, $p(y) = \mu_1^{0}\mu_2^{1}\mu_3^{0} = \mu_2$.
For a K-class problem (K > 2), we are trying to learn a multinoulli distribution $p(y = k \mid x; W) = \mu_k(x)$, $k = 1, \dots, K$. Let $\mu_k$ take the following form (the softmax function):
$$\mu_k(x) = \frac{\exp\!\left(w_k^T x\right)}{\sum_{j=1}^{K}\exp\!\left(w_j^T x\right)}.$$
Clearly, $\mu_k(x) > 0$ and $\sum_{k=1}^{K}\mu_k(x) = 1$.
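In practice the softmax is computed with the inputs shifted by their maximum, which leaves the result unchanged (the same invariance discussed in Section 3.2) but avoids overflow in the exponential. A small numpy sketch:

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis, computed in a numerically stable way."""
    a = a - np.max(a, axis=-1, keepdims=True)   # shifting does not change the result
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)
```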
Our goal is to search for a value of $W$ so that the probability $\mu_k(x)$ is
- large when $x$ belongs to the k-th class
- small when $x$ belongs to other classes

where $W = [w_1, w_2, \dots, w_K] \in \mathbb{R}^{d \times K}$ collects the weight vectors of all $K$ classes.
Given a dataset $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ with one-hot labels, view $x$ and $y$ as random variables, with i.i.d. samples. Then the conditional data likelihood function is
$$L(W) = \prod_{i=1}^{N}\prod_{k=1}^{K} \mu_k\!\left(x^{(i)}\right)^{y_k^{(i)}}.$$
Maximizing this likelihood function is equivalent to minimizing
$$J_{\text{CE}}(W) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_k^{(i)} \log \mu_k\!\left(x^{(i)}\right),$$
where $J_{\text{CE}}$ is the Cross-entropy (CE) function.
Calculate the gradient:
Recall the derivative of the two-step composition $J(\mu(a))$, where $a = W^T x$ with components $a_k = w_k^T x$:
$$\frac{\partial J}{\partial a_j} = \sum_{k=1}^{K} \frac{\partial J}{\partial \mu_k}\,\frac{\partial \mu_k}{\partial a_j},$$
where $J = -\sum_k y_k \log \mu_k$ for a single sample, so $\frac{\partial J}{\partial \mu_k} = -\frac{y_k}{\mu_k}$. Therefore, we still need $\frac{\partial \mu_k}{\partial a_j}$, where
$$\mu_k = \frac{\exp(a_k)}{\sum_{l=1}^{K}\exp(a_l)}.$$
As for $\frac{\partial \mu_k}{\partial a_j}$:
- If $j \neq k$, $a_j$ appears only in the denominator, giving $\frac{\partial \mu_k}{\partial a_j} = -\mu_k \mu_j$
- If $j = k$, $a_j$ appears in both the numerator and the denominator, giving $\frac{\partial \mu_k}{\partial a_k} = \mu_k(1 - \mu_k)$

Therefore
$$\frac{\partial \mu_k}{\partial a_j} = \mu_k\left(\delta_{kj} - \mu_j\right),$$
where $\delta_{kj}$ is the Kronecker delta ($\delta_{kj} = 1$ if $k = j$ and $0$ otherwise).
Then
$$\frac{\partial J}{\partial a_j} = \sum_{k=1}^{K}\left(-\frac{y_k}{\mu_k}\right)\mu_k\left(\delta_{kj} - \mu_j\right) = \mu_j \sum_{k} y_k - y_j = \mu_j - y_j,$$
i.e. $\delta = \mu - y$, where $\delta$ is the local gradient or local sensitivity. For a single sample, the gradient with respect to $w_j$ is then $\frac{\partial J}{\partial w_j} = (\mu_j - y_j)\, x$.
The average over the $N$ samples is
$$\frac{\partial J}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N}\left(\mu_j\!\left(x^{(i)}\right) - y_j^{(i)}\right) x^{(i)}.$$
Vector-matrix form
Output is the softmax function $\mu = \operatorname{softmax}(W^T x)$, where $W = [w_1, \dots, w_K] \in \mathbb{R}^{d \times K}$. The gradient of the cross-entropy error function is
$$\frac{\partial J}{\partial W} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}\left(\mu^{(i)} - y^{(i)}\right)^{T};$$
a matrix-form sketch follows.
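The vector-matrix form maps directly to numpy; this sketch (my naming, one sample per row of X) returns both the cost and the gradient:

```python
import numpy as np

def softmax_ce_and_grad(W, X, Y, eps=1e-12):
    """Cross-entropy and gradient for softmax regression.

    W: (d, K) weights, X: (N, d) inputs, Y: (N, K) one-hot labels.
    The gradient is (1/N) * X^T (mu - Y), matching the formula above.
    """
    A = X @ W
    A -= A.max(axis=1, keepdims=True)            # numerical stability
    mu = np.exp(A)
    mu /= mu.sum(axis=1, keepdims=True)          # softmax probabilities, (N, K)
    J = -np.mean(np.sum(Y * np.log(mu + eps), axis=1))
    grad = X.T @ (mu - Y) / X.shape[0]
    return J, grad
```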
Training and Testing
As before, some regularization term can be incorporated into the cost function
Training: minimize the cost function with gradient descent,
$$W \leftarrow W - \eta\,\frac{\partial J}{\partial W},$$
where $\eta$ is the learning rate.
Testing: for a new input $x$, find the maximum among the $\mu_k(x)$ and predict $\hat{k} = \arg\max_k \mu_k(x)$.
Summary
3. Other Issues
3.1 Stochastic gradient descent (SGD)
- Minimizing the cost function over the entire training set is computationally expensive
- Decompose the training set into minibatches and optimize the cost function defined over individual minibatches
- The batch size typically ranges from 1 to a few hundred
- At every iteration, sample a minibatch $\mathcal{B}$ and update the parameters as
$$W \leftarrow W - \eta\,\frac{\partial J_{\mathcal{B}}}{\partial W},$$
where $J_{\mathcal{B}}$ is the cost function computed on $\mathcal{B}$ alone (see the sketch below)
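A generic minibatch SGD loop, sketched with a user-supplied gradient function (for instance a wrapper around the cost-and-gradient routine above); names and defaults are illustrative only:

```python
import numpy as np

def sgd(grad_fn, W, X, Y, lr=0.1, batch_size=64, epochs=10, seed=0):
    """Shuffle each epoch, then take one gradient step per minibatch.

    grad_fn(W, X_batch, Y_batch) must return the gradient of the minibatch cost.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    for _ in range(epochs):
        idx = rng.permutation(N)
        for start in range(0, N, batch_size):
            b = idx[start:start + batch_size]
            W = W - lr * grad_fn(W, X[b], Y[b])
    return W
```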
3.2 Softmax is over-parameterized
The hypothesis
$$\mu_k(x) = \frac{\exp\!\left(w_k^T x\right)}{\sum_{j=1}^{K}\exp\!\left(w_j^T x\right)} = \frac{\exp\!\left((w_k - \psi)^T x\right)}{\sum_{j=1}^{K}\exp\!\left((w_j - \psi)^T x\right)}$$
is unchanged when the same vector $\psi \in \mathbb{R}^{d}$ is subtracted from every $w_k$. Then the new parameters $w_k - \psi$ will result in the same prediction.
Therefore, minimizing the cross-entropy function has an infinite number of solutions, because
$$J(w_1 - \psi, \dots, w_K - \psi) = J(w_1, \dots, w_K),$$
where $\psi \in \mathbb{R}^{d}$ is arbitrary. A common remedy is to fix one weight vector (e.g. $w_K = 0$, which recovers logistic regression when $K = 2$) or to add a regularization term. A quick numerical check of the invariance is sketched below.
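A quick numerical check of this invariance (random numbers only; the sizes and names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))              # d = 4 features, K = 3 classes (columns are w_k)
x = rng.normal(size=4)
psi = rng.normal(size=4)                 # arbitrary vector subtracted from every w_k

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

mu_original = softmax(W.T @ x)
mu_shifted = softmax((W - psi[:, None]).T @ x)
print(np.allclose(mu_original, mu_shifted))   # True: the predictions are identical
```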
Summary
Perspective I
Perspective II
Nonlinear regression (linear regression as a special case):
- Output: $z = h(W^T x)$, where $h$ can be any activation function (the identity gives linear regression)
- MSE: $J = \frac{1}{2N}\sum_{i=1}^{N}\left\|z^{(i)} - y^{(i)}\right\|^2$
- Gradient: $\frac{\partial J}{\partial W} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}\left[\left(z^{(i)} - y^{(i)}\right) \odot h'\!\left(a^{(i)}\right)\right]^{T}$
Softmax regression (logistic regression as a special case):
- Output: $\mu = \operatorname{softmax}(W^T x)$, where $\operatorname{softmax}$ is the softmax function
- Cross-entropy error: $J = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_k^{(i)} \log \mu_k^{(i)}$
- Gradient: $\frac{\partial J}{\partial W} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}\left(\mu^{(i)} - y^{(i)}\right)^{T}$