TOC
- 1. Multi-layer Perceptron Basis
- 1.1 Perceptron
- 1.2 Structure of MLP
- 1.3 Forward Calculation
- 2. Backpropagation
- 2.1 Loss function
- 2.2 Weight adjustment
- 2.3 Gradient and local sensitivity
- 3. Layer Decomposition
- 3.1 Decomposition Methods
- 3.2 BP of decomposed layers
- 4. Training Techniques
- 4.1 Weight initialization
- 4.2 Learning rate
- 4.3 Order of training samples
- 4.4 Pathological Curvature & Momentum
- Appendix: A Detailed implementation of MLP
1. Multi-layer Perceptron Basis
1.1 Perceptron
A perceptron (single layer) is a linear function followed by a step function: $y = f(\mathbf{w}^\top \mathbf{x} + b)$, where $f$ is the step function.
For each data point $\mathbf{x}^{(n)}$ and the corresponding label $t^{(n)}$:
- Calculate the actual output $y^{(n)} = f(\mathbf{w}^\top \mathbf{x}^{(n)} + b)$
- Update the weights: $\mathbf{w} \leftarrow \mathbf{w} + \alpha\,(t^{(n)} - y^{(n)})\,\mathbf{x}^{(n)}$, where $\alpha$ is the learning rate
Note that there is no objective function (loss function) in the learning process of a single-layer perceptron; it simply follows the steps above to learn from the dataset. Besides, $y^{(n)}$ and $t^{(n)}$ are both binary values, which is significantly different from an MLP.
The limitation of the single-layer perceptron is that it can only solve linearly separable classification problems.
For problems that are not linearly separable, such as the XOR problem, it fails.
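The learning rule and its limitation can be illustrated with a minimal sketch. It assumes a step function that outputs 1 for non-negative input and 0 otherwise, a learning rate of 1, and a bias that is updated like a weight; the names `train_perceptron`, `AND` and `XOR` are illustrative only.

```python
def step(u):
    """Step activation: 1 for non-negative input, 0 otherwise (an assumption)."""
    return 1 if u >= 0 else 0


def train_perceptron(data, alpha=1.0, epochs=100):
    """Apply the perceptron rule to 2-D inputs; returns (w, b, converged)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, t in data:
            y = step(w[0] * x[0] + w[1] * x[1] + b)
            if y != t:
                mistakes += 1
                # w <- w + alpha * (t - y) * x, and the same rule for the bias
                w = [wi + alpha * (t - y) * xi for wi, xi in zip(w, x)]
                b += alpha * (t - y)
        if mistakes == 0:
            # A full pass without errors: the data set is separated.
            return w, b, True
    return w, b, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converged:", train_perceptron(AND)[2])
print("XOR converged:", train_perceptron(XOR)[2])
```

On the linearly separable AND data the loop stops after an error-free pass, while on XOR every pass contains at least one mistake, so the epoch limit is always reached.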
1.2 Structure of MLP
There are $L$ layers in total, not counting the input layer.
Connections:
- Full connections between layers
- No feedback connections between layers
- No lateral connections in the same layer
Every neuron receives input from the previous layer and fires according to an activation function.
1.3 Forward Calculation
Forward pass
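For $l = 1, \dots, L$, calculate the input to neuron $j$ in the $l$-th layer
$$u_j^{(l)} = \sum_i w_{ji}^{(l)} y_i^{(l-1)} + b_j^{(l)}$$
and its output $y_j^{(l)} = f(u_j^{(l)})$, where $f$ is the activation function. (Note that $\mathbf{y}^{(0)} = \mathbf{x}$.)
$l = L$ corresponds to the output layer discussed in lecture 3.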
Activation functions
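The activation function of the hidden layers can be any non-linear function. Three commonly used activation functions are:
- Logistic function: $f(u) = \frac{1}{1 + e^{-u}}$
- gradient: $f'(u) = f(u)\,(1 - f(u))$
- Hyperbolic tangent function: $f(u) = \tanh(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$
- gradient: $f'(u) = 1 - f(u)^2$
- Rectified linear activation function (ReLU): $f(u) = \max(0, u)$
- gradient: $f'(u) = 1$ if $u > 0$, and $0$ otherwise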
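As a quick sanity check, the closed-form gradients of the three activation functions can be compared against a central finite difference. This is a small sketch; the helper name `numeric_grad` and the test points are arbitrary choices.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def d_logistic(u):
    y = logistic(u)
    return y * (1.0 - y)


def d_tanh(u):
    return 1.0 - math.tanh(u) ** 2


def relu(u):
    return max(0.0, u)


def d_relu(u):
    # The derivative at exactly u = 0 is set to 0 by convention.
    return 1.0 if u > 0 else 0.0


def numeric_grad(f, u, eps=1e-6):
    """Central finite-difference approximation of f'(u)."""
    return (f(u + eps) - f(u - eps)) / (2 * eps)


for u in (-2.0, -0.5, 0.3, 1.7):
    print(u,
          abs(d_logistic(u) - numeric_grad(logistic, u)),
          abs(d_tanh(u) - numeric_grad(math.tanh, u)),
          abs(d_relu(u) - numeric_grad(relu, u)))
```

The test points avoid $u = 0$, where ReLU is not differentiable.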
Vector-matrix form
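If the previous layer has $m$ neurons and the current layer has $n$ neurons, define the weight matrix and bias vector as
$$W^{(l)} = \left[w_{ji}^{(l)}\right] \in \mathbb{R}^{n \times m}, \qquad \mathbf{b}^{(l)} = \left(b_1^{(l)}, \dots, b_n^{(l)}\right)^\top \in \mathbb{R}^{n}$$
Then for $l = 1, \dots, L$
$$\mathbf{u}^{(l)} = W^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{y}^{(l)} = f(\mathbf{u}^{(l)})$$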
MLP designed for XOR problem
The Boolean function of the XOR problem is shown below left. We can design an MLP structure, shown below right, to handle this problem.
It is easy to find a suitable weight for this problem:
- , where
- , where
The original space and the learned space are shown below. It's clear that the learned space has become linearly separable.
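One possible set of weights can be checked with a short forward pass in vector-matrix form. The weights below are hypothetical, hand-picked values for a 2-2-1 network with ReLU hidden units and a linear output; they are not necessarily the ones used in these notes.

```python
def relu(u):
    return max(0.0, u)


def layer(W, b, y_prev, f):
    """One layer: u = W y_prev + b, then y = f(u) applied element-wise."""
    u = [sum(w * y for w, y in zip(row, y_prev)) + bj for row, bj in zip(W, b)]
    return [f(uj) for uj in u]


# Hypothetical hand-picked weights (an assumption, not taken from the notes).
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]


def hidden_space(x):
    """Map an input point to the learned (hidden) space."""
    return layer(W1, b1, x, relu)


def xor_mlp(x):
    return layer(W2, b2, hidden_space(x), lambda u: u)[0]


for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, hidden_space(x), xor_mlp(x))
```

With these weights the two positive inputs collapse onto the same hidden point, so a single line separates the classes in the hidden space.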
2. Backpropagation
2.1 Loss function
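$$E = \frac{1}{N} \sum_{n=1}^{N} E^{(n)}$$
where $E^{(n)}$ is the loss function (a function of $\mathbf{y}^{(L)}$, the output of the last layer) for each input sample $\mathbf{x}^{(n)}$.
- If the task is regression, $E^{(n)}$ can be: L2 loss (MSE) or L1 loss.
- If the task is classification, $E^{(n)}$ can be: cross-entropy loss, L2 loss (MSE) or L1 loss.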
2.2 Weight adjustment
For regression, there is only one neuron in the last layer, which can be viewed as a special case of MLPs for classification. The following derivation is based on MLP classification.
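For classification with $K$ classes, the loss function can be
- Squared error (Euclidean loss, MSE): $E^{(n)} = \frac{1}{2} \sum_{k=1}^{K} \left(t_k - y_k^{(L)}\right)^2$
- Cross-entropy loss: $E^{(n)} = -\sum_{k=1}^{K} t_k \ln y_k^{(L)}$
where $\mathbf{t}$ is the target, of the one-hot form $(0, \dots, 0, 1, 0, \dots, 0)^\top$.
Note that, except for $E^{(n)}$, we omit the superscript $(n)$ on $\mathbf{y}$, $\mathbf{t}$, etc. for each input sample, for clarity.
Weights update:
$$w_{ji}^{(l)} \leftarrow w_{ji}^{(l)} - \alpha \frac{\partial E}{\partial w_{ji}^{(l)}}, \qquad b_j^{(l)} \leftarrow b_j^{(l)} - \alpha \frac{\partial E}{\partial b_j^{(l)}}$$
where $\alpha$ is the learning rate.
Weight decay is often used on $w_{ji}^{(l)}$ (not necessarily on $b_j^{(l)}$), which amounts to adding an additional term to the cost function:
$$\frac{\lambda}{2} \sum_{l} \sum_{j,i} \left(w_{ji}^{(l)}\right)^2$$
The weight adjustment on $w_{ji}^{(l)}$ is then changed to
$$w_{ji}^{(l)} \leftarrow w_{ji}^{(l)} - \alpha \frac{\partial E}{\partial w_{ji}^{(l)}} - \alpha \lambda\, w_{ji}^{(l)}$$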
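A minimal sketch of one update step with weight decay follows. It assumes the decay term is $\frac{\lambda}{2}\sum w^2$, so that its gradient with respect to each weight is $\lambda w$; the function name `sgd_step` and the flat-list representation are illustrative.

```python
def sgd_step(w, grad, alpha, lam=0.0):
    """One update w <- w - alpha * grad - alpha * lam * w on a flat list of weights.

    `grad` is the gradient of the data loss only; `lam` is the weight-decay
    coefficient (lam = 0 gives the plain update).
    """
    return [wi - alpha * gi - alpha * lam * wi for wi, gi in zip(w, grad)]


# With a zero data gradient, weight decay alone shrinks every weight by (1 - alpha * lam).
print(sgd_step([1.0, -2.0], [0.0, 0.0], alpha=0.1, lam=0.5))
```

Biases would be updated with `lam=0.0` if they are excluded from the decay.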
2.3 Gradient and local sensitivity
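Layers in an MLP fall into three categories:
- Hidden layer: $\mathbf{y}^{(l)} = f(\mathbf{u}^{(l)})$
- MSE layer: $E = \frac{1}{2} \left\| \mathbf{t} - f(\mathbf{u}^{(L)}) \right\|^2$
- CE layer: $E = -\sum_k t_k \ln \mathrm{softmax}(\mathbf{u}^{(L)})_k$
Note that all layers have input $\mathbf{u}^{(l)} = W^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}$
Define the local sensitivity as $\delta_j^{(l)} = \frac{\partial E}{\partial u_j^{(l)}}$. Then for $l = 1, \dots, L$
$$\frac{\partial E}{\partial w_{ji}^{(l)}} = \delta_j^{(l)}\, y_i^{(l-1)}, \qquad \frac{\partial E}{\partial b_j^{(l)}} = \delta_j^{(l)}$$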
Local sensitivity for MSE layer
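The outputs of the last-layer units of the MLP are
$$y_k^{(L)} = f\left(u_k^{(L)}\right) = f\left(\sum_j w_{kj}^{(L)} y_j^{(L-1)} + b_k^{(L)}\right)$$
where $y_j^{(L-1)}$ is the output of the units in the (L-1)-th layer. The activation function $f$ can be logistic sigmoid, tanh or ReLU.
The error for each sample is $E = \frac{1}{2} \sum_k \left(t_k - y_k^{(L)}\right)^2$, thus the local sensitivity is
$$\delta_k^{(L)} = \frac{\partial E}{\partial u_k^{(L)}} = \left(y_k^{(L)} - t_k\right) f'\left(u_k^{(L)}\right)$$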
Local sensitivity for CE layer
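If softmax regression is used in the last layer of an MLP, the probabilistic function becomes
$$y_k^{(L)} = \frac{\exp\left(u_k^{(L)}\right)}{\sum_i \exp\left(u_i^{(L)}\right)}, \qquad u_k^{(L)} = \sum_j w_{kj}^{(L)} y_j^{(L-1)} + b_k^{(L)}$$
where $y_j^{(L-1)}$ is the output of the units in the (L-1)-th layer.
The error for each sample is $E = -\sum_k t_k \ln y_k^{(L)}$, thus the local sensitivity is
$$\delta_k^{(L)} = \frac{\partial E}{\partial u_k^{(L)}} = y_k^{(L)} - t_k$$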
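The well-known result that the softmax cross-entropy sensitivity equals "prediction minus target" can be checked numerically. The sketch below compares it with a central finite difference of the loss with respect to the softmax input; the function names and test values are arbitrary.

```python
import math


def softmax(u):
    m = max(u)  # subtract the maximum for numerical stability
    e = [math.exp(ui - m) for ui in u]
    s = sum(e)
    return [ei / s for ei in e]


def cross_entropy(u, t):
    """Cross-entropy between the one-hot target t and softmax(u)."""
    return -sum(tk * math.log(yk) for tk, yk in zip(t, softmax(u)))


def delta_ce(u, t):
    """Analytic sensitivity dE/du = softmax(u) - t."""
    return [yk - tk for yk, tk in zip(softmax(u), t)]


def numeric_delta(u, t, eps=1e-6):
    """Central finite-difference approximation of dE/du."""
    grads = []
    for k in range(len(u)):
        up, dn = list(u), list(u)
        up[k] += eps
        dn[k] -= eps
        grads.append((cross_entropy(up, t) - cross_entropy(dn, t)) / (2 * eps))
    return grads


u, t = [0.5, -1.2, 2.0], [0.0, 1.0, 0.0]
print(delta_ce(u, t))
print(numeric_delta(u, t))
```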
Local sensitivity for other layers
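If $l < L$, i.e. neuron $j$ is a hidden neuron, it has an effect on all neurons in the next layer, therefore its local sensitivity is
$$\delta_j^{(l)} = \frac{\partial E}{\partial u_j^{(l)}} = \sum_k \frac{\partial E}{\partial u_k^{(l+1)}} \frac{\partial u_k^{(l+1)}}{\partial y_j^{(l)}} \frac{\partial y_j^{(l)}}{\partial u_j^{(l)}} = f'\left(u_j^{(l)}\right) \sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)}$$
where $y_j^{(l)} = f(u_j^{(l)})$. Here $f$ can be any activation function.
We compute $\delta^{(l)}$ backwards, from $l = L$ to $l = 1$.
In practice, $\sum_k \delta_k^{(l+1)} w_{kj}^{(l+1)}$ should be calculated in layer $l+1$ because it is determined by the parameters of layer $l+1$.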
Vector-matrix form
Local sensitivity
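For the output layer $L$:
- MSE: $\boldsymbol{\delta}^{(L)} = \left(\mathbf{y}^{(L)} - \mathbf{t}\right) \odot f'\left(\mathbf{u}^{(L)}\right)$
- CE: $\boldsymbol{\delta}^{(L)} = \mathbf{y}^{(L)} - \mathbf{t}$
For hidden layer $l = L-1, \dots, 1$:
$$\boldsymbol{\delta}^{(l)} = \left( {W^{(l+1)}}^\top \boldsymbol{\delta}^{(l+1)} \right) \odot f'\left(\mathbf{u}^{(l)}\right)$$
For $l = 1, \dots, L$, the gradients are:
$$\frac{\partial E}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} {\mathbf{y}^{(l-1)}}^\top, \qquad \frac{\partial E}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}$$
Update weights:
$$W^{(l)} \leftarrow W^{(l)} - \alpha \frac{\partial E}{\partial W^{(l)}}, \qquad \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \alpha \frac{\partial E}{\partial \mathbf{b}^{(l)}}$$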
Gradient vanishing
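Note that for the hidden layers $l < L$: $\boldsymbol{\delta}^{(l)} = \left( {W^{(l+1)}}^\top \boldsymbol{\delta}^{(l+1)} \right) \odot f'\left(\mathbf{u}^{(l)}\right)$
- For the logistic function, $f'(u) = f(u)(1 - f(u)) \le \frac{1}{4}$
- For the tanh function, $f'(u) = 1 - f(u)^2 \le 1$
For these two sigmoid functions, $\boldsymbol{\delta}^{(l)}$ tends to become smaller and smaller from $l = L$ to 1. The gradient approaches zero in the lower layers, which is called gradient vanishing.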
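The effect can be made concrete with a toy example. The sketch below assumes a chain of one-neuron logistic layers without biases, a squared error with target 0, and all weights equal to 1; under these assumptions each backward step multiplies the sensitivity by at most 1/4.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def sensitivities(x, weights):
    """Return |delta^(l)| for l = 1..L in a chain of one-neuron logistic layers."""
    # Forward pass: ys[l] is the output of layer l, ys[0] is the input.
    ys = [x]
    for w in weights:
        ys.append(logistic(w * ys[-1]))
    L = len(weights)
    # Output layer, squared error with target 0: delta = (y - t) * f'(u).
    delta = (ys[L] - 0.0) * ys[L] * (1.0 - ys[L])
    deltas = [abs(delta)]
    for l in range(L - 1, 0, -1):
        # delta^(l) = w^(l+1) * delta^(l+1) * f'(u^(l)); weights[l] holds w^(l+1).
        delta = weights[l] * delta * ys[l] * (1.0 - ys[l])
        deltas.append(abs(delta))
    return deltas[::-1]  # index 0 corresponds to layer 1


for l, d in enumerate(sensitivities(0.5, [1.0] * 6), start=1):
    print(f"layer {l}: |delta| = {d:.3e}")
```

Larger weights can offset the shrinking factor, so this illustrates the tendency rather than a guarantee.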
Implementation of BP
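- Run forward process
- Calculate $\mathbf{u}^{(l)}$ and $\mathbf{y}^{(l)}$ for $l = 1, \dots, L$
- Run backward process
- Calculate $\boldsymbol{\delta}^{(l)}$ and the gradients $\frac{\partial E}{\partial W^{(l)}}$, $\frac{\partial E}{\partial \mathbf{b}^{(l)}}$, for $l = L, \dots, 1$
- Update $W^{(l)}$ and $\mathbf{b}^{(l)}$ for $l = 1, \dots, L$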
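The three steps can be sketched in plain Python for a single training sample. This is a minimal illustration, assuming logistic activations in every layer and the squared-error loss; the helper names (`init_params`, `forward`, `backward`, `sgd_update`) and the Gaussian initialisation with std 0.5 are choices made for the example only.

```python
import math
import random


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def init_params(sizes, seed=0):
    """One (W, b) pair per layer; W has shape n_out x n_in."""
    rng = random.Random(seed)
    return [([[rng.gauss(0.0, 0.5) for _ in range(m)] for _ in range(n)], [0.0] * n)
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """Forward process: return the outputs y^(0), ..., y^(L)."""
    ys = [x]
    for W, b in params:
        u = [sum(w * y for w, y in zip(row, ys[-1])) + bj for row, bj in zip(W, b)]
        ys.append([logistic(uj) for uj in u])
    return ys


def loss(params, x, t):
    y = forward(params, x)[-1]
    return 0.5 * sum((tk - yk) ** 2 for tk, yk in zip(t, y))


def backward(params, x, t):
    """Backward process: return [(dE/dW, dE/db)] for layers 1..L."""
    ys = forward(params, x)
    L = len(params)
    # Output layer: delta = (y - t) * f'(u), with f'(u) = y * (1 - y).
    delta = [(y - tk) * y * (1.0 - y) for y, tk in zip(ys[L], t)]
    grads = [None] * L
    for l in range(L, 0, -1):
        y_prev = ys[l - 1]
        grads[l - 1] = ([[dj * yi for yi in y_prev] for dj in delta], list(delta))
        if l > 1:
            W = params[l - 1][0]
            # delta^(l-1) = (W^(l)^T delta^(l)) * f'(u^(l-1))
            delta = [sum(W[j][i] * delta[j] for j in range(len(delta))) * yi * (1.0 - yi)
                     for i, yi in enumerate(y_prev)]
    return grads


def sgd_update(params, grads, alpha):
    """Update step: W <- W - alpha * dE/dW, b <- b - alpha * dE/db."""
    return [([[w - alpha * g for w, g in zip(rw, rg)] for rw, rg in zip(W, dW)],
             [bj - alpha * gj for bj, gj in zip(b, db)])
            for (W, b), (dW, db) in zip(params, grads)]


params = init_params([2, 3, 2])
x, t = [0.3, -0.7], [1.0, 0.0]
for _ in range(200):
    params = sgd_update(params, backward(params, x, t), alpha=0.5)
print("loss after training:", loss(params, x, t))
```

A finite-difference check of a few entries of `backward` against `loss` is a cheap way to catch indexing mistakes in code like this.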
3. Layer Decomposition
3.1 Decomposition Methods
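The input layer or hidden layer
$$\mathbf{y}^{(l)} = f\left(W^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}\right)$$
can be decomposed into two layers:
- Fully connected layer: $\mathbf{u}^{(l)} = W^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}$
- Activation layer: $\mathbf{y}^{(l)} = f(\mathbf{u}^{(l)})$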
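The squared error layer
$$E = \frac{1}{2} \left\| \mathbf{t} - f\left(W^{(L)} \mathbf{y}^{(L-1)} + \mathbf{b}^{(L)}\right) \right\|^2$$
can be decomposed into three layers:
- Fully connected layer: $\mathbf{u}^{(L)} = W^{(L)} \mathbf{y}^{(L-1)} + \mathbf{b}^{(L)}$
- Activation layer: $\mathbf{y}^{(L)} = f(\mathbf{u}^{(L)})$
- Loss layer: $E = \frac{1}{2} \left\| \mathbf{t} - \mathbf{y}^{(L)} \right\|^2$
A decomposition example is shown below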
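The cross-entropy loss layer
$$E = -\sum_k t_k \ln h_k\left(\mathbf{u}^{(L)}\right)$$
acting on the output $\mathbf{u}^{(L)}$ of the last fully connected layer, can be decomposed into two layers:
- Softmax layer: $\mathbf{y}^{(L)} = h(\mathbf{u}^{(L)})$, where $h$ is the softmax function
- Loss layer: $E = -\sum_k t_k \ln y_k^{(L)}$
However, the CE layer is often not decomposed this way but implemented together, because this loss is only used together with the softmax layer. It is often decomposed into:
- Fully connected layer: $\mathbf{u}^{(L)} = W^{(L)} \mathbf{y}^{(L-1)} + \mathbf{b}^{(L)}$
- Activation & Loss layer: $E = -\sum_k t_k \ln h_k\left(\mathbf{u}^{(L)}\right)$
A decomposition example is shown below
Note: whether an activation layer is needed after the last FC layer depends on the situation.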
3.2 BP of decomposed layers
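- Euclidean loss layer: $E = \frac{1}{2} \left\| \mathbf{t} - \mathbf{y}^{(L)} \right\|^2$
- According to the decomposition method above, there are L+1 layers in total
- We can calculate $\frac{\partial E}{\partial \mathbf{y}^{(L)}} = \mathbf{y}^{(L)} - \mathbf{t}$
- Softmax-Cross-entropy loss layer: $E = -\sum_k t_k \ln y_k^{(L)}$, where $y_k^{(L)} = \frac{\exp(u_k^{(L)})}{\sum_i \exp(u_i^{(L)})}$ (softmax activation function)
- According to the decomposition method above, there are L layers in total
- We can calculate $\frac{\partial E}{\partial u_j^{(L)}} = \sum_k \frac{\partial E}{\partial y_k^{(L)}} \frac{\partial h_k}{\partial u_j^{(L)}} = y_j^{(L)} - t_j$ ($h$ denotes the softmax function, $y_k^{(L)} = h_k(\mathbf{u}^{(L)})$)
- if $j = k$: $\frac{\partial h_k}{\partial u_j^{(L)}} = y_k^{(L)} \left(1 - y_k^{(L)}\right)$
- if $j \neq k$: $\frac{\partial h_k}{\partial u_j^{(L)}} = -y_k^{(L)} y_j^{(L)}$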
Proof
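Since we do not decompose the softmax (activation function) and the cross-entropy in the SCE loss layer, the input of this layer is $\mathbf{u}^{(L)}$, which is the output of the last FC layer. Now we derive $\frac{\partial E}{\partial u_j^{(L)}}$, writing $y_k = y_k^{(L)}$ and $u_j = u_j^{(L)}$ for short:
Since $E = -\sum_k t_k \ln y_k$, $\frac{\partial E}{\partial y_k} = -\frac{t_k}{y_k}$.
Since $y_k = \frac{\exp(u_k)}{\sum_i \exp(u_i)}$,
$$\frac{\partial y_k}{\partial u_j} = \begin{cases} y_k (1 - y_k), & j = k \\ -y_k y_j, & j \neq k \end{cases}$$
Therefore, using $\sum_k t_k = 1$,
$$\frac{\partial E}{\partial u_j} = \sum_k \frac{\partial E}{\partial y_k} \frac{\partial y_k}{\partial u_j} = -t_j (1 - y_j) + \sum_{k \neq j} t_k y_j = y_j \sum_k t_k - t_j = y_j - t_j$$
Therefore $\boldsymbol{\delta}^{(L)} = \mathbf{y}^{(L)} - \mathbf{t}$.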
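- Fully connected layer: $\frac{\partial E}{\partial \mathbf{y}^{(l-1)}} = {W^{(l)}}^\top \boldsymbol{\delta}^{(l)}$ or, element-wise, $\frac{\partial E}{\partial y_i^{(l-1)}} = \sum_j w_{ji}^{(l)} \delta_j^{(l)}$
- Sigmoid layer: $\boldsymbol{\delta}^{(l)} = \frac{\partial E}{\partial \mathbf{y}^{(l)}} \odot f'(\mathbf{u}^{(l)})$ or $\delta_j^{(l)} = \frac{\partial E}{\partial y_j^{(l)}}\, y_j^{(l)} \left(1 - y_j^{(l)}\right)$, where $f$ is a sigmoid function
- ReLU layer: $\boldsymbol{\delta}^{(l)} = \frac{\partial E}{\partial \mathbf{y}^{(l)}} \odot f'(\mathbf{u}^{(l)})$ or $\delta_j^{(l)} = \frac{\partial E}{\partial y_j^{(l)}}$ if $u_j^{(l)} > 0$ and $0$ otherwise, where $f$ is a ReLU function
Note that $\boldsymbol{\delta}^{(l)}$ and $\frac{\partial E}{\partial \mathbf{u}^{(l)}}$ are identical in every layer $l$.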
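Decomposed layers map naturally onto small classes, each with its own forward and backward step. The sketch below is one possible arrangement, not the implementation from the appendix; the class names, the `run` helper and the example weights are hypothetical. Softmax and cross-entropy are kept in a single layer, as discussed above.

```python
import math


class FullyConnected:
    def __init__(self, W, b):
        self.W, self.b = W, b

    def forward(self, x):
        self.x = x
        return [sum(w * xi for w, xi in zip(row, x)) + bj
                for row, bj in zip(self.W, self.b)]

    def backward(self, grad_out):
        # Parameter gradients: dE/dW = grad_out x^T, dE/db = grad_out.
        self.dW = [[g * xi for xi in self.x] for g in grad_out]
        self.db = list(grad_out)
        # Gradient passed to the previous layer: W^T grad_out.
        return [sum(self.W[j][i] * grad_out[j] for j in range(len(grad_out)))
                for i in range(len(self.x))]


class Sigmoid:
    def forward(self, u):
        self.y = [1.0 / (1.0 + math.exp(-ui)) for ui in u]
        return self.y

    def backward(self, grad_out):
        return [g * y * (1.0 - y) for g, y in zip(grad_out, self.y)]


class ReLU:
    def forward(self, u):
        self.u = u
        return [max(0.0, ui) for ui in u]

    def backward(self, grad_out):
        return [g if ui > 0 else 0.0 for g, ui in zip(grad_out, self.u)]


class SoftmaxCrossEntropy:
    """Softmax and cross-entropy implemented together as one loss layer."""

    def forward(self, u, t):
        m = max(u)
        e = [math.exp(ui - m) for ui in u]
        s = sum(e)
        self.h, self.t = [ei / s for ei in e], t
        return -sum(tk * math.log(hk) for tk, hk in zip(t, self.h))

    def backward(self):
        return [hk - tk for hk, tk in zip(self.h, self.t)]


def run(layers, loss_layer, x, t):
    """Forward through all layers, then backward; returns (loss, dE/dx)."""
    for layer in layers:
        x = layer.forward(x)
    E = loss_layer.forward(x, t)
    grad = loss_layer.backward()
    for layer in reversed(layers):
        grad = layer.backward(grad)
    return E, grad


layers = [FullyConnected([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
          ReLU(),
          FullyConnected([[0.3, -0.1, 0.2], [-0.4, 0.5, 0.1]], [0.0, 0.0])]
E, grad_x = run(layers, SoftmaxCrossEntropy(), [1.0, 2.0], [0.0, 1.0])
print(E, grad_x)
```

Because every layer only needs the gradient with respect to its own output, layers can be swapped (for example `Sigmoid` for `ReLU`) without touching the rest of the chain.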
Summary
All layers discussed so far can be decomposed into: input layer, hidden layer, logistic regression layer, softmax regression layer, loss layer.
There is one exception: the softmax function and the cross-entropy loss function are bound together.
4. Training Techniques
4.1 Weight initialization
Each weight $w$ feeding into a neuron is drawn from a distribution. It can be
- Gaussian: a Gaussian distribution with zero mean and a fixed std, e.g. 0.01
- Xavier: a distribution with zero mean and a specific std $\sqrt{1/n_{in}}$, where $n_{in}$ is the number of neurons feeding into the neuron. A Gaussian or uniform distribution is often used
- MSRA: a Gaussian distribution with zero mean and a specific std $\sqrt{2/n_{in}}$
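The three schemes differ only in the standard deviation they use. The sketch below assumes the commonly used fan-in variants, std $\sqrt{1/n_{in}}$ for Xavier and $\sqrt{2/n_{in}}$ for MSRA, and draws from a Gaussian in all cases; `init_weights` is an illustrative helper name.

```python
import math
import random


def init_weights(n_in, n_out, method="xavier", seed=0):
    """Return an n_out x n_in weight matrix drawn from a zero-mean Gaussian."""
    rng = random.Random(seed)
    if method == "gaussian":
        std = 0.01                     # fixed std
    elif method == "xavier":
        std = math.sqrt(1.0 / n_in)    # assumed fan-in variant
    elif method == "msra":
        std = math.sqrt(2.0 / n_in)    # assumed fan-in variant
    else:
        raise ValueError(f"unknown method: {method}")
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def sample_std(W):
    """Empirical standard deviation of all entries of W."""
    flat = [w for row in W for w in row]
    mean = sum(flat) / len(flat)
    return math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))


for method in ("gaussian", "xavier", "msra"):
    print(method, sample_std(init_weights(100, 200, method)))
```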
4.2 Learning rate
In SGD, the learning rate is typically much smaller than a corresponding learning rate in batch gradient descent because there is much more variance in the update.
Choose the proper schedule:
- Use a constant learning rate small enough to give stable convergence in the initial epoch (full pass through the training set) or two of training, and then halve the learning rate as convergence slows down.
- Evaluate a held-out set after each epoch and anneal the learning rate when the change in objective between epochs is below a small threshold.
- Anneal the learning rate at each iteration $t$ as $\frac{a}{b + t}$, where $a$ and $b$ dictate the initial learning rate and when the annealing begins, respectively.
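The last schedule is a one-line function. In this sketch `t` counts iterations from 0, so the initial rate is `a / b` and the rate has halved once `t` reaches `b`; the function name and example values are arbitrary.

```python
def annealed_lr(t, a, b):
    """Learning rate a / (b + t) at iteration t (t = 0, 1, 2, ...)."""
    return a / (b + t)


# Initial rate a / b = 0.1; the rate is halved by iteration t = b.
for t in (0, 10, 100, 1000):
    print(t, annealed_lr(t, a=1.0, b=10.0))
```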
4.3 Order of training samples
If the data is given in some meaningful order, this can bias the gradient and lead to poor convergence. A good way to avoid this is generally to randomly shuffle the data prior to each epoch of training.
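Shuffling before every epoch can be done on the sample indices rather than on the data itself. A minimal sketch, where `epoch_batches` is a hypothetical helper that yields mini-batches of indices:

```python
import random


def epoch_batches(n_samples, batch_size, rng):
    """Yield lists of sample indices covering one epoch in a fresh random order."""
    order = list(range(n_samples))
    rng.shuffle(order)  # a new permutation every time the function is called
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


rng = random.Random(0)
for epoch in range(2):
    print("epoch", epoch, list(epoch_batches(10, 4, rng)))
```

Each sample is still visited exactly once per epoch; only the order changes between epochs.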
4.4 Pathological Curvature & Momentum
As shown in the left figure above, the objective has the form of a long shallow ravine leading to the optimum, with steep walls on the sides.
As shown in the right figure above, the objectives of deep architectures have this form near local optima, and thus standard SGD tends to oscillate across the narrow ravine.
Momentum is one method for pushing the objective more quickly along the shallow ravine. The momentum update is given by
$$\mathbf{v} \leftarrow \gamma \mathbf{v} + \alpha \nabla_{\theta} E(\theta), \qquad \theta \leftarrow \theta - \mathbf{v}$$
where $\theta$ denotes the parameters and $\alpha$ the learning rate.
- $\mathbf{v}$ is the current velocity vector, which accumulates the previous gradients
- $\gamma \in (0, 1]$ determines for how many iterations the previous gradients are incorporated into the current update
- One strategy: $\gamma$ is set to 0.5 until the initial learning stabilizes and then is increased to 0.9 or higher
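An example is shown below
Let $\Delta\theta_{t-1} = \theta_t - \theta_{t-1}$.
The green vector is the change of $\theta$ in the previous step, $\Delta\theta_{t-1}$.
- Standard gradient descent: $\theta_{t+1} = \theta_t - \alpha \nabla_{\theta} E(\theta_t)$
- Gradient descent with momentum: $\theta'_{t+1} = \theta_t - \alpha \nabla_{\theta} E(\theta_t) + \gamma\, \Delta\theta_{t-1}$
It can be observed that $\theta'_{t+1}$ is a better update than $\theta_{t+1}$.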
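The benefit of momentum on a ravine-shaped objective can be reproduced on a toy problem. The sketch below assumes the quadratic $E(\theta) = \frac{1}{2}(\theta_1^2 + 25\,\theta_2^2)$, which is shallow along $\theta_1$ and steep along $\theta_2$, and uses full gradients instead of stochastic ones; the step size, $\gamma$ and starting point are arbitrary choices.

```python
def grad(theta):
    """Gradient of E(theta) = 0.5 * (theta[0]**2 + 25 * theta[1]**2)."""
    return [theta[0], 25.0 * theta[1]]


def loss(theta):
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)


def run_descent(theta, alpha, steps, gamma=0.0):
    """Iterate v = gamma * v + alpha * grad, theta = theta - v.

    gamma = 0 reduces to standard gradient descent.
    """
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta


start = [10.0, 1.0]
plain = run_descent(start, alpha=0.03, steps=200)
momentum = run_descent(start, alpha=0.03, steps=200, gamma=0.9)
print("standard:", loss(plain))
print("momentum:", loss(momentum))
```

With the same learning rate, the accumulated velocity lets the momentum run move much further along the shallow direction within the same number of steps.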