
L4. Multi-layer Perceptron


1. Multi-layer Perceptron Basics

1.1 Perceptron

Perceptron (single layer) is a linear function plus a step-jump function: $\hat{y} = \mathrm{step}(\mathbf{w}^\top \mathbf{x} + b)$.
For each data point $\mathbf{x}^{(n)}$ and the corresponding label $y^{(n)}$:
  • Calculate the actual output $\hat{y}^{(n)} = \mathrm{step}(\mathbf{w}^\top \mathbf{x}^{(n)} + b)$
  • Update the weights:
    • $w_j \leftarrow w_j + \eta\,(y^{(n)} - \hat{y}^{(n)})\,x_j^{(n)}$, where $\eta$ is the learning rate
Note that there is no objective function (loss function) during the learning process of the perceptron (single layer); it simply follows the above steps to learn from the dataset. Besides, $y$ and $\hat{y}$ are both binary values, which is significantly different from an MLP.
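As a concrete illustration, below is a minimal NumPy sketch of this learning rule (the helper name `train_perceptron` and the hyperparameters `eta`, `n_epochs` are our own choices, with labels in $\{0, 1\}$):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=20):
    """Single-layer perceptron with a step activation.

    X: (N, d) inputs, y: (N,) binary labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1.0 if np.dot(w, x_i) + b > 0 else 0.0  # step function
            # the weights change only when the prediction is wrong
            w += eta * (y_i - y_hat) * x_i
            b += eta * (y_i - y_hat)
    return w, b

# AND is linearly separable, so the perceptron can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)
w, b = train_perceptron(X, y_and)
print([1.0 if np.dot(w, x) + b > 0 else 0.0 for x in X])  # [0, 0, 0, 1]
```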
The limitation of the perceptron (single layer) is that it can only solve linearly separable classification problems.
For linearly non-separable problems, such as the XOR problem, it fails.

1.2 Structure of MLP

There are a total of $L$ layers, not counting the input layer.
Connections:
  • Full connections between layers
  • No feedback connections between layers
  • No lateral connections in the same layer
Every neuron receives input from the previous layer and fires according to an activation function.

1.3 Forward Calculation

Forward pass
For $l = 1, \dots, L$, calculate the input to neuron $i$ in the $l$-th layer, $z_i^{(l)} = \sum_j w_{ij}^{(l)} a_j^{(l-1)} + b_i^{(l)}$,
and its output $a_i^{(l)} = f(z_i^{(l)})$, where $f$ is the activation function. (Note that $a_j^{(0)} = x_j$.)
$l = L$ corresponds to the output layer discussed in lecture 3.
Activation functions
The activation function of the hidden layers can be any kind of non-linear function. Three commonly used activation functions are:
  • Logistic function: $f(z) = \dfrac{1}{1 + e^{-z}}$
    • gradient: $f'(z) = f(z)\,(1 - f(z))$
  • Hyperbolic tangent function: $f(z) = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
    • gradient: $f'(z) = 1 - f(z)^2$
  • Rectified linear activation function (ReLU): $f(z) = \max(0, z)$
    • gradient: $f'(z) = 1$ if $z > 0$, and $0$ otherwise
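A minimal NumPy sketch of these three activations and their gradients (the helper names are our own):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    a = logistic(z)
    return a * (1.0 - a)            # f'(z) = f(z)(1 - f(z))

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2    # f'(z) = 1 - f(z)^2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(z.dtype)  # 1 where z > 0, else 0
```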
Vector-matrix form
If the previous layer has $m$ neurons and the current layer has $n$ neurons, define the weight matrix $W^{(l)} \in \mathbb{R}^{n \times m}$ (with entries $w_{ij}^{(l)}$) and the bias vector $b^{(l)} \in \mathbb{R}^{n}$.
Then for $l = 1, \dots, L$:
$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = f\!\left(z^{(l)}\right)$
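A minimal sketch of the vectorized forward pass under these definitions (the function name `forward` and the use of a single activation `f` for every layer, including the last, are simplifying assumptions):

```python
import numpy as np

def forward(x, Ws, bs, f):
    """Forward pass through an MLP in vector-matrix form.

    Ws[l], bs[l] hold the weight matrix and bias of layer l+1;
    f is the activation function. Returns the pre-activations z^(l)
    and the activations a^(l) (with a^(0) = x first).
    """
    a, zs, activations = x, [], [x]
    for W, b in zip(Ws, bs):
        z = W @ a + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        a = f(z)               # a^(l) = f(z^(l))
        zs.append(z)
        activations.append(a)
    return zs, activations
```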
MLP designed for XOR problem
The boolean function of the XOR problem is shown below left. We can design an MLP structure as shown below right to handle this problem.
[Figure: XOR truth table (left) and a two-layer MLP for XOR (right)]
It is easy to find suitable weights $W^{(1)}, b^{(1)}$ and $W^{(2)}, b^{(2)}$ for this problem; one possible choice is given in the code sketch below.
The original input space and the learned hidden-layer space are shown below. It is clear that the learned space has become linearly separable.
[Figure: original input space and learned hidden-layer space]
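The exact weights from the lecture figure are not reproduced here, but the sketch below shows one possible hand-picked choice with step activations: the two hidden units act as OR and AND detectors, and the output unit fires when OR is on and AND is off, which is exactly XOR.

```python
import numpy as np

step = lambda z: (z > 0).astype(float)

# Hidden layer: unit 1 ~ OR(x1, x2), unit 2 ~ AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output layer: fires when OR is on and AND is off -> XOR
W2 = np.array([[1.0, -2.0]])
b2 = np.array([-0.5])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x, dtype=float) + b1)
    y = step(W2 @ h + b2)
    print(x, int(y[0]))   # prints 0, 1, 1, 0
```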

2. Backpropagation

2.1 Loss function

The overall loss is $E = \sum_{n=1}^{N} E^{(n)}$, where $E^{(n)}$ is the loss function (a function of $a^{(L)}$, the output of the last layer) for each input sample $\mathbf{x}^{(n)}$.
  • If the task is regression, $E^{(n)}$ can be the L2 loss (MSE) or the L1 loss.
  • If the task is classification, $E^{(n)}$ can be the cross-entropy loss, the L2 loss (MSE), or the L1 loss.

2.2 Weight adjustment

For regression, there is only one neuron in the last layer, which can be viewed as a special case of MLPs for classification. The following derivation is based on MLP classification.
For classification, the loss function can be
  • Squared error (Euclidean loss, MSE): $E = \dfrac{1}{2} \left\lVert a^{(L)} - t \right\rVert^2 = \dfrac{1}{2} \sum_k \left(a_k^{(L)} - t_k\right)^2$
  • Cross-entropy loss: $E = -\sum_k t_k \log a_k^{(L)}$
where $t$ is the target in one-hot form, $t = (0, \dots, 0, 1, 0, \dots, 0)^\top$.
Note that, for clarity, we omit the per-sample superscript $(n)$ on $a$, $z$, $t$, etc.
Weights update:
$W^{(l)} \leftarrow W^{(l)} - \eta \dfrac{\partial E}{\partial W^{(l)}}, \quad b^{(l)} \leftarrow b^{(l)} - \eta \dfrac{\partial E}{\partial b^{(l)}}$
where $\eta$ is the learning rate.
Weight decay is often used on $W^{(l)}$ (not necessarily on $b^{(l)}$), which amounts to adding an additional term $\dfrac{\lambda}{2} \sum_l \left\lVert W^{(l)} \right\rVert_F^2$ to the cost function.
Weight adjustment on $W^{(l)}$ is then changed to
$W^{(l)} \leftarrow W^{(l)} - \eta \left( \dfrac{\partial E}{\partial W^{(l)}} + \lambda W^{(l)} \right)$
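A minimal sketch of one such update step (the names `dW`, `db`, `lam` are our own; the gradients are assumed to be computed already, e.g. by the backpropagation procedure below):

```python
def sgd_step(W, b, dW, db, eta=0.1, lam=1e-4):
    """One SGD update with weight decay applied to W but not to b."""
    W_new = W - eta * (dW + lam * W)   # W <- W - eta * (dE/dW + lambda * W)
    b_new = b - eta * db               # b <- b - eta * dE/db
    return W_new, b_new
```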

2.3 Gradient and local sensitivity

Layers in an MLP actually fall into three categories:
  • Hidden layer: $a^{(l)} = f\!\left(z^{(l)}\right)$
  • MSE layer: $E = \dfrac{1}{2} \left\lVert a^{(L)} - t \right\rVert^2$, with $a^{(L)} = f\!\left(z^{(L)}\right)$
  • CE layer: $E = -\sum_k t_k \log a_k^{(L)}$, with $a^{(L)} = \mathrm{softmax}\!\left(z^{(L)}\right)$
Note that all layers have input $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$.
Define the local sensitivity as $\delta^{(l)} = \dfrac{\partial E}{\partial z^{(l)}}$. Then for $l = 1, \dots, L$:
$\dfrac{\partial E}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^\top, \quad \dfrac{\partial E}{\partial b^{(l)}} = \delta^{(l)}$
Local sensitivity for MSE layer
The outputs of the last-layer units of the MLP are $a_i^{(L)} = f\!\left(z_i^{(L)}\right)$ with $z_i^{(L)} = \sum_j w_{ij}^{(L)} a_j^{(L-1)} + b_i^{(L)}$,
where $a_j^{(L-1)}$ is the output of the units in the $(L-1)$-th layer, and the activation function $f$ can be the logistic sigmoid, tanh, or ReLU.
The error for each sample is $E = \dfrac{1}{2} \sum_i \left(a_i^{(L)} - t_i\right)^2$, thus the local sensitivity is
$\delta_i^{(L)} = \dfrac{\partial E}{\partial z_i^{(L)}} = \left(a_i^{(L)} - t_i\right) f'\!\left(z_i^{(L)}\right)$
Local sensitivity for CE layer
If softmax regression is used in the last layer of an MLP, the probabilistic output becomes
$a_i^{(L)} = \dfrac{e^{z_i^{(L)}}}{\sum_k e^{z_k^{(L)}}}$, with $z_i^{(L)} = \sum_j w_{ij}^{(L)} a_j^{(L-1)} + b_i^{(L)}$,
where $a_j^{(L-1)}$ is the output of the units in the $(L-1)$-th layer.
The error for each sample is $E = -\sum_i t_i \log a_i^{(L)}$, thus the local sensitivity is
$\delta_i^{(L)} = a_i^{(L)} - t_i$
Local sensitivity for other layers
If $l < L$, i.e. neuron $i$ is a hidden neuron, it has an effect on all neurons in the next layer, therefore its local sensitivity is
$\delta_i^{(l)} = \left( \sum_j w_{ji}^{(l+1)} \delta_j^{(l+1)} \right) f'\!\left(z_i^{(l)}\right)$
where $f'\!\left(z_i^{(l)}\right) = \partial a_i^{(l)} / \partial z_i^{(l)}$; $f$ here can be any activation function.
We compute $\delta^{(l)}$ backwards, from $l = L$ down to $l = 1$.
In practice, the sum $\sum_j w_{ji}^{(l+1)} \delta_j^{(l+1)}$ should be calculated in layer $l+1$, because it is determined by the parameters of layer $l+1$.
Vector-matrix form
Local sensitivity
For the output layer $L$:
  • MSE: $\delta^{(L)} = \left(a^{(L)} - t\right) \odot f'\!\left(z^{(L)}\right)$
  • CE: $\delta^{(L)} = a^{(L)} - t$
For hidden layer $l < L$: $\delta^{(l)} = \left( \left(W^{(l+1)}\right)^\top \delta^{(l+1)} \right) \odot f'\!\left(z^{(l)}\right)$
For $l = 1, \dots, L$, the gradients are:
$\dfrac{\partial E}{\partial W^{(l)}} = \delta^{(l)} \left(a^{(l-1)}\right)^\top, \quad \dfrac{\partial E}{\partial b^{(l)}} = \delta^{(l)}$
Update weights:
$W^{(l)} \leftarrow W^{(l)} - \eta \dfrac{\partial E}{\partial W^{(l)}}, \quad b^{(l)} \leftarrow b^{(l)} - \eta \dfrac{\partial E}{\partial b^{(l)}}$
Gradient vanishing
Note that for the hidden layers $l < L$:
  • For the logistic function, $f'(z) = f(z)\,(1 - f(z)) \le 0.25$
  • For the tanh function, $f'(z) = 1 - f(z)^2 \le 1$
For these two sigmoid-like functions, $\lVert \delta^{(l)} \rVert$ gets smaller and smaller as $l$ goes from $L$ down to 1. The gradient approaches zero in the lower layers, which is called gradient vanishing.
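A toy illustration of this effect (the network width, weight scale, and random inputs below are arbitrary choices of ours): backpropagating a random $\delta$ through several sigmoid layers makes its norm shrink rapidly.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n, L = 50, 10
delta = rng.normal(size=n)                 # pretend this is delta at the top layer
for l in range(L):
    W = rng.normal(0.0, 0.1, size=(n, n))  # small random weights
    a = sigmoid(rng.normal(size=n))        # activations of the lower layer
    # delta^(l) = (W^(l+1)^T delta^(l+1)) elementwise-times f'(z^(l)), f'(z) = a(1-a)
    delta = (W.T @ delta) * a * (1.0 - a)
    print(l, np.linalg.norm(delta))        # the norm keeps shrinking
```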
Implementation of BP
  • Run the forward process
    • Calculate $z^{(l)}$ and $a^{(l)}$ for $l = 1, \dots, L$
  • Run the backward process
    • Calculate $\delta^{(l)}$ and the gradients $\dfrac{\partial E}{\partial W^{(l)}}$, $\dfrac{\partial E}{\partial b^{(l)}}$ for $l = L, \dots, 1$
  • Update $W^{(l)}$ and $b^{(l)}$ for $l = 1, \dots, L$
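Putting the pieces together, here is a compact reference sketch of one BP training step for a single sample, assuming sigmoid hidden layers and a softmax output with cross-entropy loss (a minimal sketch, not the official homework implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # shift for numerical stability
    return e / e.sum()

def bp_step(x, t, Ws, bs, eta=0.1):
    """One forward/backward pass and parameter update for a single sample.

    Ws, bs: lists of float weight matrices / bias vectors for layers 1..L.
    x: input vector, t: one-hot target. Hidden layers use the sigmoid,
    the output layer uses softmax with cross-entropy loss.
    """
    # forward process: store a^(0)..a^(L)
    activations, a = [x], x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        z = W @ a + b
        a = softmax(z) if l == len(Ws) - 1 else sigmoid(z)
        activations.append(a)

    # backward process: local sensitivities, from the output layer down
    delta = activations[-1] - t                      # softmax-CE layer: delta^(L) = a^(L) - t
    for l in reversed(range(len(Ws))):
        dW = np.outer(delta, activations[l])         # dE/dW = delta (a of the layer below)^T
        db = delta                                   # dE/db = delta
        if l > 0:                                    # propagate to the sigmoid layer below
            a_prev = activations[l]
            delta = (Ws[l].T @ delta) * a_prev * (1.0 - a_prev)
        Ws[l] -= eta * dW                            # gradient-descent update
        bs[l] -= eta * db
    return Ws, bs
```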

3. Layer Decomposition

3.1 Decomposition Methods

The input layer or hidden layer
$a^{(l)} = f\!\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)$
can be decomposed into two layers:
  • Fully connected layer: $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$
  • Activation layer: $a^{(l)} = f\!\left(z^{(l)}\right)$
The squared error layer
$E = \dfrac{1}{2} \left\lVert f\!\left(W^{(L)} a^{(L-1)} + b^{(L)}\right) - t \right\rVert^2$
can be decomposed into three layers:
  • Fully connected layer: $z^{(L)} = W^{(L)} a^{(L-1)} + b^{(L)}$
  • Activation layer: $a^{(L)} = f\!\left(z^{(L)}\right)$
  • Loss layer: $E = \dfrac{1}{2} \left\lVert a^{(L)} - t \right\rVert^2$
A decomposition example is shown below.
[Figure: an MLP with one hidden layer using MSE, drawn as decomposed layers]
The cross-entropy loss layer
$E = -\sum_k t_k \log \sigma_k\!\left(z^{(L)}\right)$
can be decomposed into two layers:
  • Softmax layer: $a^{(L)} = \sigma\!\left(z^{(L)}\right)$, where $\sigma$ is the softmax function
  • Loss layer: $E = -\sum_k t_k \log a_k^{(L)}$
However, the CE layer is often not decomposed this way but implemented together with the softmax, because the loss $-\sum_k t_k \log a_k^{(L)}$ can only be used with a softmax layer. It is therefore often decomposed into:
  • Fully connected layer: $z^{(L)} = W^{(L)} a^{(L-1)} + b^{(L)}$
  • Activation & loss layer (softmax + cross-entropy): $E = -\sum_k t_k \log \sigma_k\!\left(z^{(L)}\right)$
A decomposition example is shown below.
[Figure: an MLP with one hidden layer using the CE loss, drawn as decomposed layers]
Note: whether an activation layer is needed after the last FC layer depends on the situation.

3.2 BP of decomposed layers

  • Euclidean loss layer: $E = \dfrac{1}{2} \left\lVert a^{(L)} - t \right\rVert^2$
    • According to the decomposition method above, there are L+1 layers in total (the loss is a separate layer after the $L$-th activation layer)
    • We can calculate $\dfrac{\partial E}{\partial a^{(L)}} = a^{(L)} - t$
  • Softmax-cross-entropy (SCE) loss layer: $E = -\sum_k t_k \log a_k^{(L)}$, where $a^{(L)} = \sigma\!\left(z^{(L)}\right)$ (softmax activation function)
    • According to the decomposition method above, there are L layers in total
    • We can calculate $\delta^{(L)} = \dfrac{\partial E}{\partial z^{(L)}} = a^{(L)} - t$ ($\sigma$ denotes the softmax function)
      Proof
      Since we do not decompose the softmax (activation function) and the cross-entropy in the SCE loss layer, the input of this layer is $z^{(L)}$, which is the output of the last FC layer. Now we derive $\dfrac{\partial E}{\partial z_i^{(L)}}$:
      Since $E = -\sum_k t_k \log a_k^{(L)}$, $\dfrac{\partial E}{\partial a_k^{(L)}} = -\dfrac{t_k}{a_k^{(L)}}$.
      Since $a_k^{(L)} = \dfrac{e^{z_k^{(L)}}}{\sum_j e^{z_j^{(L)}}}$:
      • if $k = i$: $\dfrac{\partial a_k^{(L)}}{\partial z_i^{(L)}} = a_i^{(L)} \left(1 - a_i^{(L)}\right)$
      • if $k \neq i$: $\dfrac{\partial a_k^{(L)}}{\partial z_i^{(L)}} = -a_k^{(L)} a_i^{(L)}$
      Therefore $\dfrac{\partial E}{\partial z_i^{(L)}} = \sum_k \dfrac{\partial E}{\partial a_k^{(L)}} \dfrac{\partial a_k^{(L)}}{\partial z_i^{(L)}} = -t_i \left(1 - a_i^{(L)}\right) + \sum_{k \neq i} t_k a_i^{(L)} = a_i^{(L)} \sum_k t_k - t_i = a_i^{(L)} - t_i$
      Therefore $\delta^{(L)} = a^{(L)} - t$
  • Fully connected layer: $z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$, with $\dfrac{\partial E}{\partial a^{(l-1)}} = \left(W^{(l)}\right)^\top \dfrac{\partial E}{\partial z^{(l)}}$, $\dfrac{\partial E}{\partial W^{(l)}} = \dfrac{\partial E}{\partial z^{(l)}} \left(a^{(l-1)}\right)^\top$, $\dfrac{\partial E}{\partial b^{(l)}} = \dfrac{\partial E}{\partial z^{(l)}}$
  • Sigmoid layer: $a^{(l)} = f\!\left(z^{(l)}\right)$, with $\dfrac{\partial E}{\partial z^{(l)}} = \dfrac{\partial E}{\partial a^{(l)}} \odot f\!\left(z^{(l)}\right)\left(1 - f\!\left(z^{(l)}\right)\right)$, where $f$ is a sigmoid function
  • ReLU layer: $a^{(l)} = \max\!\left(0, z^{(l)}\right)$, with $\dfrac{\partial E}{\partial z^{(l)}} = \dfrac{\partial E}{\partial a^{(l)}} \odot \mathbb{1}\!\left[z^{(l)} > 0\right]$, where $\max(0, \cdot)$ is the ReLU function
Note that the gradient with respect to the input of layer $l$ and the gradient with respect to the output of layer $l-1$ are identical in every layer $l$, which is what lets the decomposed layers be chained in the backward pass.
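Following this decomposition, each layer can be written as a small module with `forward` and `backward` methods, where `backward` returns the gradient with respect to the layer's input. Below is a minimal sketch under the formulas above (class and attribute names are our own):

```python
import numpy as np

class FC:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, a_prev):
        self.a_prev = a_prev
        return self.W @ a_prev + self.b          # z = W a + b
    def backward(self, dz, eta):
        dW = np.outer(dz, self.a_prev)           # dE/dW = dz a_prev^T
        db = dz                                  # dE/db = dz
        da_prev = self.W.T @ dz                  # dE/da_prev = W^T dz
        self.W -= eta * dW
        self.b -= eta * db
        return da_prev

class Sigmoid:
    def forward(self, z):
        self.a = 1.0 / (1.0 + np.exp(-z))
        return self.a
    def backward(self, da):
        return da * self.a * (1.0 - self.a)      # dE/dz = dE/da * f'(z)

class ReLU:
    def forward(self, z):
        self.mask = z > 0
        return np.where(self.mask, z, 0.0)
    def backward(self, da):
        return da * self.mask                    # dE/dz = dE/da * 1[z > 0]

class SoftmaxCE:
    """Softmax activation and cross-entropy loss kept together."""
    def forward(self, z, t):
        e = np.exp(z - z.max())
        self.p, self.t = e / e.sum(), t
        return -np.sum(t * np.log(self.p + 1e-12))
    def backward(self):
        return self.p - self.t                   # dE/dz = p - t (see the proof above)
```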
Summary
All layers discussed so far can be decomposed into: input layer, hidden layer, logistic regression layer, softmax regression layer, and loss layer.
There is one exception: the softmax function and the cross-entropy loss function are bound together.

4. Training Techniques

4.1 Weight initialization

Each weight $w$ inputting to a neuron is drawn from a distribution. It can be
  • Gaussian: a Gaussian distribution with zero mean and a fixed std, e.g. 0.01
  • Xavier: a distribution with zero mean and a specific std $\sqrt{1/n_{\text{in}}}$, where $n_{\text{in}}$ is the number of neurons feeding into the neuron. A Gaussian distribution or a uniform distribution is often used
  • MSRA: a Gaussian distribution with zero mean and a specific std $\sqrt{2/n_{\text{in}}}$
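A minimal sketch of these initializers, assuming the fan-in forms given above ($\sqrt{1/n_{\text{in}}}$ for Xavier, $\sqrt{2/n_{\text{in}}}$ for MSRA) and Gaussian sampling throughout:

```python
import numpy as np

def init_weights(n_in, n_out, method="xavier", rng=None):
    """Draw a weight matrix W of shape (n_out, n_in).

    Xavier and MSRA scale by the fan-in n_in; in practice Xavier is
    sometimes drawn from a uniform distribution instead of a Gaussian.
    """
    rng = np.random.default_rng() if rng is None else rng
    if method == "gaussian":
        std = 0.01                      # fixed std, e.g. 0.01
    elif method == "xavier":
        std = np.sqrt(1.0 / n_in)       # std = sqrt(1 / n_in)
    elif method == "msra":
        std = np.sqrt(2.0 / n_in)       # std = sqrt(2 / n_in)
    else:
        raise ValueError(f"unknown method: {method}")
    return rng.normal(0.0, std, size=(n_out, n_in))
```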

4.2 Learning rate

In SGD, the learning rate is typically much smaller than a corresponding learning rate in batch gradient descent because there is much more variance in the update.
Choose the proper schedule:
  • Use a small enough constant learning rate that gives stable convergence in the initial epoch (full pass through the training set) or two of training, and then halve the value of the learning rate as convergence slows down.
  • Evaluate a held-out set after each epoch and anneal the learning rate when the change in objective between epochs is below a small threshold.
  • Anneal the learning rate at each iteration $t$ as $\eta_t = \dfrac{a}{b + t}$, where $a$ and $b$ dictate the initial learning rate and when the annealing begins, respectively.
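A minimal sketch of the last two schedules (the constants `a`, `b`, and `threshold` below are placeholder values, not prescribed by the lecture):

```python
def annealed_lr(t, a=0.01, b=1000.0):
    """Per-iteration annealing: eta_t = a / (b + t)."""
    return a / (b + t)

def maybe_halve(lr, prev_val_loss, val_loss, threshold=1e-4):
    """Halve the learning rate once the held-out objective stops improving."""
    if prev_val_loss - val_loss < threshold:
        return lr * 0.5
    return lr
```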

4.3 Order of training samples

If the data is given in some meaningful order, this can bias the gradient and lead to poor convergence. Generally, a good method to avoid this is to randomly shuffle the data prior to each epoch of training.

4.4 Pathological Curvature & Momentum

[Figures: a loss surface shaped like a long, shallow ravine (left) and SGD oscillating across the ravine (right)]
As shown in the left figure, the objective has the form of a long, shallow ravine leading to the optimum, with steep walls on the sides.
As shown in the right figure, the objectives of deep architectures have this form near local optima, and thus standard SGD tends to oscillate across the narrow ravine.
Momentum is one method for pushing the objective more quickly along the shallow ravine. The momentum update is given by
$v \leftarrow \gamma v + \eta \dfrac{\partial E}{\partial W}, \quad W \leftarrow W - v$
  • $v$ is the current velocity vector, which accumulates the previous gradients
  • $\gamma \in (0, 1]$ determines for how many iterations the previous gradients are incorporated into the current update
  • One strategy: $\gamma$ is set to 0.5 until the initial learning stabilizes and is then increased to 0.9 or higher
An example is shown below, comparing standard gradient descent (left) with gradient descent with momentum (right).
[Figures: standard gradient descent vs. gradient descent with momentum]
Let $\Delta W_t$ denote the change of $W$ at step $t$. The green vector is the change of $W$ in the previous step, $\Delta W_{t-1}$.
  • Standard gradient descent: $\Delta W_t = -\eta \dfrac{\partial E}{\partial W}$
  • Gradient descent with momentum: $\Delta W_t = \gamma \Delta W_{t-1} - \eta \dfrac{\partial E}{\partial W}$
It can be observed that the update direction with momentum points more directly toward the optimum than the standard gradient step.
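A minimal sketch of the momentum update in the $v$/$W$ form above (`gamma` would start at 0.5 and later be raised to 0.9, as suggested):

```python
def momentum_step(W, v, dW, eta=0.01, gamma=0.9):
    """One momentum update: v accumulates previous gradients, W moves by -v."""
    v = gamma * v + eta * dW     # v <- gamma * v + eta * dE/dW
    W = W - v                    # W <- W - v
    return W, v
```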

Appendix: A Detailed implementation of MLP

💡 HW3: MLP Report
