
L7. Training Techniques


1. Optimizers

1.1 SGD Optimizer and Momentum

SGD optimizes over an individual minibatch at each iteration:
$$\theta_{t+1} = \theta_t - \eta\, g_t, \qquad g_t = \frac{1}{|\mathcal{B}_t|} \sum_{(x,y)\in\mathcal{B}_t} \nabla_\theta\, \ell(\theta_t; x, y)$$
The momentum update is given by
$$v_{t+1} = \rho\, v_t - \eta\, g_t, \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\rho$ (e.g., 0.9 or 0.99) controls how much of the past velocity is kept.
We need to adjust the learning rate during training. Although there are different strategies, tuning the learning rate is expensive.
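A minimal NumPy sketch of the SGD + momentum update above (function name, variable names, and default values are illustrative assumptions):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, rho=0.9):
    """One SGD + momentum step: v <- rho * v - lr * g, theta <- theta + v."""
    velocity = rho * velocity - lr * grads
    params = params + velocity
    return params, velocity

# toy usage on a 3-parameter model
params = np.zeros(3)
velocity = np.zeros_like(params)
grads = np.array([0.1, -0.2, 0.3])   # minibatch gradient of the loss
params, velocity = sgd_momentum_step(params, grads, velocity)
```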

1.2 Adagrad

The above method manipulates the learning rate globally and equally for all parameters. It is possible to adaptively tune the learning rates for individual parameters. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate.
An adaptive learning rate method is Adagrad (Duchi et al., 2011). Denote the gradient $g_t = \nabla_\theta L(\theta_t)$. The update rule is
$$r_t = r_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t$$
where $\epsilon$ is usually set between $10^{-8}$ and $10^{-4}$. $\sqrt{r_t}$ is used to normalize the parameter update step. $\frac{\eta}{\sqrt{r_t} + \epsilon}$ is called the effective learning rate. Parameters that have received small updates will have larger effective learning rates.
Problem: the effective learning rates are monotonically decreasing, which may cause learning to stop too early.
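A minimal NumPy sketch of the Adagrad update above (names and default values are illustrative assumptions):

```python
import numpy as np

def adagrad_step(params, grads, r, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients in r, scale the step per parameter."""
    r = r + grads * grads                               # r_t = r_{t-1} + g_t * g_t
    params = params - lr / (np.sqrt(r) + eps) * grads   # effective lr shrinks over time
    return params, r
```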

1.3 RMSProp

Adagrad: let $r$ accumulate $g \odot g$ over all previous steps.
RMSProp: let $r$ accumulate only the recent $g \odot g$.
In practice:
$$r_t = \beta\, r_{t-1} + (1-\beta)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t$$
Typical values for $\beta$ are 0.9, 0.99, 0.999.
RMSProp still modulates the learning rate of each parameter based on the magnitudes of its gradients, but unlike Adagrad, the updates do not get monotonically smaller.
For Adagrad:
$$r_t = \sum_{i=1}^{t} g_i \odot g_i$$
All $g_i \odot g_i$ contribute equally to $r_t$.
For RMSProp:
$$r_t = (1-\beta) \sum_{i=1}^{t} \beta^{\,t-i}\, g_i \odot g_i$$
Contributions of $g_i \odot g_i$ to $r_t$ decay exponentially as $i$ moves away from $t$.
Introduce Momentum
Let $m$ accumulate the (exponentially weighted) past gradients: $m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t$.
SGD
$$\theta_{t+1} = \theta_t - \eta\, g_t$$
⇒ SGD + momentum
$$\theta_{t+1} = \theta_t - \eta\, m_t$$
RMSProp
$$r_t = \beta_2\, r_{t-1} + (1-\beta_2)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t$$
⇒ RMSProp + momentum
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot m_t$$
Adam
$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad r_t = \beta_2\, r_{t-1} + (1-\beta_2)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot m_t$$
Recommended values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and a learning rate around $10^{-3}$.
Full version ("warm up" version)
$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{r}_t = \frac{r_t}{1-\beta_2^{\,t}}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{r}_t} + \epsilon} \odot \hat{m}_t$$
where $t$ denotes the iteration. At the beginning, $1-\beta^{\,t}$ is small and it increases toward 1 with $t$, which corrects the bias of $m_t$ and $r_t$ toward zero in the early iterations.
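A minimal NumPy sketch of the full (bias-corrected) Adam update above; the defaults follow the recommended values, while the function and variable names are illustrative assumptions:

```python
import numpy as np

def adam_step(params, grads, m, r, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grads            # first moment (momentum)
    r = beta2 * r + (1 - beta2) * grads * grads    # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)                   # bias correction ("warm up")
    r_hat = r / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(r_hat) + eps)
    return params, m, r
```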

1.4 Learning Rate Schedules

Learning rate can significantly affect training. Due to the higher variance of minibatch gradients, the learning rate in SGD is typically much smaller than that in full-batch GD. We can improve optimization and generalization by tuning the learning rate during training.
Learning rate decay
to anneal the learning rate $\eta_t$ at each iteration $t$ (e.g., step decay, exponential decay $\eta_t = \eta_0 e^{-kt}$, or $1/t$ decay $\eta_t = \eta_0/(1+kt)$).
  • Initially large lr:
    • Accelerates training
    • Escapes bad local minima
    • Avoids fitting noisy data
  • Lr decay:
    • Avoids oscillation
    • Converges to local minimum
    • Learns more complex patterns
Learning rate warmup
to increase the learning rate at the beginning of training.
  • The adaptive learning rate (RMSProp, Adam) has undesirably large variance in the early stage, due to the limited amount of training samples being used.
  • Warmup works as a variance reduction technique to stabilize training, accelerate convergence and improve generalization.
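A small sketch combining the two ideas above, linear warmup followed by step decay (the schedule shape and all constants are illustrative assumptions):

```python
def lr_schedule(t, base_lr=0.1, warmup_iters=500, decay_every=10000, decay_factor=0.1):
    """Linear warmup for the first warmup_iters iterations, then step decay."""
    if t < warmup_iters:
        return base_lr * (t + 1) / warmup_iters         # ramp the learning rate up
    n_decays = (t - warmup_iters) // decay_every         # number of decay steps so far
    return base_lr * (decay_factor ** n_decays)

# e.g., lr_schedule(0) is small, lr_schedule(499) == 0.1, lr_schedule(20500) == 0.001
```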

Summary

All optimizers have a learning rate hyperparameter, but the training results may not be very sensitive to its exact setting. In addition, learning rate decay is generally a good strategy.

2. Techniques Dealing with Overfitting

2.1 Weight Regularization (decay)

Consider a regression problem in which the input and output are both scalars. We put a limitation on $\|w\|$ (e.g., by adding a penalty $\frac{\lambda}{2}\|w\|_2^2$ to the loss), which means no dimension of $w$ can be so large that noise in one dimension will be amplified dramatically.
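A minimal sketch of L2 weight regularization implemented as weight decay inside an SGD step (names and the value of $\lambda$ are illustrative assumptions):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, lr=0.01, lam=1e-4):
    """Gradient of L(w) + (lam/2) * ||w||^2 is grad_loss + lam * w,
    so the L2 penalty shrinks ("decays") the weights at every step."""
    return w - lr * (grad_loss + lam * w)
```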
Other techniques for dealing with overfitting include early stopping, dropout, and data augmentation.

2.2 Early Stopping

When the loss on the training set is decreasing but the loss on the validation set begins to increase, overfitting is occurring. Early stopping means stopping training when such a situation is observed.
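A sketch of early stopping on a recorded validation curve, using a patience counter; the function name and the toy numbers are illustrative assumptions:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch with the best validation loss, stopping the scan once the
    loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break                      # validation loss keeps rising: stop training
    return best_epoch

# toy usage: validation loss decreases, then starts to rise (overfitting)
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9], patience=2))  # -> 2
```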

2.3 Dropout

On each presentation of each training case, each hidden unit is randomly omitted from the network (its output is set to zero) with a probability $p$.
These zeroed values are also used in the backward pass when propagating the derivatives to the parameters.
Advantage
  • A hidden unit cannot rely on other hidden units being present, therefore we prevent complex co-adaptations of the neurons on the training data.
  • It trains a huge number of different networks in a reasonable time and then averages their predictions.
Testing phase
Use the β€œmean network” that contains all of the hidden units.
But we need to adjust the outgoing weights of the neurons to compensate for the fact that during training only a portion of them are active:
  • If $p = 0.5$, we halve the weights
  • If $p = 0.1$, we multiply the weights by $1 - p$, i.e., 0.9
In practice, this gives very similar performance to averaging over a large number of dropout networks.
Remarks:
  • In some implementations, during test, $1-p$ is multiplied with the output of the activation function, say $y = f(Wx)$, instead of the weights $W$. Then $\tilde{y} = (1-p)\, y$.
  • In some implementations, the output of the activation function is changed as follows
    • $\tilde{y} = \frac{1}{1-p}\, m \odot y$ with dropout mask $m_i \sim \mathrm{Bernoulli}(1-p)$ during training ("inverted dropout"), while the output of the activation function is unchanged during test.
  • In practice, $p$ is set lower in lower layers, e.g., 0.2, but higher in higher layers, e.g., 0.5
  • In the literature or some software, the dropout rate is sometimes defined as the probability of retaining the output of each node, i.e., $1-p$
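A minimal NumPy sketch of the two variants discussed in the remarks above, standard dropout (scale at test time) and inverted dropout (scale at training time); names are illustrative assumptions:

```python
import numpy as np

def dropout_train(y, p=0.5, inverted=True):
    """Zero each activation in y with probability p during training."""
    mask = (np.random.rand(*y.shape) >= p).astype(y.dtype)  # keep with prob 1 - p
    if inverted:
        return y * mask / (1.0 - p)   # inverted dropout: nothing to change at test time
    return y * mask                    # standard dropout: scale by (1 - p) at test time

def dropout_test(y, p=0.5, inverted=True):
    """Test phase uses the 'mean network'."""
    return y if inverted else y * (1.0 - p)
```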

2.4 Data Augmentation

Let $\mathcal{D} = \{(x_i, y_i)\}$ denote the original training set.
  1. Add variations to the input data, $x_i \to \tilde{x}_i$, while keeping the label $y_i$ unchanged
  2. Use the augmented training set $\tilde{\mathcal{D}} = \{(\tilde{x}_i, y_i)\}$ to train the model
Commonly used variations for images include:
  • flips
  • translations
  • crops and scales
  • stretching
  • shearing
  • cutout or erasing
  • mixup
  • color jittering
or a combination of the above.
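A small NumPy sketch of two variations from the list above, a random horizontal flip followed by a random crop (the H x W x C layout and parameter values are illustrative assumptions):

```python
import numpy as np

def random_flip_and_crop(img, crop=28, rng=np.random):
    """img: H x W x C image array; the label is left unchanged by these transforms."""
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]                            # horizontal flip
    h, w, _ = img.shape
    top = rng.randint(0, h - crop + 1)                   # random crop position
    left = rng.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]

# toy usage on a random 32x32 RGB image (CIFAR-10-sized)
print(random_flip_and_crop(np.random.rand(32, 32, 3)).shape)  # (28, 28, 3)
```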

3. Batch Normalization

3.1 Internal Covariate Shift (ICS)

Since we use SGD, the input mini-batches to the neural network are different at different iterations. This may cause the distribution of the outputs of a layer to differ across iterations.
Internal Covariate Shift (ICS) is the change in the distributions of internal nodes of a deep network, in the course of training. ICS may cause difficulty in optimization.

3.2 Reduce ICS by normalization

We can normalize each scalar feature independently, making it have zero mean and unit variance.
Denote a d-dimensional activation $x = (x^{(1)}, \ldots, x^{(d)})$. Normalize each dimension:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
Keep the representation ability of the layer
$$y^{(k)} = \gamma^{(k)}\, \hat{x}^{(k)} + \beta^{(k)}$$
If $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$, then we recover the original activations. We can construct a new layer named the Batch Normalization (BN) layer:
$$y = \mathrm{BN}_{\gamma, \beta}(x)$$
where $\gamma$ and $\beta$ are learnable parameters.
Forward pass
Over a mini-batch $\mathcal{B} = \{x_1, \ldots, x_m\}$, per dimension:
$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2, \qquad \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta$$
During inference
The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference. Once the network has been trained, we normalize
$$\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
where $\mathbb{E}[x]$ and $\mathrm{Var}[x]$ are measured over the entire training set, e.g.,
$$\mathbb{E}[x] = \mathbb{E}_{\mathcal{B}}[\mu_{\mathcal{B}}], \qquad \mathrm{Var}[x] = \frac{m}{m-1}\, \mathbb{E}_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$$
where $m$ is the size of the mini-batches.
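A minimal NumPy sketch of the BN forward pass above for fully-connected activations, covering both training and inference behavior (the running-average estimates of $\mathbb{E}[x]$ and $\mathrm{Var}[x]$, and all names and defaults, are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, eps=1e-5, momentum=0.9):
    """x: (m, d) mini-batch of d-dimensional activations."""
    if training:
        mu = x.mean(axis=0)                    # mini-batch mean per dimension
        var = x.var(axis=0)                    # mini-batch variance per dimension
        x_hat = (x - mu) / np.sqrt(var + eps)
        # keep running estimates of E[x] and Var[x] for use at inference time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    y = gamma * x_hat + beta                   # scale and shift with learnable parameters
    return y, running_mean, running_var
```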
Location in a network
BN is often applied right before the non-linearity, i.e., directly after the linear transformation of the previous layer (an empirical choice). After a non-linearity, the shape of the activation distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift.
The layer preceding a BN layer is thus always a linear transformation layer (a fully-connected layer or a convolutional layer).
The bias term $b$ can be ignored because BN has a shift term $\beta$ that has the same effect. Therefore
$$z = g(\mathrm{BN}(Wu)) \quad \text{instead of} \quad z = g(Wu + b)$$
For BN in a CNN, it is applied after the convolutional layer. It is required that different elements of the same feature map are normalized in the same way. Suppose the mini-batch size is $m$ and the feature map size is $p \times q$; then the mean and variance are calculated across the $m \cdot p \cdot q$ elements. We learn a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per activation. The inference procedure is modified similarly, so that during inference BN applies the same linear transformation to each activation in a given feature map. For more details, see Ioffe and Szegedy (2015).
However, the reason for the good performance of BN is still debated.

4. Choosing Hyperparameters

Hyperparameters control the algorithm's behavior and are not adapted by the algorithm itself. They often determine the capacity of the model.
For a deep learning model, the hyperparameters include:
  • The number of layers, the number of neurons per layer, etc.
  • Regularization term coefficient
  • Learning rate
  • Weight decay rate
  • Momentum rate

4.1 A practical guide to choose hyperparameters

Step 1: Check initial loss
Ensure the calculation of the loss is correct and set weight decay to zero; at initialization the loss should take its expected value (e.g., about $\ln C$ for a softmax classifier over $C$ classes).
Step 2: Overfit a small sample
  • Train the model on a small set of training samples and try to get 100% training accuracy
    • This can ensure that the code pipeline from data preprocessing to output is correct
    • It can also give some hints for tuning the hyperparameters
  • If the training loss does not go down
    • Possible reasons: bad initialization, learning rate too low, model too small
  • If the training loss explodes to Inf or NaN
    • Possible reasons: bad initialization, learning rate too high
Step 3: Find learning rates that make the loss go down
Train the model on all data and find learning rates that make the loss drop quickly and significantly
  • Decrease the learning rate by a factor of 10, starting from 0.1, whenever the loss decreases too slowly
  • Use small weight decay
Step 4: Coarse grid, train for 1~5 epochs
  • Try a few values around the learning rate and weight decay found in the previous step
  • Good weight decay to try: 1e-4, 1e-5, 0
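A small sketch of the coarse search in Steps 3–4: lay out a grid of learning rates and weight decays, shuffle it, and train each configuration for only a few epochs (all values are illustrative assumptions):

```python
import itertools, random

learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]   # around the value found in Step 3
weight_decays = [0.0, 1e-5, 1e-4]

coarse_grid = list(itertools.product(learning_rates, weight_decays))
random.shuffle(coarse_grid)
for lr, wd in coarse_grid:
    # train for 1-5 epochs with (lr, wd) here and record validation accuracy
    print(f"try lr={lr:g}, weight_decay={wd:g}")
```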
Step 5: Refine grid, train longer
Use the best values found in the previous step and train the model longer (~10-20 epochs) without learning rate decay.
Step 6: Look at loss curves
  • Training loss is usually plotted using a running average
  • Otherwise there would be too many points cluttered together

Appendix: MLP & CNN on MNIST / CIFAR-10

HW4 Report
