
L7. Training Techniques


1. Optimizers

1.1 SGD Optimizer and Momentum

SGD optimizes over an individual minibatch at each iteration:
$$\theta_{t+1} = \theta_t - \eta\, g_t, \qquad g_t = \frac{1}{|\mathcal{B}_t|} \sum_{(x,y)\in\mathcal{B}_t} \nabla_\theta\, \ell(\theta_t; x, y)$$
The momentum update is given by
$$v_{t+1} = \rho\, v_t - \eta\, g_t, \qquad \theta_{t+1} = \theta_t + v_{t+1}$$
where $\rho$ (e.g., 0.9 or 0.99) controls how much of the past velocity is kept.
We need to adjust the learning rate during training. Although there are different strategies, tuning the learning rate is expensive.
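A minimal NumPy sketch of the SGD + momentum update above (function name, variable names, and default values are illustrative assumptions):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, rho=0.9):
    """One SGD + momentum step: v <- rho * v - lr * g, theta <- theta + v."""
    velocity = rho * velocity - lr * grads
    params = params + velocity
    return params, velocity

# toy usage on a 3-parameter model
params = np.zeros(3)
velocity = np.zeros_like(params)
grads = np.array([0.1, -0.2, 0.3])   # minibatch gradient of the loss
params, velocity = sgd_momentum_step(params, grads, velocity)
```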

1.2 Adagrad

The above method manipulates the learning rate globally and equally for all parameters. It is possible to adaptively tune the learning rates for individual parameters. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate.
An adaptive learning rate method is Adagrad (Duchi et al., 2011). Denote the gradient $g_t = \nabla_\theta L(\theta_t)$. The update rule is
$$r_t = r_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t$$
where $\epsilon$ is usually set between $10^{-8}$ and $10^{-4}$. $\sqrt{r_t}$ is used to normalize the parameter update step. $\frac{\eta}{\sqrt{r_t} + \epsilon}$ is called the effective learning rate. Parameters that have received small updates will have larger effective learning rates.
Problem: the effective learning rates are monotonically decreasing, which may cause learning to stop too early.
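A minimal NumPy sketch of the Adagrad update above (names and default values are illustrative assumptions):

```python
import numpy as np

def adagrad_step(params, grads, r, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients in r, scale the step per parameter."""
    r = r + grads * grads                               # r_t = r_{t-1} + g_t * g_t
    params = params - lr / (np.sqrt(r) + eps) * grads   # effective lr shrinks over time
    return params, r
```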

1.3 RMSProp

Adagrad: let $r$ accumulate $g \odot g$ over all previous steps.
RMSProp: let $r$ accumulate only the recent $g \odot g$.
In practice:
$$r_t = \beta\, r_{t-1} + (1-\beta)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t$$
Typical values for $\beta$ are 0.9, 0.99, 0.999.
RMSProp still modulates the learning rate of each parameter based on the magnitudes of its gradients, but unlike Adagrad, the updates do not get monotonically smaller.
For Adagrad:
$$r_t = \sum_{i=1}^{t} g_i \odot g_i$$
All $g_i \odot g_i$ contribute equally to $r_t$.
For RMSProp:
$$r_t = (1-\beta) \sum_{i=1}^{t} \beta^{\,t-i}\, g_i \odot g_i$$
Contributions of $g_i \odot g_i$ to $r_t$ decay exponentially as $i$ moves away from $t$.
Introduce Momentum
Let $m$ accumulate the (exponentially weighted) past gradients: $m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t$.
SGD
$$\theta_{t+1} = \theta_t - \eta\, g_t$$
⇒ SGD + momentum
$$\theta_{t+1} = \theta_t - \eta\, m_t$$
RMSProp
$$r_t = \beta_2\, r_{t-1} + (1-\beta_2)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t$$
⇒ RMSProp + momentum
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot m_t$$
Adam
$$m_t = \beta_1\, m_{t-1} + (1-\beta_1)\, g_t, \qquad r_t = \beta_2\, r_{t-1} + (1-\beta_2)\, g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot m_t$$
Recommended values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and a learning rate around $10^{-3}$.
Full version ("warm up" version)
$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{r}_t = \frac{r_t}{1-\beta_2^{\,t}}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{r}_t} + \epsilon} \odot \hat{m}_t$$
where $t$ denotes the iteration. At the beginning, $1-\beta^{\,t}$ is small and it increases toward 1 with $t$, which corrects the bias of $m_t$ and $r_t$ toward zero in the early iterations.
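A minimal NumPy sketch of the full (bias-corrected) Adam update above; the defaults follow the recommended values, while the function and variable names are illustrative assumptions:

```python
import numpy as np

def adam_step(params, grads, m, r, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grads            # first moment (momentum)
    r = beta2 * r + (1 - beta2) * grads * grads    # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)                   # bias correction ("warm up")
    r_hat = r / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(r_hat) + eps)
    return params, m, r
```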

1.4 Learning Rate Schedules

Learning rate can significantly affect training. Due to the higher variance of minibatch gradients, the learning rate in SGD is typically much smaller than that in full-batch GD. We can improve optimization and generalization by tuning the learning rate during training.
Learning rate decay
to anneal the learning rate $\eta_t$ at each iteration $t$ (e.g., step decay, exponential decay $\eta_t = \eta_0 e^{-kt}$, or $1/t$ decay $\eta_t = \eta_0/(1+kt)$).
  • Initially large lr:
    • Accelerates training
    • Escapes bad local minima
    • Avoids fitting noisy data
  • Lr decay:
    • Avoids oscillation
    • Converges to local minimum
    • Learns more complex patterns
Learning rate warmup
to increase the learning rate at the beginning of training.
  • The adaptive learning rate (RMSProp, Adam) has undesirably large variance in the early stage, due to the limited amount of training samples being used.
  • Warmup works as a variance reduction technique to stabilize training, accelerate convergence and improve generalization.
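A small sketch combining the two ideas above, linear warmup followed by step decay (the schedule shape and all constants are illustrative assumptions):

```python
def lr_schedule(t, base_lr=0.1, warmup_iters=500, decay_every=10000, decay_factor=0.1):
    """Linear warmup for the first warmup_iters iterations, then step decay."""
    if t < warmup_iters:
        return base_lr * (t + 1) / warmup_iters         # ramp the learning rate up
    n_decays = (t - warmup_iters) // decay_every         # number of decay steps so far
    return base_lr * (decay_factor ** n_decays)

# e.g., lr_schedule(0) is small, lr_schedule(499) == 0.1, lr_schedule(20500) == 0.001
```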

Summary

All optimizers have a learning rate hyperparameter, but the training results may not be very sensitive to its exact setting. In addition, learning rate decay is generally a good strategy.

2. Techniques Dealing with Overfitting

2.1 Weight Regularization (decay)

Consider a regression problem in which the input and output are both scalars. We put a limitation on $\|w\|$ (e.g., by adding a penalty $\frac{\lambda}{2}\|w\|_2^2$ to the loss), which means no dimension of $w$ can be so large that noise in one dimension will be amplified dramatically.
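A minimal sketch of L2 weight regularization implemented as weight decay inside an SGD step (names and the value of $\lambda$ are illustrative assumptions):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, lr=0.01, lam=1e-4):
    """Gradient of L(w) + (lam/2) * ||w||^2 is grad_loss + lam * w,
    so the L2 penalty shrinks ("decays") the weights at every step."""
    return w - lr * (grad_loss + lam * w)
```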
Other techniques for dealing with overfitting include early stopping, dropout, and data augmentation.

2.2 Early Stopping

When the loss on the training set is decreasing but the loss on the validation set begins to increase, overfitting is occurring. Early stopping means stopping training when such a situation is observed.
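A sketch of early stopping on a recorded validation curve, using a patience counter; the function name and the toy numbers are illustrative assumptions:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the epoch with the best validation loss, stopping the scan once the
    loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break                      # validation loss keeps rising: stop training
    return best_epoch

# toy usage: validation loss decreases, then starts to rise (overfitting)
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.75, 0.9], patience=2))  # -> 2
```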

2.3 Dropout

On each presentation of each training case, each hidden unit is randomly omitted from the network (its output is set to zero) with a probability $p$.
These zeroed values are also used in the backward pass when propagating the derivatives to the parameters.
Advantage
  • A hidden unit cannot rely on other hidden units being present, therefore we prevent complex co-adaptations of the neurons on the training data.
  • It trains a huge number of different networks in a reasonable time and then averages their predictions.
Testing phase
Use the β€œmean network” that contains all of the hidden units.
But we need to adjust the outgoing weights of the neurons to compensate for the fact that during training only a portion of them are active:
  • If $p = 0.5$, we halve the weights
  • If $p = 0.1$, we multiply the weights by $1 - p$, i.e., 0.9
In practice, this gives very similar performance to averaging over a large number of dropout networks.
Remarks:
  • In some implementations, during test, $1-p$ is multiplied with the output of the activation function, say $y = f(Wx)$, instead of the weights $W$. Then $\tilde{y} = (1-p)\, y$.
  • In some implementations, the output of the activation function is changed as follows
    • $\tilde{y} = \frac{1}{1-p}\, m \odot y$ with dropout mask $m_i \sim \mathrm{Bernoulli}(1-p)$ during training ("inverted dropout"), while the output of the activation function is unchanged during test.
  • In practice, $p$ is set lower in lower layers, e.g., 0.2, but higher in higher layers, e.g., 0.5
  • In the literature or some software, the dropout rate is sometimes defined as the probability of retaining the output of each node, i.e., $1-p$
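A minimal NumPy sketch of the two variants discussed in the remarks above, standard dropout (scale at test time) and inverted dropout (scale at training time); names are illustrative assumptions:

```python
import numpy as np

def dropout_train(y, p=0.5, inverted=True):
    """Zero each activation in y with probability p during training."""
    mask = (np.random.rand(*y.shape) >= p).astype(y.dtype)  # keep with prob 1 - p
    if inverted:
        return y * mask / (1.0 - p)   # inverted dropout: nothing to change at test time
    return y * mask                    # standard dropout: scale by (1 - p) at test time

def dropout_test(y, p=0.5, inverted=True):
    """Test phase uses the 'mean network'."""
    return y if inverted else y * (1.0 - p)
```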

2.4 Data Augmentation

Let $\mathcal{D} = \{(x_i, y_i)\}$ denote the original training set.
  1. Add variations to the input data, $x_i \to \tilde{x}_i$, while keeping the label $y_i$ unchanged
  2. Use the augmented training set $\tilde{\mathcal{D}} = \{(\tilde{x}_i, y_i)\}$ to train the model
Commonly used variations for images include:
  • flips
  • translations
  • crops and scales
  • stretching
  • shearing
  • cutout or erasing
  • mixup
  • color jittering
or a combination of the above.
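A small NumPy sketch of two variations from the list above, a random horizontal flip followed by a random crop (the H x W x C layout and parameter values are illustrative assumptions):

```python
import numpy as np

def random_flip_and_crop(img, crop=28, rng=np.random):
    """img: H x W x C image array; the label is left unchanged by these transforms."""
    if rng.rand() < 0.5:
        img = img[:, ::-1, :]                            # horizontal flip
    h, w, _ = img.shape
    top = rng.randint(0, h - crop + 1)                   # random crop position
    left = rng.randint(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]

# toy usage on a random 32x32 RGB image (CIFAR-10-sized)
print(random_flip_and_crop(np.random.rand(32, 32, 3)).shape)  # (28, 28, 3)
```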

3. Batch Normalization

3.1 Internal Covariate Shift (ICS)

Since we use SGD, the input mini-batches to the neural network are different at different iterations. This may cause the distribution of the outputs of a layer to differ across iterations.
Internal Covariate Shift (ICS) is the change in the distributions of internal nodes of a deep network, in the course of training. ICS may cause difficulty in optimization.

3.2 Reduce ICS by normalization

We can normalize each scalar feature independently, making it have zero mean and unit variance.
Denote a d-dimensional activation $x = (x^{(1)}, \ldots, x^{(d)})$. Normalize each dimension:
$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathbb{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}$$
Keep the representation ability of the layer
$$y^{(k)} = \gamma^{(k)}\, \hat{x}^{(k)} + \beta^{(k)}$$
If $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathbb{E}[x^{(k)}]$, then we recover the original activations. We can construct a new layer named the Batch Normalization (BN) layer:
$$y = \mathrm{BN}_{\gamma, \beta}(x)$$
where $\gamma$ and $\beta$ are learnable parameters.
Forward pass
Over a mini-batch $\mathcal{B} = \{x_1, \ldots, x_m\}$, per dimension:
$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2, \qquad \hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y_i = \gamma\, \hat{x}_i + \beta$$
During inference
The normalization of activations that depends on the mini-batch allows efficient training, but is neither necessary nor desirable during inference. Once the network has been trained, we normalize
$$\hat{x} = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}}$$
where $\mathbb{E}[x]$ and $\mathrm{Var}[x]$ are measured over the entire training set, e.g.,
$$\mathbb{E}[x] = \mathbb{E}_{\mathcal{B}}[\mu_{\mathcal{B}}], \qquad \mathrm{Var}[x] = \frac{m}{m-1}\, \mathbb{E}_{\mathcal{B}}[\sigma_{\mathcal{B}}^2]$$
where $m$ is the size of the mini-batches.
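A minimal NumPy sketch of the BN forward pass above for fully-connected activations, covering both training and inference behavior (the running-average estimates of $\mathbb{E}[x]$ and $\mathrm{Var}[x]$, and all names and defaults, are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, eps=1e-5, momentum=0.9):
    """x: (m, d) mini-batch of d-dimensional activations."""
    if training:
        mu = x.mean(axis=0)                    # mini-batch mean per dimension
        var = x.var(axis=0)                    # mini-batch variance per dimension
        x_hat = (x - mu) / np.sqrt(var + eps)
        # keep running estimates of E[x] and Var[x] for use at inference time
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    y = gamma * x_hat + beta                   # scale and shift with learnable parameters
    return y, running_mean, running_var
```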
Location in a network
BN is often applied right before the non-linearity, i.e., directly after the linear transformation of the previous layer (an empirical choice). After a non-linearity, the shape of the activation distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift.
The layer preceding a BN layer is thus always a linear transformation layer (a fully-connected layer or a convolutional layer).
The bias term $b$ can be ignored because BN has a shift term $\beta$ that has the same effect. Therefore
$$z = g(\mathrm{BN}(Wu)) \quad \text{instead of} \quad z = g(Wu + b)$$
For BN in a CNN, it is applied after the convolutional layer. It is required that different elements of the same feature map are normalized in the same way. Suppose the mini-batch size is $m$ and the feature map size is $p \times q$; then the mean and variance are calculated across the $m \cdot p \cdot q$ elements. We learn a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ per feature map, rather than per activation. The inference procedure is modified similarly, so that during inference BN applies the same linear transformation to each activation in a given feature map. For more details, see Ioffe and Szegedy (2015).
However, the reason for the good performance of BN is still debated.

4. Choosing Hyperparameters

Hyperparameters control the algorithm's behavior and are not adapted by the algorithm itself. They often determine the capacity of the model.
For a deep learning model, the hyperparameters include:
  • The number of layers, the number of neurons per layer, etc.
  • Regularization term coefficient
  • Learning rate
  • Weight decay rate
  • Momentum rate

4.1 A practical guide to choose hyperparameters

Step 1: Check initial loss
Ensure the calculation of the loss is correct and set weight decay to zero; at initialization the loss should take its expected value (e.g., about $\ln C$ for a softmax classifier over $C$ classes).
Step 2: Overfit a small sample
  • Train the model on a small set of training samples and try to get 100% training accuracy
    • This can ensure that the code pipeline from data preprocessing to output is correct
    • It can also give some hints for tuning the hyperparameters
  • If the training loss does not go down
    • Possible reasons: bad initialization, learning rate too low, model too small
  • If the training loss explodes to Inf or NaN
    • Possible reasons: bad initialization, learning rate too high
Step 3: Find learning rates that make the loss go down
Train the model on all data and find learning rates that make the loss drop quickly and significantly
  • Decrease the learning rate by a factor of 10, starting from 0.1, whenever the loss decreases too slowly
  • Use small weight decay
Step 4: Coarse grid, train for 1~5 epochs
  • Try a few values around the learning rate and weight decay found in the previous step
  • Good weight decay to try: 1e-4, 1e-5, 0
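A small sketch of the coarse search in Steps 3–4: lay out a grid of learning rates and weight decays, shuffle it, and train each configuration for only a few epochs (all values are illustrative assumptions):

```python
import itertools, random

learning_rates = [1e-1, 1e-2, 1e-3, 1e-4]   # around the value found in Step 3
weight_decays = [0.0, 1e-5, 1e-4]

coarse_grid = list(itertools.product(learning_rates, weight_decays))
random.shuffle(coarse_grid)
for lr, wd in coarse_grid:
    # train for 1-5 epochs with (lr, wd) here and record validation accuracy
    print(f"try lr={lr:g}, weight_decay={wd:g}")
```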
Step 5: Refine grid, train longer
Use the best values found in the previous step and train the model longer (~10-20 epochs) without learning rate decay.
Step 6: Look at loss curves
  • Training loss is usually plotted using a running average
  • Otherwise there would be too many points cluttered together

Appendix: MLP & CNN on MNIST / CIFAR-10

HW4 Report
