1. Implementation of the Softmax Classifier
Notations:
- $W$: the weights matrix
- $X$: a minibatch of input data ($N$ samples in total)
- $y$: the labels of the minibatch ($N$ samples in total)
- $\lambda$ (`lamda`): the weight decay multiplier
Loss:

$$L(W) = -\frac{1}{N}\sum_{i=1}^{N} \log\frac{e^{W_{y_i}^\top x_i}}{\sum_{j} e^{W_j^\top x_i}} + \frac{\lambda}{2}\lVert W\rVert^2$$

Gradient:

$$\frac{\partial L}{\partial W} = \frac{1}{N}\, X^\top (P - Y) + \lambda W,$$

where $P$ is the $N \times C$ matrix of softmax probabilities and $Y$ is the one-hot encoding of $y$.

Prediction:

$$\hat{y}_i = \arg\max_{j}\, W_j^\top x_i$$
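To make this concrete, below is a minimal NumPy sketch of the three formulas above; the function names (`softmax_loss_and_grad`, `predict`) and the $(D, C)$ weight layout are illustrative assumptions rather than the assignment's actual interface.

```python
import numpy as np

def softmax_loss_and_grad(W, X, y, lamda):
    """Cross-entropy loss with L2 weight decay, and its gradient w.r.t. W.

    W: (D, C) weight matrix, X: (N, D) minibatch, y: (N,) integer class labels.
    """
    N = X.shape[0]
    scores = X @ W                                           # (N, C) class scores
    scores -= scores.max(axis=1, keepdims=True)              # stabilize the exponentials
    exp_scores = np.exp(scores)
    P = exp_scores / exp_scores.sum(axis=1, keepdims=True)   # softmax probabilities

    # cross-entropy over the minibatch plus (lamda / 2) * ||W||^2
    loss = -np.log(P[np.arange(N), y]).mean() + 0.5 * lamda * np.sum(W * W)

    Y = np.zeros_like(P)                                     # one-hot encoding of y
    Y[np.arange(N), y] = 1.0
    grad = X.T @ (P - Y) / N + lamda * W                     # matches the gradient formula above

    return loss, grad


def predict(W, X):
    """Predicted label = arg max over class scores."""
    return np.argmax(X @ W, axis=1)
```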
2. Experiment Results and Discussion
The training loss curve and training accuracy curve with the default hyperparameters (`batch_size=100`, `max_epoch=10`, `learning_rate=0.01`, `lamda=0.5`) are shown below. We then adjust each hyperparameter in turn while holding the others fixed, and compare the results with the default run.
Batch_size
Holding the other hyperparameters the same and changing `batch_size` to 10, we get the results below. It can be observed that although the training process still converges in the end, both the training loss and the training accuracy fluctuate more. This is reasonable because we are using SGD, and at every iteration we update the weights on a single minibatch $B$:

$$W \leftarrow W - \eta\, \frac{\partial L_B}{\partial W}$$

where $L_B$ is the loss computed on minibatch $B$ and $\eta$ is the learning rate. A smaller `batch_size` means more iterations per epoch and noisier, less consistent gradient estimates, which produces the larger fluctuations seen above.
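For reference, here is a minimal sketch of this minibatch SGD loop, reusing the illustrative `softmax_loss_and_grad` from Section 1 (the names and defaults are placeholders, not the report's actual code):

```python
import numpy as np

def train_sgd(W, X_train, y_train, learning_rate=0.01, lamda=0.5,
              batch_size=100, max_epoch=10, seed=0):
    """Plain minibatch SGD: W <- W - learning_rate * dL_B/dW for each minibatch B."""
    rng = np.random.default_rng(seed)
    N = X_train.shape[0]
    loss_history = []
    for epoch in range(max_epoch):
        order = rng.permutation(N)                    # reshuffle the data every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            loss, grad = softmax_loss_and_grad(W, X_train[idx], y_train[idx], lamda)
            W -= learning_rate * grad                 # the update rule shown above
            loss_history.append(loss)
    return W, loss_history
```

With `batch_size=10` this loop performs ten times as many updates per epoch as the default, each on a noisier gradient estimate, which is where the extra fluctuation comes from.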
Learning_rate
Holding the initial weights the same and the other hyperparameters at their defaults, we set `learning_rate` to 0.1, 0.01, and 0.001 separately. To make sure each training run converges, we set `max_epoch=20` here. The results are shown below.

Test accuracy:
| learning rate | 0.1    | 0.01   | 0.001  |
|---------------|--------|--------|--------|
| test accuracy | 0.7982 | 0.8189 | 0.8246 |
It can be observed that the smaller the `learning_rate` is, the more iterations are needed to converge and the less the training curves fluctuate. This is consistent with the SGD update above: a smaller step size scales down every minibatch gradient, so the noisy per-batch gradients perturb the weights less, at the cost of slower convergence.
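A sweep like this one could be reproduced roughly as follows, assuming the placeholder `train_sgd` and `predict` helpers above and hypothetical `X_train`/`y_train`/`X_test`/`y_test` arrays in place of the report's actual data:

```python
# Hypothetical learning-rate sweep; dataset arrays are placeholders.
num_classes = int(y_train.max()) + 1
for lr in (0.1, 0.01, 0.001):
    W0 = np.zeros((X_train.shape[1], num_classes))    # same starting point for every run
    W, _ = train_sgd(W0, X_train, y_train, learning_rate=lr, lamda=0.5,
                     batch_size=100, max_epoch=20)
    test_acc = (predict(W, X_test) == y_test).mean()
    print(f"learning_rate={lr}: test accuracy = {test_acc:.4f}")
```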
Lamda
Holding the initial weights the same and the other hyperparameters at their defaults, we set `lamda` to 0.2, 0.5, and 0.8 separately. The results are shown below.

| lamda         | 0.2    | 0.5    | 0.8    |
|---------------|--------|--------|--------|
| test accuracy | 0.8477 | 0.8250 | 0.8001 |
$\lambda$ (`lamda`), namely the weight decay multiplier, is used for regularization: it helps prevent over-fitting. However, if $\lambda$ is too large, it may make the model under-fit. As the results above show, compared with the default setting of 0.5, the performance is better when setting `lamda=0.2`.
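To make the "weight decay" behaviour explicit, a single SGD step with the regularized gradient from Section 1 can be expanded as

$$W \leftarrow W - \eta\left(\frac{1}{N}\, X^\top (P - Y) + \lambda W\right) = (1 - \eta\lambda)\, W - \frac{\eta}{N}\, X^\top (P - Y),$$

so every update first shrinks the weights by a factor of $(1 - \eta\lambda)$. A larger $\lambda$ shrinks them more aggressively, which suppresses over-fitting but, pushed too far, leaves the model under-fit, matching the accuracies in the table above.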