1. Layer Decomposition
Following the idea of layer decomposition from Lecture 4, we can decompose a neural network into basic units, as shown in the figure above, each with a similar forward and backward input-output format. The neural network is a stack of basic layers; in this homework (MLP) these are the FC, Sigmoid, ReLU, EuclideanLoss, and SoftmaxCrossEntropyLoss layers. The forward and backward formulas of each basic layer are given below.
Note: every formula operates on a mini-batch because training uses the SGD algorithm, so each layer is described by the batch size and by its input and output sizes.
1.1 FC layer
Forward
Backward
Parameters’ Gradient
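The three FC-layer steps above can be sketched in NumPy as follows. This is only an illustrative sketch, not the homework's reference code: the class name `FCLayer`, the arguments `init_std` and `seed`, and the shape convention (input of shape batch × input size, weight of shape input size × output size) are assumptions. It also assumes that any division by the batch size happens in the loss layer, so the parameter gradients here are plain sums over the batch.

```python
import numpy as np


class FCLayer:
    """Fully connected layer y = x @ W + b (hypothetical sketch)."""

    def __init__(self, in_size, out_size, init_std=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, init_std, size=(in_size, out_size))
        self.b = np.zeros(out_size)
        self.grad_W = np.zeros_like(self.W)
        self.grad_b = np.zeros_like(self.b)
        self._x = None  # input cached by forward() for use in backward()

    def forward(self, x):
        # x: (batch, in_size) -> output: (batch, out_size)
        self._x = x
        return x @ self.W + self.b

    def backward(self, grad_y):
        # grad_y: gradient of the loss w.r.t. the layer output, (batch, out_size)
        self.grad_W = self._x.T @ grad_y   # parameter gradient for W
        self.grad_b = grad_y.sum(axis=0)   # parameter gradient for b
        return grad_y @ self.W.T           # gradient w.r.t. the layer input
```

Storing `grad_W` and `grad_b` on the layer keeps the backward pass and the later parameter update as separate steps.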
1.2 Sigmoid layer
Forward
Backward
Parameters’ Gradient
The Sigmoid layer has no parameters.
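A minimal NumPy sketch of the Sigmoid layer, assuming the same `forward` / `backward` interface as the other layers (the class name `SigmoidLayer` is hypothetical). The backward pass uses the identity that the derivative of the sigmoid is its output times one minus its output, so only the output needs to be cached.

```python
import numpy as np


class SigmoidLayer:
    """Element-wise sigmoid activation (hypothetical sketch)."""

    def __init__(self):
        self._y = None  # output cached by forward() for use in backward()

    def forward(self, x):
        self._y = 1.0 / (1.0 + np.exp(-x))
        return self._y

    def backward(self, grad_y):
        # d sigmoid(x) / dx = y * (1 - y), applied element-wise
        return grad_y * self._y * (1.0 - self._y)
```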
1.3 ReLU layer
Forward
Backward
Parameters’ Gradient
The ReLU layer has no parameters.
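A minimal NumPy sketch of the ReLU layer under the same assumed interface (the class name `ReLULayer` is hypothetical). The gradient at exactly zero is taken to be zero here, which is a common convention but still an assumption.

```python
import numpy as np


class ReLULayer:
    """Element-wise ReLU activation (hypothetical sketch)."""

    def __init__(self):
        self._mask = None  # True where the input was positive

    def forward(self, x):
        self._mask = x > 0
        return np.where(self._mask, x, 0.0)

    def backward(self, grad_y):
        # The gradient passes through unchanged where x > 0 and is zero elsewhere
        return grad_y * self._mask
```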
1.4 EuclideanLoss layer
Forward (loss)
Here the size of each prediction vector equals the number of classes, and the target is the batch’s one-hot label matrix.
Backward
Parameters’ Gradient
The EuclideanLoss layer has no parameters.
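The EuclideanLoss layer could be sketched as below. The scaling is an assumption: this sketch uses one half of the squared error, averaged over the batch, which may differ from the homework's exact definition by a constant factor. The class name `EuclideanLossLayer` is hypothetical.

```python
import numpy as np


class EuclideanLossLayer:
    """Loss = 1/(2N) * sum of squared errors (scaling is an assumption)."""

    def forward(self, y, t):
        # y: predictions (batch, classes); t: one-hot labels (batch, classes)
        n = y.shape[0]
        return 0.5 * np.sum((y - t) ** 2) / n

    def backward(self, y, t):
        # Gradient of the loss w.r.t. the predictions y
        return (y - t) / y.shape[0]
```

With this scaling, the gradient is simply the prediction error divided by the batch size.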
1.5 SoftmaxCrossEntropyLoss layer
Forward (loss)
where
Here the size of each prediction vector equals the number of classes, and the target is the batch’s one-hot label matrix.
Backward
Parameters’ Gradient
The SoftmaxCrossEntropyLoss layer has no parameters.
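A sketch of the SoftmaxCrossEntropyLoss layer, assuming the loss is averaged over the batch (the class name `SoftmaxCrossEntropyLossLayer` and that averaging convention are assumptions). Subtracting the row-wise maximum before exponentiating does not change the softmax output but avoids overflow for large inputs.

```python
import numpy as np


class SoftmaxCrossEntropyLossLayer:
    """Softmax followed by cross-entropy, averaged over the batch (sketch)."""

    def __init__(self):
        self._p = None  # softmax probabilities cached by forward()
        self._t = None  # one-hot labels cached by forward()

    def forward(self, x, t):
        # x: raw scores (batch, classes); t: one-hot labels (batch, classes)
        shifted = x - x.max(axis=1, keepdims=True)   # numerical stability
        exp = np.exp(shifted)
        sum_exp = exp.sum(axis=1, keepdims=True)
        self._p = exp / sum_exp
        self._t = t
        log_p = shifted - np.log(sum_exp)            # log-softmax
        return -np.sum(t * log_p) / x.shape[0]

    def backward(self):
        # Combined gradient of softmax + cross-entropy w.r.t. the raw scores
        return (self._p - self._t) / self._t.shape[0]
```

Combining the two operations in one layer yields the simple gradient "probabilities minus labels", which is why they are usually implemented together.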
2. MLP Construction
A complete MLP can be constructed by stacking the basic layers above. The example discussed below has the architecture
[FC1 → Sigmoid → FC2 → MSE]
2.1 Forward Calculation
The figure above shows the forward calculation process.
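For the example architecture [FC1 → Sigmoid → FC2 → MSE], the forward pass can be sketched as a chain of the layer formulas. The function name `mlp_forward`, the `params` dictionary keys, and the 1/(2N) loss scaling are assumptions of this sketch, and MSE is taken to mean the EuclideanLoss layer.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mlp_forward(x, t, params):
    """Forward pass of FC1 -> Sigmoid -> FC2 -> MSE (hypothetical sketch)."""
    h1 = x @ params["W1"] + params["b1"]             # FC1
    a1 = sigmoid(h1)                                 # Sigmoid
    y = a1 @ params["W2"] + params["b2"]             # FC2
    loss = 0.5 * np.sum((y - t) ** 2) / x.shape[0]   # Euclidean (MSE) loss
    cache = {"x": x, "a1": a1, "y": y, "t": t}       # values needed by backward
    return loss, cache
```

Each layer's output is the next layer's input, and the intermediate values are kept because the backward pass reuses them.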
2.2 Backward Propagation
The figure above shows the backward propagation process.
During BP, the parameters’ gradients are also calculated and stored for the subsequent update.
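The backward pass for the same example visits the layers in reverse order, passing each layer's input gradient to the layer before it. In this sketch the function name `mlp_backward`, its argument list, and the 1/(2N) loss scaling are assumptions; `a1` is the Sigmoid output and `y` the FC2 output saved during the forward pass.

```python
import numpy as np


def mlp_backward(x, a1, y, t, W2):
    """Backward pass of FC1 -> Sigmoid -> FC2 -> MSE (hypothetical sketch)."""
    n = x.shape[0]
    grad_y = (y - t) / n                   # MSE backward
    grad_W2 = a1.T @ grad_y                # FC2 parameter gradients
    grad_b2 = grad_y.sum(axis=0)
    grad_a1 = grad_y @ W2.T                # FC2 backward (to Sigmoid output)
    grad_h1 = grad_a1 * a1 * (1.0 - a1)    # Sigmoid backward
    grad_W1 = x.T @ grad_h1                # FC1 parameter gradients
    grad_b1 = grad_h1.sum(axis=0)
    return {"W1": grad_W1, "b1": grad_b1, "W2": grad_W2, "b2": grad_b2}
```

A finite-difference comparison on a few parameter entries is a cheap way to check such a hand-written backward pass.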
2.3 Parameters Updating (Optimizer)
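Take FC2 as an example: during BP, the gradients of the loss with respect to its weight and bias are calculated and stored.
The optimizer then moves each parameter against its gradient, with the step size scaled by the learning rate.
Sections 2.1 to 2.3 together constitute one complete training iteration.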
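The update step can be sketched as plain SGD, assuming no momentum or weight decay (the report does not say whether either is used). The function name `sgd_step` and the dictionary layout are hypothetical.

```python
import numpy as np


def sgd_step(params, grads, learning_rate):
    """Plain SGD: move every parameter against its stored gradient, in place."""
    for name in params:
        params[name] -= learning_rate * grads[name]


# Usage with made-up values for FC2's weight and bias
params = {"W2": np.ones((2, 2)), "b2": np.zeros(2)}
grads = {"W2": np.full((2, 2), 0.5), "b2": np.array([1.0, -1.0])}
sgd_step(params, grads, learning_rate=0.1)
```

Because the gradients are stored during BP, the optimizer needs no knowledge of the layer internals.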
3. Experiment Results
Experiment 1: MLP with Euclidean Loss using Sigmoid / ReLU activation function
There is only one hidden layer in this experiment. The training loss and accuracy curves for the Sigmoid and ReLU activation functions are shown below.
The test accuracy of the MLP using the Sigmoid activation function is 0.9245, while it is 0.9636 for the MLP using the ReLU activation function. There is no significant difference in training time between the two architectures (both 20 epochs, ~14 min on Colab).
The figures and test accuracies show that ReLU is the better choice of activation function here, giving higher accuracy and more desirable training behavior. A likely explanation is gradient vanishing: the derivative of Sigmoid never exceeds 0.25 and approaches zero when the unit saturates, whereas ReLU passes the gradient through unchanged for positive inputs and so does not suffer from this issue to the same degree.
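The gradient-vanishing argument can be illustrated by comparing the local derivatives of the two activation functions at a few sample inputs. The helper names and the sample points are chosen only for illustration.

```python
import numpy as np


def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of ReLU, taking the value 0 at x == 0 by convention
    return (x > 0).astype(float)


x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print("sigmoid'(x):", sigmoid_grad(x))
print("relu'(x):   ", relu_grad(x))
```

The Sigmoid derivative peaks at 0.25 for a zero input and shrinks quickly as the input moves away from zero, while the ReLU derivative stays at 1 for every positive input.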
Experiment 2: MLP with Softmax Cross-Entropy Loss using Sigmoid / ReLU activation function
There is only one hidden layer in this experiment. The training loss and accuracy curves for the Sigmoid and ReLU activation functions are shown below.
The test accuracy of the MLP using the Sigmoid activation function is 0.9347, while it is 0.9744 for the MLP using the ReLU activation function. There is no significant difference in training time between the two architectures (both 20 epochs, ~14 min on Colab).
As in Experiment 1, ReLU outperforms Sigmoid, and the explanation is the same. In addition, the MLP with Softmax Cross-Entropy loss outperforms the MLP with Euclidean loss for both activation functions (0.9347 vs. 0.9245 with Sigmoid, 0.9744 vs. 0.9636 with ReLU).
Experiment 3: Deeper or Wider
In this experiment, two MLPs are constructed: the deep one has two hidden layers, while the wide one has only one hidden layer. They have nearly the same number of connections between neurons (a comparable number of parameters):
- Deep MLP:
[FC(784, 256) → Sigmoid → FC(256, 128) → Sigmoid → FC(128, 10) → SoftmaxCross-Entropy Loss]
- Wide MLP:
[FC(784, 296) → Sigmoid → FC(296, 10) → SoftmaxCross-Entropy Loss]
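The claim that the two networks are comparable in size can be checked by counting weights and biases from the layer sizes listed above. The helper `count_parameters` is hypothetical, and it assumes every FC layer has one bias per output unit.

```python
def count_parameters(layer_sizes):
    """Return (weights, biases) for a chain of FC layers with the given sizes."""
    weights = sum(i * o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases


deep = count_parameters([784, 256, 128, 10])
wide = count_parameters([784, 296, 10])
print("deep:", deep, "total:", sum(deep))
print("wide:", wide, "total:", sum(wide))
```

Under these assumptions the Deep MLP has 234,752 weights and 394 biases (235,146 parameters), and the Wide MLP has 235,024 weights and 306 biases (235,330 parameters), so the two differ by well under 0.1%.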
Training loss and accuracy curves are shown below. The test accuracies of the Deep MLP and the Wide MLP are both 0.9788. There is no significant difference in training time between the two architectures (both 20 epochs, ~29 min on Colab).
From the figures above, we can observe that the Deep MLP performs better than the Wide MLP in the training curves, even though both have a comparable number of parameters and reach the same test accuracy. The Deep MLP has two hidden layers, each followed by an activation function (a non-linear function), while the Wide MLP has only one hidden layer and one activation function. An extra activation function can help the network fit non-linear functions better, resulting in higher capacity.