Notebook: [hw5_pipeline.ipynb](https://drive.google.com/file/d/1cDCK4azkhDrDgjhdsWkgQOaY8DjL_27N/view?usp=sharing)
1. Data Description
In this homework, we use the Stanford Sentiment Treebank dataset with 5 kinds of labels (SST-5). The original dataset is in tree form; we can use the `pytreebank` package to convert it into tabular form. Using `torchtext` (version 0.9.0 in the notebook), we can then load SST-5 and process it to obtain the usual dataloaders. The labels of the dataset have five categories: {'positive': 1, 'negative': 2, 'neutral': 3, 'very positive': 4, 'very negative': 5}. The distributions of labels in train_set, valid_set and test_set, together with the distribution of sentence lengths, are shown below.
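A minimal sketch of the tree-to-table conversion step, assuming the standard `pytreebank` loader; the output file names and column names are illustrative choices rather than the notebook's exact code:

```python
# Export SST-5 from its tree form to tabular (CSV) form with pytreebank.
import csv
import pytreebank

sst = pytreebank.load_sst()              # dict with 'train', 'dev', 'test' splits
for split in ("train", "dev", "test"):
    with open(f"sst5_{split}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "text"])
        for tree in sst[split]:
            # to_labeled_lines() yields (label, sentence) pairs for every node;
            # the first pair corresponds to the full sentence.
            label, text = tree.to_labeled_lines()[0]
            writer.writerow([label, text])
```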
2. Models
2.1 RNN, LSTM and GRU Cells
RNN, LSTM and GRU cells all have the following structure. $h_0$ and (for LSTM) $c_0$ are the initial hidden states, serving as inputs to the first hidden layer. As the input is fed in sequentially, we obtain the output sequence $o_1, \dots, o_T$ and the last hidden state $h_T$. LSTM additionally produces a cell state $c_t$ at each step and $c_T$ at the end, while RNN and GRU do not. More details are elaborated in the course notes.
In PyTorch's `nn` module these basic cells are already implemented, i.e. `nn.RNN`, `nn.GRU` and `nn.LSTM` (a short usage sketch follows this list). Critical hyper-parameters include:
- input_size: length of the embedded vector
- hidden_size: length of the layer’s output (for each entry)
- num_layers: how many basic cells are stacked
- bidirectional: whether to use the bidirectional structure or not.
- The bidirectional structure is shown in the following figure. Compared to the single-directional structure, in which each state only receives hidden states from previous inputs, the bidirectional structure allows each state to receive hidden states from both previous and following inputs.
- dropout: the dropout rate applied to each layer's output, used for regularization
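As an illustration, here is a minimal sketch of instantiating one of these cells with the hyper-parameters above; the concrete values mirror the settings used later in Section 3.1, and the random tensor is only there to check shapes:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(
    input_size=300,      # E: length of the embedded vector
    hidden_size=64,      # K: length of each layer's output
    num_layers=3,        # number of stacked cells
    bidirectional=True,  # process the sequence in both directions
    dropout=0.4,         # dropout on each layer's output except the last
    batch_first=True,    # tensors shaped (N, LEN, features)
)

x = torch.randn(64, 25, 300)     # dummy batch of shape (N, LEN, E)
out, (h_n, c_n) = lstm(x)        # LSTM also returns the cell state c_n
print(out.shape)                 # (64, 25, 128): 2 directions * K per step
print(h_n.shape, c_n.shape)      # (6, 64, 64): num_layers * 2 directions, N, K
```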
2.2 Network Architecture
The complete network architecture is the same for RNN, LSTM and GRU in this experiment, apart from the basic cell used.
where N is the batch size, K is the hidden size, E is the length of the embedded vectors and LEN is the sentence length of the batch (more details about `torchtext.data.BucketIterator` below; simply speaking, it aggregates sentences of similar lengths together and pads them to a fixed length to obtain a batch).
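A minimal sketch of this batching step, assuming the legacy Field/BucketIterator API (which torchtext 0.9.0 keeps under `torchtext.legacy.data`) and the CSV files produced in Section 1; paths, field names and variable names are placeholders:

```python
import torch
from torchtext.legacy import data

TEXT = data.Field(batch_first=True, lower=True)   # tokenized sentence
LABEL = data.LabelField(dtype=torch.long)         # one of the 5 classes

train_ds, valid_ds, test_ds = data.TabularDataset.splits(
    path=".", train="sst5_train.csv", validation="sst5_dev.csv",
    test="sst5_test.csv", format="csv", skip_header=True,
    fields=[("label", LABEL), ("text", TEXT)],
)
TEXT.build_vocab(train_ds)
LABEL.build_vocab(train_ds)

train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train_ds, valid_ds, test_ds),
    batch_size=64,
    sort_key=lambda ex: len(ex.text),   # group sentences of similar length
    sort_within_batch=True,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)
# Each batch.text then has shape (N, LEN), with N = 64 and LEN the padded
# length shared by the sentences in that batch.
```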
3. Experiment Results
3.1 Architecture and Training Settings
```
RNNnet(
  (embedding): Embedding(18280, 300)
  (rnn): RNN(300, 64, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (fc): Linear(in_features=384, out_features=5, bias=True)
)  # i.e. K=64, E=300 (N=64, LEN is variable for each batch)

LSTMnet(
  (embedding): Embedding(18280, 300)
  (lstm): LSTM(300, 64, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (fc): Linear(in_features=384, out_features=5, bias=True)
)  # i.e. K=64, E=300 (N=64, LEN is variable for each batch)

GRUnet(
  (embedding): Embedding(18280, 300)
  (gru): GRU(300, 64, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (fc): Linear(in_features=384, out_features=5, bias=True)
)  # i.e. K=64, E=300 (N=64, LEN is variable for each batch)
```
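For concreteness, below is a hedged reconstruction of the LSTM variant printed above: the in_features=384 of the fully-connected layer suggests that the final hidden states of all 3 layers and both directions (3 × 2 × 64 = 384) are concatenated before classification, though the original notebook may combine them differently.

```python
import torch
import torch.nn as nn

class LSTMnet(nn.Module):
    def __init__(self, vocab_size=18280, embed_dim=300, hidden_size=64,
                 num_layers=3, num_classes=5, dropout=0.4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout,
                            bidirectional=True)
        self.fc = nn.Linear(2 * num_layers * hidden_size, num_classes)  # 384 -> 5

    def forward(self, text):                     # text: (N, LEN) token ids
        embedded = self.embedding(text)          # (N, LEN, E)
        _, (h_n, _) = self.lstm(embedded)        # h_n: (2 * num_layers, N, K)
        h_cat = h_n.transpose(0, 1).reshape(text.size(0), -1)  # (N, 384)
        return self.fc(h_cat)                    # (N, 5) class scores
```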
Besides, all models use the Adam optimizer, the CrossEntropyLoss function and a batch size of 64, and are trained for 40 epochs.
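An illustrative training loop under these settings (model, iterator and device names refer to the sketches above and are placeholders, not the notebook's exact code):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMnet().to(device)                   # or RNNnet / GRUnet
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(40):
    model.train()
    for batch in train_iter:                   # BucketIterator batches of size 64
        optimizer.zero_grad()
        logits = model(batch.text)             # (N, 5) class scores
        loss = criterion(logits, batch.label)  # labels encoded as class indices
        loss.backward()
        optimizer.step()
```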
3.2 Training Process and Evaluation
The training error and accuracy on train_set and valid_set during the training process are shown below.
It can be observed that all models start to overfit after about 10 epochs of training. The validation accuracy is relatively low because of the difficulty of this task and the limited capacity of the models. In fact, even today's SOTA models only achieve around 60% accuracy on this task according to the leaderboard.
The test accuracy and the total training time over 40 epochs are shown below.
| Model | Test Accuracy | Training Time (40 epochs) |
| --- | --- | --- |
| (vanilla) RNN | 33.30% | 399.8 s |
| LSTM | 35.11% | 410.7 s |
| GRU | 33.67% | 404.9 s |
It can be observed that both LSTM and GRU achieve higher test accuracy than the vanilla RNN, while both also cost more training time. These results are reasonable considering the more complicated structures of the LSTM and GRU cells.