
HW5 Report: SST-5 Sentiment Analysis


1. Data Description

In this homework, we use the Stanford Sentiment Treebank dataset with 5 labels (SST-5). The original dataset is in tree form; we use the pytreebank package to convert it into tabular form. Using torchtext (version 0.9.0 in the notebook), we can then load SST-5 and process it to obtain the usual dataloaders.
The labels of the dataset fall into five categories: ['positive': 1, 'negative': 2, 'neutral': 3, 'very positive': 4, 'very negative': 5]. The distributions of labels in the train_set, valid_set and test_set, as well as the distribution of sentence lengths, are shown below.
Distribution of Categories and Sentence Length
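As a rough sketch of the tree-to-table conversion step (assuming the pytreebank package; the CSV file names are illustrative and not taken from the notebook):

```python
# Sketch: flatten the tree-form SST dataset into (label, sentence) rows.
# Assumes the pytreebank package; the CSV file names are illustrative.
import csv
import pytreebank

dataset = pytreebank.load_sst()  # dict with "train", "dev" and "test" splits

for split in ("train", "dev", "test"):
    with open(f"sst5_{split}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "sentence"])
        for tree in dataset[split]:
            # Keep only the root label, i.e. the sentiment of the whole sentence.
            label, sentence = tree.to_labeled_lines()[0]
            writer.writerow([label, sentence])
```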

2. Models

2.1 RNN, LSTM and GRU Cells

RNN, LSTM and GRU cells all share the following structure. The initial hidden state h_0 (and, for the LSTM, the initial cell state c_0) is fed into the first hidden layer. As the inputs x_1, ..., x_T are fed in sequentially, we obtain the output sequence o_1, ..., o_T and the last hidden state h_T. The LSTM additionally produces a cell state c_t at each step and c_T at the end, which the RNN and GRU do not. More details are elaborated in the course notes.
Forward Process of Basic RNN, LSTM, GRU cell
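To illustrate the difference in return values, here is a minimal sketch (the tensor shapes and dummy inputs are made up for the example):

```python
# Sketch: nn.RNN / nn.GRU return (output, h_n), nn.LSTM returns (output, (h_n, c_n)).
import torch
import torch.nn as nn

x = torch.randn(4, 10, 300)              # (batch, sequence length, input_size)

rnn = nn.RNN(300, 64, batch_first=True)
output, h_n = rnn(x)                     # output: (4, 10, 64), h_n: (1, 4, 64)

lstm = nn.LSTM(300, 64, batch_first=True)
output, (h_n, c_n) = lstm(x)             # the LSTM also returns a cell state c_n
```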
In PyTorch's nn module, these basic cells are already implemented as nn.RNN, nn.GRU and nn.LSTM. Critical hyper-parameters include (a short usage sketch follows this list):
  • input_size: length of embedded vector
  • hidden_size: length of the layer’s output (i.e. the hidden state dimension at each time step)
  • num_layers: how many basic cells are stacked
  • bidirectional: whether to use the bidirectional structure or not.
    • The bidirectional structure is shown below. Compared to the single-directional structure, in which each state only receives hidden states from previous inputs, the bidirectional structure allows each state to receive hidden states from both previous and following inputs.
Structure of Bidirectional RNNs
  • dropout: dropout rate applied to each layer’s output, for regularization
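A minimal sketch of how these hyper-parameters combine, assuming a dummy batch (the shapes shown are only illustrative):

```python
# Sketch: a 3-layer bidirectional GRU with dropout between the stacked layers.
import torch
import torch.nn as nn

gru = nn.GRU(
    input_size=300,       # length of each embedded vector
    hidden_size=64,       # output length per direction
    num_layers=3,         # three stacked cells
    bidirectional=True,   # forward and backward passes over the sequence
    dropout=0.4,          # applied to each layer's output except the last
    batch_first=True,     # tensors shaped (batch, length, feature)
)

x = torch.randn(8, 20, 300)   # dummy batch of 8 sequences of length 20
output, h_n = gru(x)
print(output.shape)           # torch.Size([8, 20, 128]): hidden_size * 2 directions
print(h_n.shape)              # torch.Size([6, 8, 64]):   num_layers * 2 directions
```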

2.2 Network Architecture

The complete network architecture is the same for the RNN, LSTM and GRU models in this experiment, except for the basic cell used.
Network Architecture
where N is the batch size, K is the hidden size, E is the length of the embedded vectors, and LEN is the (padded) sentence length of the batch. (A note on torchtext.data.BucketIterator: simply speaking, it groups sentences of similar length together and pads them to a common length to form each batch.)
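One plausible reconstruction of this architecture in PyTorch is sketched below. The way the final hidden states of all layers and directions are concatenated is an assumption, chosen to match the in_features = 64 × 3 × 2 = 384 reported in the next section; it is not necessarily the exact notebook code.

```python
# Sketch of the LSTM variant of the network (RNN/GRU versions only swap the cell).
# The concatenation of final hidden states is an assumption made to match
# fc's in_features = 64 * 3 * 2 = 384 reported below.
import torch
import torch.nn as nn

class LSTMnet(nn.Module):
    def __init__(self, vocab_size=18280, embed_dim=300, hidden_size=64,
                 num_layers=3, num_classes=5, dropout=0.4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=num_layers,
                            batch_first=True, dropout=dropout,
                            bidirectional=True)
        self.fc = nn.Linear(hidden_size * num_layers * 2, num_classes)

    def forward(self, x):                     # x: (N, LEN) padded token indices
        emb = self.embedding(x)               # (N, LEN, E)
        _, (h_n, _) = self.lstm(emb)          # h_n: (num_layers * 2, N, K)
        h = h_n.permute(1, 0, 2).reshape(x.size(0), -1)   # (N, K * num_layers * 2)
        return self.fc(h)                     # (N, 5) class scores
```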

3. Experiment Results

3.1 Architecture and Training Settings

RNNnet(
  (embedding): Embedding(18280, 300)
  (rnn): RNN(300, 64, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (fc): Linear(in_features=384, out_features=5, bias=True)
)
# i.e. K=64, E=300 (N=64, LEN varies for each batch)

LSTMnet(
  (embedding): Embedding(18280, 300)
  (lstm): LSTM(300, 64, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (fc): Linear(in_features=384, out_features=5, bias=True)
)
# i.e. K=64, E=300 (N=64, LEN varies for each batch)

GRUnet(
  (embedding): Embedding(18280, 300)
  (gru): GRU(300, 64, num_layers=3, batch_first=True, dropout=0.4, bidirectional=True)
  (fc): Linear(in_features=384, out_features=5, bias=True)
)
# i.e. K=64, E=300 (N=64, LEN varies for each batch)
In addition, all models use the Adam optimizer and the CrossEntropyLoss function, with a batch size of 64, and are trained for 40 epochs.
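A minimal sketch of this training setup (the learning rate and device handling are assumptions not stated above; the batch attribute names follow torchtext's field defaults):

```python
# Sketch: training loop with Adam + CrossEntropyLoss for 40 epochs.
# The learning rate and device are assumptions; batches come from a BucketIterator.
import torch
import torch.nn as nn

def train(model, train_iter, epochs=40, lr=1e-3, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for batch in train_iter:
            text = batch.text.to(device)      # (N, LEN) padded token indices
            labels = batch.label.to(device)   # (N,) class ids
            optimizer.zero_grad()
            loss = criterion(model(text), labels)
            loss.backward()
            optimizer.step()
    return model
```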

3.2 Training Process and Evaluation

The training error and accuracy on the train_set and valid_set during the training process are shown below.
Training Error and Accuracy (0~30 epochs)
Training Error and Accuracy (0~15 epochs)
It can be observed that all models become over-fitted after about 10 epochs of training. The validation accuracy is relatively low because of the difficulty of this task and the limited capacity of the models. In fact, even the current SOTA models only achieve around 60% accuracy on this task according to the leaderboard.
The test accuracy and the training time for the full 40 epochs are shown below.
| Model | Test Accuracy | Training Time (40 epochs) |
| --- | --- | --- |
| (vanilla) RNN | 33.30% | 399.8 s |
| LSTM | 35.11% | 410.7 s |
| GRU | 33.67% | 404.9 s |
It can be observed that both LSTM and GRU have better capacity than the vanilla RNN, while both cost slightly more training time. The results are reasonable considering the more complicated gating structures of the LSTM and GRU cells.
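For reference, a minimal sketch of how the test accuracy above can be computed (assuming the model and a torchtext test iterator as defined earlier; the batch attribute names are again assumptions):

```python
# Sketch: compute classification accuracy over the test iterator.
import torch

@torch.no_grad()
def evaluate(model, test_iter, device="cuda"):
    model.eval()
    correct, total = 0, 0
    for batch in test_iter:
        text = batch.text.to(device)
        labels = batch.label.to(device)
        preds = model(text).argmax(dim=1)     # predicted class per sentence
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total
```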
