
L10. Recurrent Neural Networks


1. Simple RNNs

1.1 Feedback Connections

In some applications, such as speech recognition, we need to model a dynamic system. This cannot be modeled well by feedforward networks, which have no feedback connections. Hence we introduce recurrent networks.
notion image
With feedback connections, the state (and therefore outputs) of neurons will change over time.

1.2 RNNs in general

The states of the neurons in an RNN evolve over time:
$$\mathbf{s}(t+1) = f\big(\mathbf{s}(t), \mathbf{x}(t)\big)$$
  • $\mathbf{s}(t)$ denotes the states of all neurons
  • $\mathbf{x}(t)$ denotes the input to the network
  • $\mathbf{y}(t)$ denotes the output of the network
  • $f$ can be linear or nonlinear. If $f$ depends on $\mathbf{x}(t)$ explicitly, the system is non-autonomous; otherwise it is autonomous.
Often, the output neurons are separated from the above equation:
$$\mathbf{y}(t) = g\big(\mathbf{s}(t)\big)$$
where $g$ denotes the output function.
Such systems are termed (discrete) dynamic systems.
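As a concrete illustration, here is a minimal NumPy sketch of such a non-autonomous discrete dynamic system; the tanh state map, the linear read-out, and the names `step`, `output`, `W_s`, `W_x`, `W_y` are assumptions made for this example, not anything fixed by the notes.

```python
import numpy as np

def step(s, x, W_s, W_x):
    """State update s(t+1) = f(s(t), x(t)); here f is a tanh of a linear map."""
    return np.tanh(W_s @ s + W_x @ x)

def output(s, W_y):
    """Read-out y(t) = g(s(t)); here g is a linear output function."""
    return W_y @ s

rng = np.random.default_rng(0)
n_state, n_in, n_out = 4, 3, 2
W_s = rng.normal(scale=0.5, size=(n_state, n_state))
W_x = rng.normal(scale=0.5, size=(n_state, n_in))
W_y = rng.normal(scale=0.5, size=(n_out, n_state))

s = np.zeros(n_state)              # initial states of all neurons
for t in range(5):                 # iterate the system over time
    x = rng.normal(size=n_in)      # external input makes it non-autonomous
    s = step(s, x, W_s, W_x)
    print(t, output(s, W_y))
```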

1.3 Elman Network & Jordan Network

An overview of the history of RNNs is shown below.
notion image
The Elman network has the structure below.
notion image
The dynamic system is
$$\mathbf{h}(t) = \sigma_h\big(W_h\,\mathbf{x}(t) + U_h\,\mathbf{h}(t-1) + \mathbf{b}_h\big), \qquad \mathbf{y}(t) = \sigma_y\big(W_y\,\mathbf{h}(t) + \mathbf{b}_y\big)$$
  • $\mathbf{x}(t)$: input; $\mathbf{y}(t)$: output
  • $\mathbf{h}(t)$: hidden state
  • $\sigma_h$ and $\sigma_y$: activation functions
  • $W_h$, $U_h$, $W_y$, $\mathbf{b}_h$, $\mathbf{b}_y$: learnable parameters
The Jordan network has the structure below.
notion image
The dynamic system is
$$\mathbf{h}(t) = \sigma_h\big(W_h\,\mathbf{x}(t) + U_h\,\mathbf{y}(t-1) + \mathbf{b}_h\big), \qquad \mathbf{y}(t) = \sigma_y\big(W_y\,\mathbf{h}(t) + \mathbf{b}_y\big)$$
  • $\mathbf{x}(t)$: input; $\mathbf{y}(t)$: output
  • $\mathbf{h}(t)$: hidden state
  • $\sigma_h$ and $\sigma_y$: activation functions
  • $W_h$, $U_h$, $W_y$, $\mathbf{b}_h$, $\mathbf{b}_y$: learnable parameters
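The difference between the two feedback schemes is easiest to see in code. Below is a minimal NumPy sketch of one forward step of each network, assuming tanh and sigmoid activations; the parameter names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elman_step(x, h_prev, W_h, U_h, b_h, W_y, b_y):
    """Elman network: the *hidden state* h(t-1) is fed back."""
    h = np.tanh(W_h @ x + U_h @ h_prev + b_h)
    y = sigmoid(W_y @ h + b_y)
    return h, y

def jordan_step(x, y_prev, W_h, U_h, b_h, W_y, b_y):
    """Jordan network: the *previous output* y(t-1) is fed back instead."""
    h = np.tanh(W_h @ x + U_h @ y_prev + b_h)
    y = sigmoid(W_y @ h + b_y)
    return h, y

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
xs = rng.normal(size=(4, n_in))                  # a length-4 input sequence

# Elman parameters: hidden-to-hidden feedback, so U_h is (n_hid, n_hid).
W_h, U_h = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
b_h, W_y, b_y = np.zeros(n_hid), rng.normal(size=(n_out, n_hid)), np.zeros(n_out)
h = np.zeros(n_hid)
for x in xs:
    h, y = elman_step(x, h, W_h, U_h, b_h, W_y, b_y)

# Jordan parameters: output-to-hidden feedback, so U_j is (n_hid, n_out).
U_j = rng.normal(size=(n_hid, n_out))
y = np.zeros(n_out)
for x in xs:
    _, y = jordan_step(x, y, W_h, U_j, b_h, W_y, b_y)
```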

1.4 Back-Propagation through time (BPTT)

Unfold the temporal operation of the network into a layered feedforward network, the topology of which grows by one layer at every time step.
Consider a linear system without input, shown below left:
notion image
notion image
Unfolding it through time gives the structure shown above right:
  • Each layer has the same number of neurons
  • All layers use shared weights (see the sketch below)
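The equivalence between looping and unfolding can be checked numerically. The sketch below assumes the linear system is $\mathbf{x}(t+1) = W\,\mathbf{x}(t)$; the matrix size, values, and depth are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.4, size=(3, 3))   # the shared weight matrix
x0 = rng.normal(size=3)                  # initial state
T = 4                                    # number of time steps / layers

# Recurrent view: iterate the feedback loop T times.
x = x0.copy()
for _ in range(T):
    x = W @ x

# Unfolded view: a depth-T feedforward network whose layers all share W.
x_unfolded = np.linalg.matrix_power(W, T) @ x0

print(np.allclose(x, x_unfolded))        # True: the two views coincide
```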
1.4.1 Unfold the Elman network
notion image
There are four cases when unfolding the Elman network:
  1. As shown below, the input $\mathbf{x}$ is only present at the first step and the label is only present at the last step.
    notion image
  2. As shown below, the input $\mathbf{x}$ is fixed but present at all steps, while the label is only present at the last step. One application example of such a case is image classification. (Arrows in the same color share weights.)
    notion image
  3. As shown below, the input $\mathbf{x}(t)$ is time-varying, while the label is only present at the last step. One application example of such a case is sentence classification. (Arrows in the same color share weights.)
    notion image
      The input $\mathbf{x}(t)$ can be viewed as layer 0 attached to the orange backbone; arrows in the same color share weights.
  4. As shown below, the input $\mathbf{x}(t)$ is time-varying and the label is present at all steps (arrows in the same color share weights). One application example of such a case is speech recognition.
    notion image
Besides, if the input is only present at the first step while the label is time-varying, the setting can be applied to β€œfigure-to-sentence” tasks. A small sketch contrasting cases 3 and 4 follows.
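To make the difference between a single terminal label and per-step labels concrete, here is a small NumPy sketch with a toy unrolled Elman network and squared-error losses; the data, targets, and parameter names are all assumptions for illustration.

```python
import numpy as np

def unrolled_elman(xs, W, U, V, h0):
    """Unrolled Elman forward pass: returns the output at every time step."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        ys.append(V @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 4, 3, 5, 2
xs = rng.normal(size=(T, n_in))
W, U, V = (rng.normal(size=s) for s in [(n_hid, n_in), (n_hid, n_hid), (n_out, n_hid)])
ys = unrolled_elman(xs, W, U, V, np.zeros(n_hid))

# Case 3 (e.g. sentence classification): only the last output enters the loss.
loss_last = np.sum((ys[-1] - np.ones(n_out)) ** 2)
# Case 4 (e.g. speech recognition): a label is present at every time step.
loss_all = np.sum((ys - np.ones((T, n_out))) ** 2)
print(loss_last, loss_all)
```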
1.4.2 Simplified illustration of Elman network
Use circles to represent vectors (one circle per layer; the time step is written as a superscript).
notion image
The forward propagation runtime is $O(\tau)$ for an input sequence of length $\tau$ and cannot be reduced by parallelization because of the sequential dependence between time steps.
1.4.3 Unfold the Jordan network
notion image
If the loss is based on comparing the output $\mathbf{y}(t)$ with the label $\mathbf{d}(t)$ at every step, and the label is fed back in place of the network's output (teacher forcing, introduced below), all time steps are decoupled and training can be parallelized (testing cannot be parallelized).
1.4.4 Teacher forcing
Some networks, such as the Jordan network, have connections from the output at one time step to values computed in the next time step. During training, instead of feeding the outcome of the dynamic system ($\mathbf{y}(t-1)$), we can directly feed the label $\mathbf{d}(t-1)$. This method is called teacher forcing, and it allows training to be parallelized.
notion image
However, at test time there is no reference signal, and we have to use the network's own output from time $t-1$. The kind of inputs that the network sees during training could then be quite different from the kind of inputs that it sees at test time (exposure bias), which results in error accumulation.
To mitigate this problem, we can
  • Alternately use teacher-forced inputs and free-running inputs for a number of time steps.
  • Randomly choose between the teacher-forced input and the free-running input at every time step (see the sketch below).
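A minimal sketch of the second strategy (randomly mixing teacher-forced and free-running feedback, in the spirit of scheduled sampling) on a toy Jordan-style recurrence; `p_teacher`, the parameter shapes, and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
W_h = rng.normal(size=(n_hid, n_in))
U_h = rng.normal(size=(n_hid, n_out))
W_y = rng.normal(size=(n_out, n_hid))

xs = rng.normal(size=(6, n_in))        # toy input sequence
labels = rng.normal(size=(6, n_out))   # toy target sequence
p_teacher = 0.5                        # probability of feeding the label back

y_prev = np.zeros(n_out)
for t, x in enumerate(xs):
    # Pick the feedback signal for this step: label (teacher-forced) or own output.
    if t > 0 and rng.random() < p_teacher:
        feedback = labels[t - 1]       # teacher-forced input
    else:
        feedback = y_prev              # free-running input
    h = np.tanh(W_h @ x + U_h @ feedback)
    y_prev = W_y @ h                   # network output, candidate feedback for t+1
```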

1.5 Bidirectional RNN & Deep RNN

In many applications, the prediction at the current time step may depend on the whole input sequence (both the past and the future). Bidirectional RNNs combine an RNN that moves forward through time with another RNN that moves backward through time. The output of the entire network at every time step then receives two inputs. One example is shown below.
notion image
notion image
The output receives input from both sub-RNNs, via the hidden state of the forward pass and that of the backward pass.
Structure of Bidirectional RNN
Deep RNNs Structures
There are many ways to construct deep RNNs, as shown above right.
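A minimal NumPy sketch of the bidirectional idea: one simple RNN runs forward in time, a second runs backward, and each output is read out from the concatenation of the two hidden states at that step. All weights, sizes, and the concatenation read-out are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W, U, h0):
    """Run a simple tanh RNN over a sequence and return all hidden states."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, n_in, n_hid, n_out = 5, 3, 4, 2
xs = rng.normal(size=(T, n_in))
Wf, Uf = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
Wb, Ub = rng.normal(size=(n_hid, n_in)), rng.normal(size=(n_hid, n_hid))
V = rng.normal(size=(n_out, 2 * n_hid))

h_fwd = run_rnn(xs, Wf, Uf, np.zeros(n_hid))               # past -> future
h_bwd = run_rnn(xs[::-1], Wb, Ub, np.zeros(n_hid))[::-1]   # future -> past, realigned
ys = np.concatenate([h_fwd, h_bwd], axis=1) @ V.T          # each y(t) sees both
print(ys.shape)                                            # (T, n_out)
```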

2. Gated RNNs

2.1 Challenges to the Elman network

Consider the 1D Elman network
$$h(t) = f\big(w\,h(t-1) + v\,x(t)\big)$$
Suppose $f$ is an identity mapping and $x(t) = x$ is a constant:
  • If $|w| > 1$, $h(t)$ will approach infinity
  • If $|w| < 1$, $h(t)$ will converge to a fixed point
notion image
Even for time-varying $x(t)$, when $|w| > 1$, $h(t)$ will also approach infinity.
Let $f$ be the identity mapping; after $t$ steps from a zero initial state,
$$h(t) = \sum_{k=1}^{t} w^{\,t-k}\, v\, x(k)$$
If $|w| < 1$, the contribution of $x(k)$ to $h(t)$ decays exponentially as $t-k$ increases β‡’ $h(t)$ is mainly determined by the recent inputs; the short-term memory is too short.
notion image
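This behaviour is easy to reproduce numerically. The sketch below assumes the 1D recurrence $h(t) = w\,h(t-1) + v\,x(t)$ with identity activation; the particular values of $w$, $v$, and the sequence length are arbitrary.

```python
import numpy as np

def rollout(w, v, xs):
    """1D Elman network with identity activation: h(t) = w*h(t-1) + v*x(t)."""
    h, hist = 0.0, []
    for x in xs:
        h = w * h + v * x
        hist.append(h)
    return np.array(hist)

xs = np.ones(50)                       # constant input x(t) = 1, with v = 1
print(rollout(1.2, 1.0, xs)[-1])       # |w| > 1: diverges (~4.5e4 after 50 steps)
print(rollout(0.5, 1.0, xs)[-1])       # |w| < 1: converges to v*x/(1-w) = 2

# The contribution of x(k) to h(t) is w**(t-k) * v * x(k); for |w| < 1 it decays
# exponentially in t-k, so h(t) is dominated by the most recent inputs.
print(0.5 ** np.arange(0, 50, 10))     # the weights w**(t-k) shrink very fast
```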

2.2 Long short-term memory (LSTM) cell

It can be viewed as a combination of the Jordan network and the Elman network.
  • The output is connected to the input
  • A self-loop is used to capture the information about the past
From Jordan network to LSTM cell
notion image
Redraw the Jordan network:
  • Use circles to denote operations
  • Variables are indicated on arrows
  • Bias is ignored.
Step 1: Add a self-loop
notion image
The activation $f$ is either the logistic sigmoid function or the tanh function; the output activation $g$ is often the tanh function.
Step 2: Add three gates
notion image
Gates are introduced to adaptively control the flow of information. All gates are determined by the input $\mathbf{x}(t)$ and the previous output $\mathbf{y}(t-1)$:
$$\mathbf{i}(t) = \sigma\big(W_i\,\mathbf{x}(t) + U_i\,\mathbf{y}(t-1)\big),\quad \mathbf{f}(t) = \sigma\big(W_f\,\mathbf{x}(t) + U_f\,\mathbf{y}(t-1)\big),\quad \mathbf{o}(t) = \sigma\big(W_o\,\mathbf{x}(t) + U_o\,\mathbf{y}(t-1)\big),$$
where $\sigma$ is the logistic sigmoid function. Sometimes, they are also determined by the cell states $\mathbf{c}(t-1)$ and $\mathbf{c}(t)$ (peephole connections).
There is an ideal case for keeping the memory obtained at time $t_0$ forever, that is, the forget gate $\mathbf{f}(t) = \mathbf{1}$ and the input gate $\mathbf{i}(t) = \mathbf{0}$ for all $t > t_0$.
Advantage of LSTM
The gates enable the model to keep the memory for a longer time than simple RNNs can. The idea of gating is also used in a broader sense: attention.
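A minimal NumPy sketch of one LSTM step consistent with the description above (three sigmoid gates driven by $\mathbf{x}(t)$ and $\mathbf{y}(t-1)$, a self-looping cell, biases omitted as in the diagrams); the parameter names, tanh choices, and sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, y_prev, c_prev, P):
    """One LSTM step: gated self-loop on the cell c and a gated tanh read-out."""
    i = sigmoid(P["Wi"] @ x + P["Ui"] @ y_prev)        # input gate
    f = sigmoid(P["Wf"] @ x + P["Uf"] @ y_prev)        # forget gate
    o = sigmoid(P["Wo"] @ x + P["Uo"] @ y_prev)        # output gate
    c_tilde = np.tanh(P["Wc"] @ x + P["Uc"] @ y_prev)  # candidate memory
    c = f * c_prev + i * c_tilde                       # self-loop on the cell state
    y = o * np.tanh(c)                                 # gated output
    return y, c

rng = np.random.default_rng(0)
n_in, n_cell = 3, 4
P = {k: rng.normal(scale=0.5, size=(n_cell, n_in if k.startswith("W") else n_cell))
     for k in ["Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc"]}

y, c = np.zeros(n_cell), np.zeros(n_cell)
for x in rng.normal(size=(5, n_in)):                   # a toy length-5 sequence
    y, c = lstm_step(x, y, c, P)
```

Setting the forget gate to all ones and the input gate to all zeros in `lstm_step` reproduces the ideal memory-keeping case mentioned above.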

2.3 Gated recurrent unit (GRU) cell

In the Elman network, the hidden units are used to capture the history information:
$$\mathbf{h}(t) = f\big(W\,\mathbf{x}(t) + U\,\mathbf{h}(t-1)\big)$$
In an LSTM cell without gates, a new vector $\mathbf{c}(t)$ is introduced for this purpose:
$$\mathbf{c}(t) = \mathbf{c}(t-1) + f\big(W\,\mathbf{x}(t) + U\,\mathbf{y}(t-1)\big)$$
We can actually use $\mathbf{h}(t)$ directly, which is the 1st idea of GRU:
$$\mathbf{h}(t) = \mathbf{z}(t)\odot\mathbf{h}(t-1) + \big(1-\mathbf{z}(t)\big)\odot\tilde{\mathbf{h}}(t)$$
notion image
where $\mathbf{z}(t)$ is an update gate and $\tilde{\mathbf{h}}(t)$ is a candidate hidden state defined below.
The 2nd idea of GRU: Let the output be equal to the hidden state: $\mathbf{y}(t) = \mathbf{h}(t)$.
The 3rd idea of GRU: Use a gate to modulate the recurrent input:
$$\tilde{\mathbf{h}}(t) = f\big(W\,\mathbf{x}(t) + U\big(\mathbf{r}(t)\odot\mathbf{h}(t-1)\big)\big)$$
The gates depend on the input and the hidden state:
  • Update gate $\mathbf{z}(t) = \sigma\big(W_z\,\mathbf{x}(t) + U_z\,\mathbf{h}(t-1)\big)$
  • Reset gate $\mathbf{r}(t) = \sigma\big(W_r\,\mathbf{x}(t) + U_r\,\mathbf{h}(t-1)\big)$, where $\sigma$ is the logistic sigmoid function
The complete dynamic equations of the GRU are then
$$\begin{aligned}
\mathbf{z}(t) &= \sigma\big(W_z\,\mathbf{x}(t) + U_z\,\mathbf{h}(t-1)\big)\\
\mathbf{r}(t) &= \sigma\big(W_r\,\mathbf{x}(t) + U_r\,\mathbf{h}(t-1)\big)\\
\tilde{\mathbf{h}}(t) &= f\big(W\,\mathbf{x}(t) + U\big(\mathbf{r}(t)\odot\mathbf{h}(t-1)\big)\big)\\
\mathbf{h}(t) &= \mathbf{z}(t)\odot\mathbf{h}(t-1) + \big(1-\mathbf{z}(t)\big)\odot\tilde{\mathbf{h}}(t)\\
\mathbf{y}(t) &= \mathbf{h}(t)
\end{aligned}$$
where $f$ is either the logistic sigmoid function or the tanh function.
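A minimal NumPy sketch of one GRU step matching the complete equations above, with $f = \tanh$ and biases omitted; the parameter names and sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, P):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev)            # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev)            # reset gate
    h_tilde = np.tanh(P["W"] @ x + P["U"] @ (r * h_prev))  # modulated recurrent input
    h = z * h_prev + (1.0 - z) * h_tilde                   # interpolate old and new
    return h                                               # y(t) = h(t)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
P = {k: rng.normal(scale=0.5, size=(n_hid, n_in if k.startswith("W") else n_hid))
     for k in ["Wz", "Uz", "Wr", "Ur", "W", "U"]}

h = np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):                       # a toy length-5 sequence
    h = gru_step(x, h, P)
```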

Appendix: RNN, LSTM and GRU on SST-5

πŸ’‘
HW5 Report: SST-5 Sentiment Analysis
