TOC
1. Simple RNNs
  1.1 Feedback Connections
  1.2 RNNs in general
  1.3 Elman Network & Jordan Network
  1.4 Back-Propagation Through Time (BPTT)
  1.5 Bidirectional RNN & Deep RNN
2. Gated RNNs
  2.1 Challenges to the Elman network
  2.2 Long short-term memory (LSTM) cell
  2.3 Gated recurrent unit (GRU) cell
Appendix: RNN, LSTM and GRU on SST-5
1. Simple RNNs
1.1 Feedback Connections
In some applications, such as speech recognition, we need to model a dynamic system. This cannot be done well with feed-forward networks, which have no feedback connections. Hence we introduce recurrent networks.
With feedback connections, the state (and therefore outputs) of neurons will change over time.
1.2 RNNs in general
The states of the neurons in an RNN evolve over time:
$\mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{x}_t)$
- $\mathbf{s}_t$ denotes the states of all neurons
- $\mathbf{x}_t$ denotes the input to the network
- $\mathbf{y}_t$ denotes the output of the network
- $f$ can be linear or nonlinear. If $f$ depends on $\mathbf{x}_t$ explicitly, the system is non-autonomous; otherwise it is autonomous.
Often, the output neurons are separated from the above equation:
$\mathbf{y}_t = g(\mathbf{s}_t)$
where $g$ denotes the output function.
Such systems are termed (discrete) dynamic systems.
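To make the abstraction concrete, here is a minimal NumPy sketch of such a discrete dynamic system; the particular state-transition function `f`, output function `g`, and the dimensions are illustrative assumptions, not tied to any specific network discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions).
state_dim, input_dim, output_dim = 4, 3, 2

W_s = rng.normal(scale=0.5, size=(state_dim, state_dim))
W_x = rng.normal(scale=0.5, size=(state_dim, input_dim))
W_y = rng.normal(scale=0.5, size=(output_dim, state_dim))

def f(s, x):
    """State transition s_{t+1} = f(s_t, x_t); non-autonomous because it uses x."""
    return np.tanh(W_s @ s + W_x @ x)

def g(s):
    """Output function y_t = g(s_t)."""
    return W_y @ s

s = np.zeros(state_dim)                # initial state
xs = rng.normal(size=(5, input_dim))   # a length-5 input sequence

for t, x in enumerate(xs):
    s = f(s, x)                        # the state evolves over time
    y = g(s)                           # output read from the current state
    print(f"t={t}, y={y}")
```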
1.3 Elman Network & Jordan Network
An overview of the history of RNNs is shown below.
The Elman network has the structure shown below.
The dynamic system is
$\mathbf{h}_t = \sigma_h(W_{xh}\mathbf{x}_t + W_{hh}\mathbf{h}_{t-1} + \mathbf{b}_h)$
$\mathbf{y}_t = \sigma_y(W_{hy}\mathbf{h}_t + \mathbf{b}_y)$
- $\mathbf{x}_t$: input; $\mathbf{y}_t$: output
- $\mathbf{h}_t$: hidden state
- $\sigma_h$ and $\sigma_y$: activation functions
- $W_{xh}$, $W_{hh}$, $W_{hy}$, $\mathbf{b}_h$, $\mathbf{b}_y$: learnable parameters
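A minimal sketch of the Elman forward pass over a short sequence, assuming a tanh hidden activation, an identity output activation, and small illustrative dimensions (all parameter names are my own, not taken from the original figure):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2

# Learnable parameters (randomly initialized for illustration).
W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.3, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def elman_step(x, h_prev):
    # The hidden state is fed back: h_t depends on h_{t-1}.
    h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)
    y = W_hy @ h + b_y        # sigma_y taken as identity here
    return h, y

h = np.zeros(hidden_dim)
for x in rng.normal(size=(4, input_dim)):   # a length-4 input sequence
    h, y = elman_step(x, h)
```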
The Jordan network has the structure shown below.
The dynamic system is
$\mathbf{h}_t = \sigma_h(W_{xh}\mathbf{x}_t + W_{yh}\mathbf{y}_{t-1} + \mathbf{b}_h)$
$\mathbf{y}_t = \sigma_y(W_{hy}\mathbf{h}_t + \mathbf{b}_y)$
- $\mathbf{x}_t$: input; $\mathbf{y}_t$: output
- $\mathbf{h}_t$: hidden state
- $\sigma_h$ and $\sigma_y$: activation functions
- $W_{xh}$, $W_{yh}$, $W_{hy}$, $\mathbf{b}_h$, $\mathbf{b}_y$: learnable parameters
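The Jordan recurrence differs from the Elman one only in what is fed back. Here is a sketch of the single changed step, reusing the shapes of the Elman sketch above; the weights and activations are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
input_dim, hidden_dim, output_dim = 3, 5, 2

W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))
W_yh = rng.normal(scale=0.3, size=(hidden_dim, output_dim))  # feedback from the output
W_hy = rng.normal(scale=0.3, size=(output_dim, hidden_dim))

def jordan_step(x, y_prev):
    # The previous *output* (not the hidden state) is fed back.
    h = np.tanh(W_xh @ x + W_yh @ y_prev)
    return W_hy @ h

y = np.zeros(output_dim)
for x in rng.normal(size=(4, input_dim)):
    y = jordan_step(x, y)
```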
1.4 Back-Propagation Through Time (BPTT)
Unfold the temporal operation of the network into a layered feedforward network, whose topology grows by one layer at every time step.
Consider a linear system without input, as shown below on the left:
Unfolding it through time gives the structure shown on the right:
- The number of neurons in each layer is the same
- The weights are shared across all layers
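As a sanity check on the unfolding picture, here is a small sketch under my own assumed setup of a scalar linear system $h_t = w\,h_{t-1}$: back-propagating through the unrolled layers and summing the contributions to the shared weight reproduces the analytic derivative $\partial h_T / \partial w$.

```python
# BPTT on a tiny linear system h_t = w * h_{t-1} (no input), unrolled for T steps.
T = 5
w = 0.9
h0 = 1.0

# Forward pass: store the states of the unrolled "layers".
hs = [h0]
for _ in range(T):
    hs.append(w * hs[-1])

# Backward pass: d h_T / d w accumulated over the shared weight.
grad_w = 0.0
upstream = 1.0                      # d h_T / d h_T
for t in range(T, 0, -1):
    grad_w += upstream * hs[t - 1]  # local derivative of h_t = w * h_{t-1} w.r.t. w
    upstream *= w                   # propagate d h_T / d h_{t-1}

analytic = T * w ** (T - 1) * h0    # since h_T = w**T * h0
print(grad_w, analytic)             # the two agree
```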
1.4.1 Unfold the Elman network
There are four cases when unfolding the Elman network:
- As shown below, the input $\mathbf{x}$ is only present at the first step and the label $\hat{\mathbf{y}}$ is only present at the last step.
- As shown below, the input $\mathbf{x}$ is fixed but present at all steps, while the label $\hat{\mathbf{y}}$ is only present at the last step. One example application of this case is image classification. (Arrows in the same color share weights.)
- As shown below, the input $\mathbf{x}_t$ is time-varying, while the label $\hat{\mathbf{y}}$ is only present at the last step. One example application of this case is sentence classification. (Arrows in the same color share weights.)
Here $\mathbf{x}_t$ can be viewed as a layer 0 attached to the orange backbone; arrows in the same color share weights.
- As shown below, the input $\mathbf{x}_t$ is time-varying and the label $\hat{\mathbf{y}}_t$ is present at all steps (arrows in the same color share weights). One example application of such a case is speech recognition (the difference between this loss and the last-step loss is sketched after this list).
Besides, if the input $\mathbf{x}$ is only present at the first step while the label $\hat{\mathbf{y}}_t$ is time-varying, the network can be applied to "figure-to-sentence" tasks.
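To illustrate the difference between the last two cases (label only at the last step vs. label at every step), here is a sketch of the two loss computations on top of the Elman recurrence; the squared-error loss, random data, and dimensions are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
T, input_dim, hidden_dim, output_dim = 4, 3, 5, 2

W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.3, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.3, size=(output_dim, hidden_dim))

xs = rng.normal(size=(T, input_dim))          # time-varying input x_1..x_T
labels = rng.normal(size=(T, output_dim))     # labels (fully used only in the last case)

h = np.zeros(hidden_dim)
ys = []
for x in xs:                                  # shared weights at every step
    h = np.tanh(W_xh @ x + W_hh @ h)
    ys.append(W_hy @ h)
ys = np.array(ys)

# Label only at the last step (e.g. sentence classification).
loss_last = np.sum((ys[-1] - labels[-1]) ** 2)

# Label at every step (e.g. speech recognition).
loss_all = np.sum((ys - labels) ** 2)
```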
1.4.2 Simplified illustration of Elman network
Use circles to represent vectors (one circle per layer; the time step is written as a superscript).
The forward propagation runtime is $O(\tau)$ for a sequence of length $\tau$ and cannot be reduced by parallelization, because of the sequential dependence across time steps.
1.4.3 Unfold the Jordan network
If the loss is based on comparing the output $\mathbf{y}_t$ with the label $\hat{\mathbf{y}}_t$ at each step, all time steps are decoupled and training can be parallelized (testing cannot be parallelized).
1.4.4 Teacher forcing
Some networks, such as the Jordan network, have connections from the output at one time step to values computed in the next time step. During training, instead of feeding the output of the dynamic system ($\mathbf{y}_{t-1}$), we can directly feed the label $\hat{\mathbf{y}}_{t-1}$. This method is called teacher forcing, and it allows training to be parallelized.
However, in testing there is no reference signal, and we have to use the network's output at time $t-1$. The kind of inputs that the network sees during training could therefore be quite different from the kind of inputs that it sees at test time (exposure bias), which results in error accumulation.
To mitigate this problem, we can
- Alternate between teacher-forced inputs and free-running inputs for a number of time steps.
- Randomly choose between the teacher-forced input and the free-running input at every time step (sketched below).
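A sketch of the second mitigation (randomly mixing teacher-forced and free-running inputs) on top of a Jordan-style step; the mixing probability, the step function, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
T, input_dim, hidden_dim, output_dim = 6, 3, 5, 2
teacher_forcing_prob = 0.5          # probability of feeding the label instead of the model output

W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))
W_yh = rng.normal(scale=0.3, size=(hidden_dim, output_dim))
W_hy = rng.normal(scale=0.3, size=(output_dim, hidden_dim))

xs = rng.normal(size=(T, input_dim))       # inputs x_1..x_T
labels = rng.normal(size=(T, output_dim))  # reference outputs (labels)

y_feedback = np.zeros(output_dim)
for t in range(T):
    # Jordan-style step: the fed-back "previous output" drives the hidden state.
    h = np.tanh(W_xh @ xs[t] + W_yh @ y_feedback)
    y = W_hy @ h

    if rng.random() < teacher_forcing_prob:
        y_feedback = labels[t]     # teacher-forced input for the next step
    else:
        y_feedback = y             # free-running input for the next step
```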
1.5 Bidirectional RNN & Deep RNN
In many applications, the prediction at the current time step may depend on the whole input sequence (both the past and the future). Bidirectional RNNs combine an RNN that moves forward through time with another RNN that moves backward through time. The output of the entire network at every time step then receives two inputs. One example is shown below.
The output $\mathbf{y}_t$ receives input from both sub-RNNs, via the forward hidden state $\mathbf{h}_t$ and the backward hidden state $\mathbf{g}_t$.
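A minimal sketch of this combination, assuming simple tanh recurrences and an output that just concatenates the two hidden states; the parameter names and dimensions are my own.

```python
import numpy as np

rng = np.random.default_rng(4)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2

def make_rnn_params():
    return (rng.normal(scale=0.3, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.3, size=(hidden_dim, hidden_dim)))

W_xh_f, W_hh_f = make_rnn_params()     # forward-in-time sub-RNN
W_xh_b, W_hh_b = make_rnn_params()     # backward-in-time sub-RNN
W_hy = rng.normal(scale=0.3, size=(output_dim, 2 * hidden_dim))

xs = rng.normal(size=(T, input_dim))

# Forward sub-RNN: t = 1 .. T
h = np.zeros(hidden_dim)
hs_forward = []
for x in xs:
    h = np.tanh(W_xh_f @ x + W_hh_f @ h)
    hs_forward.append(h)

# Backward sub-RNN: t = T .. 1
g = np.zeros(hidden_dim)
hs_backward = [None] * T
for t in reversed(range(T)):
    g = np.tanh(W_xh_b @ xs[t] + W_hh_b @ g)
    hs_backward[t] = g

# The output at every step sees both the past (via h_t) and the future (via g_t).
ys = [W_hy @ np.concatenate([hs_forward[t], hs_backward[t]]) for t in range(T)]
```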
There are many ways to construct deep RNNs, as shown above on the right.
2. Gated RNNs
2.1 Challenges to the Elman network
Consider the 1D Elman network
$h_t = \sigma_h(w h_{t-1} + u x_t)$
Suppose $\sigma_h$ is an identity mapping and $x_t = x$ is a constant
- If $|w| > 1$, $h_t$ will approach infinity
- If $|w| < 1$, $h_t$ will converge to a fixed point
Even for a time-varying $x_t$, when $|w| > 1$, $h_t$ will also approach infinity.
Let $\sigma_h$ be the identity mapping; after $t$ steps from $h_0 = 0$,
$h_t = \sum_{k=1}^{t} w^{t-k} u x_k$
If $|w| < 1$, the contribution of $x_k$ to $h_t$ decays exponentially as $t-k$ increases, so $h_t$ is mainly determined by recent inputs: the short-term memory is too short.
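A quick numeric sketch of the two regimes, using arbitrary example values $w = 1.2$ and $w = 0.8$ with the identity activation and a constant input:

```python
# 1D Elman network h_t = w * h_{t-1} + u * x_t with identity activation.
def run(w, u=1.0, x=1.0, steps=30):
    h = 0.0
    for _ in range(steps):
        h = w * h + u * x          # constant input x
    return h

print(run(w=1.2))   # |w| > 1: h_t blows up (grows without bound as steps increase)
print(run(w=0.8))   # |w| < 1: h_t converges to the fixed point u*x / (1 - w) = 5.0
```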
2.2 Long short-term memory (LSTM) cell
It can be viewed as a combination of the Jordan network and the Elman network.
- The output is connected to the input
- A self-loop is used to capture the information about the past
From Jordan network to LSTM cell
Redraw the Jordan network:
- Use circles to denote operations
- Variables are indicated on arrows
- Bias is ignored.
Step 1: Add a self-loop
$\sigma_h$ is either the logistic sigmoid function or the tanh function; $\sigma_y$ is often the tanh function.
Step 2: Add three gates
Gates are introduced to adaptively control the flow of information. All gates are determined by the input $\mathbf{x}_t$ and the previous output $\mathbf{y}_{t-1}$:
$\mathbf{i}_t = \sigma(W_i \mathbf{x}_t + U_i \mathbf{y}_{t-1})$ (input gate)
$\mathbf{f}_t = \sigma(W_f \mathbf{x}_t + U_f \mathbf{y}_{t-1})$ (forget gate)
$\mathbf{o}_t = \sigma(W_o \mathbf{x}_t + U_o \mathbf{y}_{t-1})$ (output gate)
where $\sigma$ is the logistic sigmoid function. Sometimes, they are also determined by the cell states $\mathbf{c}_{t-1}$ and $\mathbf{c}_t$ (peephole connections).
There is an ideal case for keeping the memory obtained at time $t_0$ forever, that is, $\mathbf{f}_t = 1$ and $\mathbf{i}_t = 0$ for $t > t_0$.
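Putting the self-loop and the three gates together, here is a sketch of one LSTM cell step in its standard form; the exact wiring in the original figures may differ (for instance, whether the gates see the previous output or a separate hidden state), so the parameter names and the use of $\mathbf{y}_{t-1}$ as the recurrent signal are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
input_dim, cell_dim = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One (input, feedback) weight pair per gate and for the cell candidate.
def make_weights():
    return (rng.normal(scale=0.3, size=(cell_dim, input_dim)),
            rng.normal(scale=0.3, size=(cell_dim, cell_dim)))

(W_i, U_i), (W_f, U_f), (W_o, U_o), (W_c, U_c) = (make_weights() for _ in range(4))

def lstm_step(x, y_prev, c_prev):
    i = sigmoid(W_i @ x + U_i @ y_prev)          # input gate
    f = sigmoid(W_f @ x + U_f @ y_prev)          # forget gate
    o = sigmoid(W_o @ x + U_o @ y_prev)          # output gate
    c_tilde = np.tanh(W_c @ x + U_c @ y_prev)    # candidate memory (sigma_h)
    c = f * c_prev + i * c_tilde                 # gated self-loop on the cell state
    y = o * np.tanh(c)                           # gated output (sigma_y)
    return y, c

y = np.zeros(cell_dim)
c = np.zeros(cell_dim)
for x in rng.normal(size=(5, input_dim)):
    y, c = lstm_step(x, y, c)
```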
Advantage of LSTM
The gates enable the model to keep the memory for a longer time than simple RNNs can. The idea of gating is also used in a broader sense, for example in attention.
2.3 Gated recurrent unit (GRU) cell
In the Elman network, the hidden units are used to capture the history information:
$\mathbf{h}_t = \sigma_h(W \mathbf{x}_t + U \mathbf{h}_{t-1})$
In an LSTM cell without gates, a new vector $\mathbf{c}_t$ is introduced for this purpose:
$\mathbf{c}_t = \mathbf{c}_{t-1} + \sigma_h(W \mathbf{x}_t + U \mathbf{y}_{t-1}), \quad \mathbf{y}_t = \sigma_y(\mathbf{c}_t)$
We can actually use $\mathbf{h}_t$ directly, which is the 1st idea of GRU:
$\mathbf{h}_t = \mathbf{h}_{t-1} + \tilde{\mathbf{h}}_t$
where $\tilde{\mathbf{h}}_t = \sigma_h(W \mathbf{x}_t + U \mathbf{y}_{t-1})$ and $\mathbf{y}_t = \sigma_y(\mathbf{h}_t)$.
The 2nd idea of GRU: let the output be equal to the hidden state, $\mathbf{y}_t = \mathbf{h}_t$.
The 3rd idea of GRU: use a gate to modulate the recurrent input.
The gates depend on the input $\mathbf{x}_t$ and the hidden state $\mathbf{h}_{t-1}$:
- Update gate: $\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1})$
- Reset gate: $\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1})$, where $\sigma$ is the logistic sigmoid function.
The complete dynamic equations of the GRU are then
$\mathbf{z}_t = \sigma(W_z \mathbf{x}_t + U_z \mathbf{h}_{t-1})$
$\mathbf{r}_t = \sigma(W_r \mathbf{x}_t + U_r \mathbf{h}_{t-1})$
$\tilde{\mathbf{h}}_t = \sigma_h(W \mathbf{x}_t + U(\mathbf{r}_t \odot \mathbf{h}_{t-1}))$
$\mathbf{h}_t = \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \tilde{\mathbf{h}}_t$
where $\sigma_h$ is either the logistic sigmoid function or the tanh function.
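A sketch of one GRU step following the equations above; the interpolation convention (which term $\mathbf{z}_t$ multiplies) varies between references, so this is one common choice rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(6)
input_dim, hidden_dim = 3, 4

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_weights():
    return (rng.normal(scale=0.3, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.3, size=(hidden_dim, hidden_dim)))

(W_z, U_z), (W_r, U_r), (W_h, U_h) = (make_weights() for _ in range(3))

def gru_step(x, h_prev):
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update gate
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h_prev))  # candidate state (sigma_h = tanh here)
    h = z * h_prev + (1.0 - z) * h_tilde             # interpolate old and candidate states
    return h                                         # the output equals the hidden state

h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h = gru_step(x, h)
```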