1. Bag of Words
Without word embeddings, one simple way to represent sentences/documents is the bag-of-words model.
However, the disadvantage of this method is that it discards word order. For example, the two sentences below have the same bag of words even though their meanings are opposite (verified in the sketch after them).
- “white blood cells destroying an infection”
- “an infection destroying white blood cells”
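As a minimal check in plain Python (whitespace tokenization is an assumption for simplicity), the two sentences above indeed produce identical bag-of-words counts:

```python
from collections import Counter

s1 = "white blood cells destroying an infection"
s2 = "an infection destroying white blood cells"

# A bag of words keeps only the multiset of tokens, not their order.
bow1 = Counter(s1.split())
bow2 = Counter(s2.split())

print(bow1)          # Counter({'white': 1, 'blood': 1, 'cells': 1, ...})
print(bow1 == bow2)  # True: both sentences get the same representation
```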
2. Text Classification using CNN
The architecture is shown below.
Each word is represented by a $d$-dimensional real vector, and each vector corresponds to a different word.
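A minimal sketch of this single-convolution-layer classifier, assuming PyTorch; the names and hyperparameters (vocab_size, embed_dim, num_filters, kernel_size, num_classes) are illustrative choices, not values from the notes:

```python
import torch
import torch.nn as nn

class ShallowTextCNN(nn.Module):
    """One convolutional layer + global max pooling over time, as described above."""
    def __init__(self, vocab_size=10000, embed_dim=128, num_filters=100,
                 kernel_size=3, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # each word -> d-dim vector
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)   # 1-D conv along the time axis
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, tokens):                  # tokens: (batch, seq_len) word indices
        x = self.embed(tokens).transpose(1, 2)  # (batch, embed_dim, seq_len)
        h = torch.relu(self.conv(x))            # (batch, num_filters, seq_len - kernel_size + 1)
        h = h.max(dim=2).values                 # global max pooling over time: one scalar per feature map
        return self.fc(h)                       # class scores

logits = ShallowTextCNN()(torch.randint(0, 10000, (4, 20)))  # 4 sentences of 20 tokens
print(logits.shape)                                          # torch.Size([4, 2])
```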
Problems with this model:
- It is not deep enough: there is only one convolutional layer and one pooling layer
- Features are not diverse enough:
- each convolutional kernel results in a 1D feature map, i.e., a feature vector
- the global max pooling on one feature vector, i.e., max pooling over all time, results in a scalar
A deeper model is shown below:
- 1-D conv along the time axis
- K-max pooling over time
- Folding: elementwise summation of every two adjacent rows, acting like a special average pooling (both operations are sketched below)
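Both operations can be sketched in a few lines, assuming PyTorch tensors of shape (batch, channels, time); the function names are hypothetical:

```python
import torch

def kmax_pooling_over_time(x, k):
    """Keep the k largest activations along the time axis, preserving their order."""
    _, idx = x.topk(k, dim=2)                  # indices of the k largest values per channel
    idx, _ = idx.sort(dim=2)                   # restore the original temporal order
    return x.gather(2, idx)                    # (batch, channels, k)

def folding(x):
    """Sum every two adjacent rows (channels), halving the number of rows."""
    b, c, t = x.shape                          # c is assumed even
    return x.view(b, c // 2, 2, t).sum(dim=2)  # (batch, channels // 2, time)

x = torch.randn(4, 6, 10)                      # 4 examples, 6 feature rows, 10 time steps
print(kmax_pooling_over_time(x, k=3).shape)    # torch.Size([4, 6, 3])
print(folding(x).shape)                        # torch.Size([4, 3, 10])
```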
3. Text Classification using RNN
Unfolding the Elman network:
Remarks:
- LSTM, GRU, or bidirectional RNNs can be used
- The hidden state $h_t$ is time-varying
- The label is only present at the last time step (see the sketch below)
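A minimal sketch of such a recurrent classifier in PyTorch, where only the final hidden state feeds the output layer; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class RNNTextClassifier(nn.Module):
    """Elman-style recurrent classifier: the label is read off the final hidden state."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # nn.RNN is the Elman network; nn.LSTM / nn.GRU or bidirectional=True work as noted above
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        x = self.embed(tokens)                 # (batch, seq_len, embed_dim)
        _, h_last = self.rnn(x)                # h_last: (1, batch, hidden_dim), the last time step
        return self.fc(h_last.squeeze(0))      # logits computed only from the last step

logits = RNNTextClassifier()(torch.randint(0, 10000, (4, 20)))
print(logits.shape)                            # torch.Size([4, 2])
```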
3.1 Seq2seq Learning
Usually it involves two RNNs: an encoder and a decoder. Many NLP tasks can be phrased as sequence-to-sequence problems:
- Machine translation (French → English)
- Summarization (long text → short text)
- Dialogue (previous utterances →next utterance)
- Parsing (input text → output parse as a sequence)
- Code generation (natural language → Python code, etc.)
Neural Machine Translation (NMT)
The encoder RNN produces an encoding of the source sentence, while the decoder RNN generates the target sentence conditioned on that encoding.
For the encoder, one can use either pretrained word embeddings (e.g., word2vec) or one-hot representations. For the decoder, the one-hot representation is used:
- We need a probability for each reference word to define the objective function
- The softmax function is usually used at the output
The encoder RNN and decoder RNN are often different, and deep RNNs can be used.
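A minimal GRU-based encoder-decoder sketch in PyTorch (vocabulary sizes, dimensions, and dummy data are illustrative assumptions); the decoder is conditioned on the encoder's final state and trained with a next-word cross-entropy objective:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: the decoder is conditioned on the encoder's final hidden state."""
    def __init__(self, src_vocab=8000, tgt_vocab=8000, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)           # scores -> softmax over target vocabulary

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.src_embed(src))              # h: encoding of the source sentence
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in), h)  # decoder conditioned on the encoding
        return self.out(dec_out)                              # (batch, tgt_len, tgt_vocab) logits

src = torch.randint(0, 8000, (4, 15))                         # source sentences
tgt_in = torch.randint(0, 8000, (4, 12))                      # gold target prefixes (teacher forcing)
tgt_ref = torch.randint(0, 8000, (4 * 12,))                   # dummy reference words, for shape only
logits = Seq2Seq()(src, tgt_in)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 8000), tgt_ref)  # next-word objective
print(loss.item())
```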
Conditional Language Model
The seq2seq model is an example of a conditional language model:
- Language Model: the decoder is predicting the next word of the target sentence
- Conditional: its predictions are also conditioned on the source sentence
NMT directly calculates $P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)$, where $x$ is the source sentence and $y = (y_1, \ldots, y_T)$ is the target sentence;
$P(y_t \mid y_1, \ldots, y_{t-1}, x)$ denotes the probability of the next target word, given the target words so far and the source sentence $x$.
Training an NMT system
Note that seq2seq is optimized as a single end-to-end system. A problem with this architecture is that all information about the source sentence must be captured by the encoder's final hidden state, which is challenging (the bottleneck problem).
3.2 Seq2seq with Attention
Attention provides a solution to the bottleneck problem. The core idea is that on each step of the decoder, we should focus on a particular part of the source sequence.
We give the most important information the largest weight by computing $\alpha_i = f(k_i, q)$, where $f$ is a function reflecting the relationship (similarity) between $k_i$ and $q$. Here $k_i$ is called the "key" and $q$ is called the "query"; $f$ could be the dot product if $k_i$ and $q$ have the same dimension.
As the figures below show,
the attention output is the weighted sum of the encoder hidden states.
Formulation (Cross attention)
Assume the encoder's hidden states are $h_1, \ldots, h_N \in \mathbb{R}^{d}$. At time step $t$, the decoder's hidden state is $s_t \in \mathbb{R}^{d}$.
- We get the attention scores for this step: $e^t = [s_t^\top h_1, \ldots, s_t^\top h_N] \in \mathbb{R}^{N}$
- Take softmax to get the attention distribution for this step: $\alpha^t = \operatorname{softmax}(e^t) \in \mathbb{R}^{N}$
- Calculate the attention output: $a_t = \sum_{i=1}^{N} \alpha^t_i\, h_i \in \mathbb{R}^{d}$
- Finally, concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2d}$, and proceed as in the non-attention seq2seq model (a step-by-step sketch follows)
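A NumPy sketch of one decoder step with attention, following the formulation above (N, d and the random states are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

N, d = 6, 8                          # 6 encoder states of dimension d
H = np.random.randn(N, d)            # encoder hidden states h_1, ..., h_N (one per row)
s_t = np.random.randn(d)             # decoder hidden state at step t

e_t = H @ s_t                        # attention scores: e^t_i = s_t . h_i
alpha_t = softmax(e_t)               # attention distribution (sums to 1)
a_t = alpha_t @ H                    # attention output: weighted sum of encoder states
concat = np.concatenate([a_t, s_t])  # [a_t; s_t], fed to the rest of the decoder step
print(alpha_t.sum(), concat.shape)   # ~1.0, (16,)
```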
4. Self-attention
4.1 Motivation
- For an RNN, two distant words interact only after a number of steps proportional to their distance
  - the memory of an RNN may not be that long
  - it cannot be computed in parallel
- For a CNN, distant words interact only in deeper layers, where the word representations have already changed a lot, which may not be what we want.
- An MLP is a direct method to solve this problem, but it cannot handle variable-length input.
Attention is a satisfying solution.
4.2 Construction & Formulation
Recall the cross-attention above.
Step 1: drop the decoder and consider the encoder alone.
Step 2: do not use RNN any more.
The variable-length input problem is now solved.
In the output $a = \sum_{i=1}^{n} \alpha_i x_i$, every pair of words interacts directly. If we could decide the weights $\alpha_i$ from the input words automatically, then no matter what the query is, we get one output $a$. We can predefine the number of outputs $a$. A simple method to determine the weights is to use $x_t$ itself as the query; then $[\alpha_{ti}]$ is an $n \times n$ square matrix, and the number of outputs $a_t$ becomes $n$, one per input word.
Formal formulation
Input: $x_1, \ldots, x_n \in \mathbb{R}^{d}$; output: $a_t = \sum_{i=1}^{n} \alpha_{ti}\, x_i$ with $\alpha_{ti} = \operatorname{softmax}_i\big(f(x_t, x_i)\big)$, for $t = 1, \ldots, n$.
However, the above formulation has no parameters. To parameterize the keys, queries and values (treating each $x_t$ as a row vector):
$$q_t = x_t W_Q, \qquad k_i = x_i W_K, \qquad v_i = x_i W_V, \qquad W_Q, W_K \in \mathbb{R}^{d \times d_k},\; W_V \in \mathbb{R}^{d \times d_v},$$
and
$$a_t = \sum_{i=1}^{n} \alpha_{ti}\, v_i, \qquad \alpha_{ti} = \operatorname{softmax}_i\big(f(q_t, k_i)\big).$$
Remarks:
- If we use the dot product (often the case) for $f$, then $\alpha_{ti} = \operatorname{softmax}_i\big(q_t k_i^\top\big) = \operatorname{softmax}_i\big(x_t W_Q W_K^\top x_i^\top\big)$.
- $W_V$ is used to set the dimension of the output $a_t$ to $d_v$; otherwise the output would have the same dimension $d$ as the input (see the sketch below).
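A NumPy sketch of this parameterized self-attention in its per-token form, using the dot product for $f$ (all sizes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

n, d, d_k, d_v = 5, 8, 4, 6              # number of words and dimensions (illustrative)
X = np.random.randn(n, d)                # inputs x_1, ..., x_n as rows
W_Q, W_K = np.random.randn(d, d_k), np.random.randn(d, d_k)
W_V = np.random.randn(d, d_v)

A = np.zeros((n, d_v))
for t in range(n):
    q_t = X[t] @ W_Q                                           # query for position t
    scores = np.array([q_t @ (X[i] @ W_K) for i in range(n)])  # f = dot product with each key
    alpha_t = softmax(scores)                                  # weights alpha_{t,1..n}
    A[t] = sum(alpha_t[i] * (X[i] @ W_V) for i in range(n))    # a_t = sum_i alpha_ti v_i

print(A.shape)                           # (5, 6): one output per input, of dimension d_v
```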
4.3 Attention Block
We can simply stack self-attention layers as follows.
Shortcomings:
- No elementwise nonlinearities
- Stacking more self-attention layers just re-averages the value vectors (the result is still a linear function of them)
- It does not know the order of the input (hence the Transformer introduces positional encoding)
Add elementwise nonlinearity: add a feed-forward network to post-process each output vector:
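A common concrete choice for this feed-forward network (borrowed from the Transformer; the exact form is an assumption here, not given in the notes) is a two-layer MLP with a ReLU, applied to each attention output $a_t$ independently:

$$m_t = \max\big(0,\; a_t W_1 + b_1\big)\, W_2 + b_2.$$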
5. Transformer
The architecture of the Transformer is shown below.
Self-attention in Matrix-vector Form
Let us use the dot product for $f$; then $\alpha_{ti} = \operatorname{softmax}_i\big(x_t W_Q W_K^\top x_i^\top\big)$. Let $X \in \mathbb{R}^{n \times d}$ be the matrix whose $t$-th row is the input $x_t$; then the weights $[\alpha_{ti}]$ form the $n \times n$ matrix $\operatorname{softmax}\big(X W_Q W_K^\top X^\top\big)$.
Let $A \in \mathbb{R}^{n \times d_v}$ be the matrix whose $t$-th row is the output $a_t$;
then
$$A = \operatorname{softmax}\big(X W_Q W_K^\top X^\top\big)\, X W_V.$$
Remarks:
- Softmax is applied in each row
- The $t$-th row of $\operatorname{softmax}\big(X W_Q W_K^\top X^\top\big)$ contains $\alpha_{t1}, \ldots, \alpha_{tn}$, which weight the rows of $X W_V$ (see the sketch below)
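The same computation as a NumPy sketch in matrix form (sizes are illustrative):

```python
import numpy as np

def row_softmax(Z):
    """Softmax applied to each row of Z."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

n, d, d_k, d_v = 5, 8, 4, 6
X = np.random.randn(n, d)                            # rows of X are the inputs x_1, ..., x_n
W_Q, W_K = np.random.randn(d, d_k), np.random.randn(d, d_k)
W_V = np.random.randn(d, d_v)

A = row_softmax(X @ W_Q @ W_K.T @ X.T) @ (X @ W_V)   # A = softmax(X W_Q W_K^T X^T) X W_V
print(A.shape)                                       # (5, 6): row t of A is a_t
```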
Multi-headed Attention
To increase the capacity of the model, we introduce multiple sets of $(W_Q, W_K, W_V)$ matrices, i.e., multiple queries, keys and values for each input $x_t$. Suppose there are $m$ heads; for $\ell = 1, \ldots, m$,
$$A^{(\ell)} = \operatorname{softmax}\big(X W_Q^{(\ell)} W_K^{(\ell)\top} X^\top\big)\, X W_V^{(\ell)},$$
where $W_Q^{(\ell)}, W_K^{(\ell)} \in \mathbb{R}^{d \times d_k}$ and $W_V^{(\ell)} \in \mathbb{R}^{d \times d_v/m}$.
Each attention head performs attention independently.
The outputs of all the heads are then combined by concatenation, $A = [A^{(1)}, \ldots, A^{(m)}] \in \mathbb{R}^{n \times d_v}$, so
the dimension of the attention output is the same as before.
Scaled dot product
For each head $\ell$ in the above equation, when the dimension $d_k$ is large, the dot products between vectors tend to become large. When the inputs to the softmax function are too large, the gradients become small. In practice, we use the scaled dot product:
$$A^{(\ell)} = \operatorname{softmax}\!\left(\frac{X W_Q^{(\ell)} W_K^{(\ell)\top} X^\top}{\sqrt{d_k}}\right) X W_V^{(\ell)}.$$
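A NumPy sketch of multi-headed attention with the scaled dot product; the per-head value dimension is set to $d_v/m$ so that concatenating the heads recovers the previous output dimension (all sizes are illustrative):

```python
import numpy as np

def row_softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def scaled_attention_head(X, W_Q, W_K, W_V):
    d_k = W_Q.shape[1]
    scores = (X @ W_Q @ W_K.T @ X.T) / np.sqrt(d_k)           # scaled dot product
    return row_softmax(scores) @ (X @ W_V)

n, d, m = 5, 8, 2                                             # m heads
d_k, d_v = 4, 6                                               # d_v: overall output dimension
X = np.random.randn(n, d)
heads = [scaled_attention_head(X,
                               np.random.randn(d, d_k),
                               np.random.randn(d, d_k),
                               np.random.randn(d, d_v // m))  # per-head value dim d_v / m
         for _ in range(m)]                                   # each head attends independently
A = np.concatenate(heads, axis=1)                             # combine the heads by concatenation
print(A.shape)                                                # (5, 6): same dimension as before
```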
Transformer encoder and decoder
Self-attention vs. Cross-attention
In self-attention, the keys, queries and values come from the same source.
In cross-attention:
- The keys and values come from the encoder’s output
- The queries come from the decoder’s input
Other operations are the same as in self-attention (see the sketch below).
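A NumPy sketch of one cross-attention computation under the same conventions (sizes are illustrative): queries come from the decoder side, keys and values from the encoder's output:

```python
import numpy as np

def row_softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

n_src, n_tgt, d, d_k, d_v = 7, 5, 8, 4, 8
Enc = np.random.randn(n_src, d)                # encoder output (one row per source position)
Dec = np.random.randn(n_tgt, d)                # decoder-side inputs (one row per target position)
W_Q, W_K = np.random.randn(d, d_k), np.random.randn(d, d_k)
W_V = np.random.randn(d, d_v)

Q = Dec @ W_Q                                  # queries come from the decoder
K, V = Enc @ W_K, Enc @ W_V                    # keys and values come from the encoder's output
A = row_softmax((Q @ K.T) / np.sqrt(d_k)) @ V  # each target position attends over all source positions
print(A.shape)                                 # (5, 8): one output per target position
```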