
L12. Sentence Processing


1. Bag of Words

Without word embeddings, one method to represent sentences/documents is the bag of words.
notion image
 
However, the disadvantage of this method is that it ignores word order. For example, the two sentences below have the same bag of words while their meanings are reversed (see the short sketch after these examples).
  • “white blood cells destroying an infection”
  • “an infection destroying white blood cells”
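As a quick illustration, the following minimal Python sketch (with a toy vocabulary built from just these two sentences, purely for demonstration) shows that both sentences map to the same count vector:

```python
# Minimal bag-of-words sketch: order information is lost.
from collections import Counter

def bag_of_words(sentence, vocab):
    counts = Counter(sentence.lower().split())
    return [counts[w] for w in vocab]

s1 = "white blood cells destroying an infection"
s2 = "an infection destroying white blood cells"
vocab = sorted(set(s1.split()) | set(s2.split()))

print(bag_of_words(s1, vocab) == bag_of_words(s2, vocab))  # True: same representation
```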

2. Text Classification using CNN

The architecture is shown as follows.
notion image
Each word is represented by a $d$-dimensional real vector, and each vector corresponds to a different word in the sentence.
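A minimal PyTorch sketch of this kind of model (one 1-D convolution along the time axis followed by global max pooling over all time steps; all sizes and the single kernel width are illustrative assumptions, not the exact configuration in the figure):

```python
import torch
import torch.nn as nn

class SimpleTextCNN(nn.Module):
    """One conv layer over time + global max pooling, then a linear classifier."""
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_size=3, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolution along the time axis; each kernel yields one feature map
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, tokens):              # tokens: (batch, seq_len)
        x = self.embed(tokens)              # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)               # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))        # (batch, num_filters, seq_len - k + 1)
        x = x.max(dim=2).values             # global max pooling over time -> one scalar per filter
        return self.fc(x)                   # (batch, num_classes)

logits = SimpleTextCNN(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
```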
The problem of this model:
  • It’s not deep enough: there is only one conv layer and one pooling layer
  • Features are not diverse enough:
    • each convolutional kernel results in a 1D feature map, i.e., a feature vector
    • the global max pooling on one feature vector, i.e., max pooling over all time, results in a scalar
A deeper model is shown below.
notion image
  • 1-D conv along the time axis
  • K-max pooling over time
  • Folding: element-wise summation of every two adjacent rows, like a special average pooling
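K-max pooling over time can be sketched with torch.topk; this is an illustrative implementation (an assumption about the exact variant used) that keeps the k largest activations of each feature map in their original temporal order:

```python
import torch

def k_max_pooling(x, k, dim=-1):
    """Keep the k largest values along `dim`, preserving their original order."""
    index = x.topk(k, dim=dim).indices.sort(dim=dim).values
    return x.gather(dim, index)

feature_maps = torch.randn(4, 100, 37)     # (batch, channels, time)
pooled = k_max_pooling(feature_maps, k=5)  # (batch, 100, 5)
```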

3. Text Classification using RNN

Unfolding of the Elman network:
notion image
notion image
Remarks:
  • LSTM, GRU, Bidirectional RNN can be used
  • The input $x_t$ is time-varying
  • label is only present at the last step
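A minimal PyTorch sketch of such a recurrent classifier, using an LSTM and reading the label only from the last step (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RNNTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # GRU or a bidirectional LSTM could be substituted here
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        outputs, (h_n, c_n) = self.rnn(self.embed(tokens))
        return self.fc(h_n[-1])                         # label read from the last step only

logits = RNNTextClassifier(vocab_size=10000)(torch.randint(0, 10000, (4, 20)))
```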

3.1 Seq2seq Learning

Usually it involves two RNNs: an encoder and a decoder. Many NLP tasks can be phrased as sequence-to-sequence:
  • Machine translation (French → English)
  • Summarization (long text → short text)
  • Dialogue (previous utterances → next utterance)
  • Parsing (input text → output parse as a sequence)
  • Code generation (natural language → Python code, etc.)
Neural Machine Translation (NMT)
notion image
The encoder RNN produces an encoding of the source sentence, while the decoder RNN generates the target sentence conditioned on that encoding.
For the encoder, one can use either a pretrained word embedding, e.g., word2vec, or a one-hot representation. For the decoder, the one-hot representation is used:
  • We need a probability for each reference word to define the objective function
  • The softmax function is usually used at the output
The encoder RNN and decoder RNN are often different, and deep RNNs can be used.
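A compact PyTorch sketch of such an encoder-decoder pair (a single-layer GRU on each side and a linear layer producing logits over the target vocabulary, to which softmax/cross-entropy is applied; all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)   # softmax is applied by the loss

    def forward(self, src_tokens, tgt_tokens):
        _, h = self.encoder(self.src_embed(src_tokens))            # final encoder state
        dec_out, _ = self.decoder(self.tgt_embed(tgt_tokens), h)   # conditioned on the encoding
        return self.out(dec_out)                                   # (batch, tgt_len, tgt_vocab)

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
logits = model(torch.randint(0, 8000, (4, 12)), torch.randint(0, 8000, (4, 10)))
```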
Conditional Language Model
The seq2seq model is an example of a conditional language model:
  • Language Model: the decoder is predicting the next word of the target sentence
  • Conditional: its predictions are also conditioned on the source sentence
NMT directly calculates $P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_1, \ldots, y_{t-1}, x)$, where $P(y_t \mid y_1, \ldots, y_{t-1}, x)$ denotes the probability of the next target word, given the target words so far and the source sentence $x$.
Training an NMT system
notion image
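Concretely, a standard training objective (consistent with the conditional language model view above; notation as in the previous equation) is the average negative log-likelihood of the reference target words:
$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} -\log P(y_t \mid y_1, \ldots, y_{t-1}, x)$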
Note that seq2seq is optimized as a single system. A problem with this architecture is that all information is encoded in the final state; that is, the last encoder state needs to capture all information about the source sentence, which is challenging (the bottleneck problem).

3.2 Seq2seq with Attention

Attention provides a solution to the bottleneck problem. The core idea is that on each step of the decoder, we should focus on a particular part of the source sequence.
notion image
We give the most important information the largest weight by using weights $\alpha_i \propto f(k_i, q)$, where $f$ is a function reflecting the relationship (similarity) between $k_i$ and $q$; $k_i$ is called the “key” and $q$ the “query”. $f$ could be the dot product if $k_i$ and $q$ have the same dimension.
As the figures below show,
notion image
the attention output is the weighted sum of the encoder hidden states.
Formulation (Cross attention)
Assume the encoder's hidden states are $h_1, \ldots, h_N \in \mathbb{R}^h$. At time step $t$, the decoder's hidden state is $s_t \in \mathbb{R}^h$.
  • We get the attention scores for this step: $e^t = [\,s_t^\top h_1, \ldots, s_t^\top h_N\,] \in \mathbb{R}^N$
  • Take softmax to get the attention distribution for this step: $\alpha^t = \operatorname{softmax}(e^t) \in \mathbb{R}^N$
  • Calculate the attention output: $a_t = \sum_{i=1}^{N} \alpha_i^t h_i \in \mathbb{R}^h$
  • Finally, concatenate the attention output with the decoder hidden state, $[a_t; s_t] \in \mathbb{R}^{2h}$, and proceed as in the non-attention seq2seq model
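A minimal NumPy sketch of these four steps for a single decoder time step (the dot product is used as the score function; all dimensions are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.random.randn(6, 512)          # encoder hidden states h_1..h_N (N = 6, dim 512)
s_t = np.random.randn(512)           # decoder hidden state at step t

e_t = h @ s_t                        # attention scores e^t, shape (N,)
alpha_t = softmax(e_t)               # attention distribution
a_t = alpha_t @ h                    # attention output: weighted sum of encoder states
concat = np.concatenate([a_t, s_t])  # concatenated vector, used as in the non-attention model
```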

4. Self-attention

4.1 Motivation

notion image
  • For an RNN, to have interactions between two distant words, it needs $O(\text{sequence length})$ steps
    • the memory of an RNN may not be that long
    • it cannot be computed in parallel
  • For a CNN, we need to go to deeper layers, but by then the representations of the words have changed a lot, which may not be what we want.
  • An MLP is a direct way to solve this problem, but it cannot handle variable-length input.
Attention is a satisfying solution.

4.2 Construction & Formulation

Recap the cross attention
notion image
Step 1: drop the decoder and consider the encoder alone.
notion image
Step 2: do not use RNN any more.
The variable length input problem is solved now.
notion image
In the output $y_j = \sum_i \alpha_{ji} x_i$, every pair of words interacts directly. If we can determine the weights $\alpha_{ji}$ from the input words automatically, then no matter what the input length $T$ is, we obtain one output $y_j$ per query, and we can predefine the number of outputs $y_j$. A simple method to determine the weights is to use $x_j$ itself as the query; then the weight matrix $(\alpha_{ji})$ is a square $T \times T$ matrix, and the number of $y_j$'s becomes $T$.
Formal formulation
Input: $x_1, \ldots, x_T \in \mathbb{R}^d$; output: $y_j = \sum_{i=1}^{T} \operatorname{softmax}_i\big(f(x_j, x_i)\big)\, x_i$.
However, the above formulation has no parameters. To parameterize the keys, queries and values, let
$q_j = W^Q x_j$, $k_i = W^K x_i$, $v_i = W^V x_i$, and $y_j = \sum_{i=1}^{T} \operatorname{softmax}_i\big(f(q_j, k_i)\big)\, v_i$.
Remarks:
  • If we use the dot product (often the case) for $f$, then $y_j = \sum_{i} \operatorname{softmax}_i(q_j^\top k_i)\, v_i$.
  • $W^V$ is used to set the dimension of the output $y_j$; otherwise the output would have the same dimension as the input.

4.3 Attention Block

We can simply stack the self-attention layers as follows.
notion image
Shortcomings:
  • No elementwise nonlinearities
    • Stacking more self-attention layers just re-averages value vectors (still a linear function)
  • It does not know the order of the input (thus Transformer introduces Position encoding)
Add elementwise nonlinearity: add a feed-forward network to post-process each output vector:
notion image
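A sketch of one such block in PyTorch, using the built-in nn.MultiheadAttention for the self-attention part and a small position-wise feed-forward network for the elementwise nonlinearity (residual connections, layer normalization and position encoding, which the full Transformer adds, are deliberately omitted; sizes are illustrative):

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Self-attention followed by a position-wise feed-forward network."""
    def __init__(self, d_model=256, num_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                    # the elementwise nonlinearity
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        y, _ = self.attn(x, x, x)         # queries, keys and values all come from x
        return self.ffn(y)                # post-process each output vector independently

out = AttentionBlock()(torch.randn(2, 10, 256))   # (2, 10, 256)
```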

5. Transformer

The architecture of the Transformer is shown below.
notion image
Self-attention in Matrix-vector Form
Let's use the dot product for $f$, so $y_j = \sum_i \operatorname{softmax}_i(q_j^\top k_i)\, v_i$. Let $X = [x_1, \ldots, x_T]^\top \in \mathbb{R}^{T \times d}$ (one input per row); then $Q = X W^Q$, $K = X W^K$, $V = X W^V$, and the output is
$Y = \operatorname{softmax}(Q K^\top)\, V$
Remarks:
  • Softmax is applied in each row
  • The $j$-th row of $\operatorname{softmax}(QK^\top)$ contains the weights $\operatorname{softmax}_i(q_j^\top k_i)$ ($i = 1, \ldots, T$) used to weight the rows of $V$
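A NumPy sketch of this matrix form with learned projection matrices (randomly initialized here purely for illustration):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)       # softmax applied to each row

T, d, d_out = 10, 64, 64
X = np.random.randn(T, d)                          # one input vector per row
W_Q, W_K = np.random.randn(d, d), np.random.randn(d, d)
W_V = np.random.randn(d, d_out)                    # W^V sets the output dimension

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
Y = softmax_rows(Q @ K.T) @ V                      # row j weights the rows of V
```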
Multi-headed Attention
To increase the capacity of the model, we introduce multiple projection matrices for the input $X$. Suppose there are $H$ heads; for $h = 1, \ldots, H$,
$\operatorname{head}_h = \operatorname{softmax}\big(Q_h K_h^\top\big)\, V_h$,
where $Q_h = X W_h^Q$, $K_h = X W_h^K$, $V_h = X W_h^V$, with $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{d \times d/H}$.
notion image
Each attention head performs attention independently
The outputs of all the heads are then concatenated, $Y = [\operatorname{head}_1, \ldots, \operatorname{head}_H]$, so the dimension of the attention output is the same as before.
Scaled dot product
For each head in the above equation, when the per-head dimension $d/H$ is large, the dot products between vectors tend to become large in magnitude. When the inputs to the softmax function are too large, the gradients become small. In practice, we use the scaled dot product: $\operatorname{head}_h = \operatorname{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d/H}}\right) V_h$.
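A NumPy sketch of multi-head attention with the scaled dot product, splitting the model dimension across heads (sizes are illustrative assumptions):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d, H = 10, 64, 8
d_k = d // H                                   # per-head dimension d/H
X = np.random.randn(T, d)

heads = []
for h in range(H):
    W_Q, W_K, W_V = (np.random.randn(d, d_k) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_k)            # scaling keeps softmax inputs moderate
    heads.append(softmax_rows(scores) @ V)     # each head attends independently

Y = np.concatenate(heads, axis=-1)             # combined output: (T, d), same as before
```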
Transformer encoder and decoder
notion image
Self-attention vs. Cross-attention
In self-attention, the keys, queries and values come from the same source
notion image
In cross-attention
  • The keys and values come from the encoder’s output
  • The queries come from the decoder’s input
notion image
Other operations are the same as in the self-attention
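A small PyTorch sketch of the cross-attention step, again with nn.MultiheadAttention: the keys and values come from the encoder output and the queries from the decoder side (shapes are illustrative):

```python
import torch
import torch.nn as nn

d_model, num_heads = 256, 4
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(2, 12, d_model)   # keys and values: from the encoder
decoder_states = torch.randn(2, 7, d_model)    # queries: from the decoder

out, weights = cross_attn(query=decoder_states,
                          key=encoder_output,
                          value=encoder_output)
print(out.shape)   # torch.Size([2, 7, 256]) -- one output per decoder position
```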
Supplement Video: Illustrated Guide to Transformers Neural Network: A step by step explanation
Elaboration of the Transformer based on the paper Attention Is All You Need
