
L11. Word Representation


1. Word Representation

1.1 One-hot

One-hot representation
Dimensionality: the vocabulary size $|V|$, which could be in the millions.
The problems of one-hot representation:
  • Dimensionality is high
  • Does not represent the relationship between words
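As a concrete illustration of both points, here is a minimal NumPy sketch (the toy vocabulary and the example words are made up for illustration):

```python
import numpy as np

# Toy vocabulary (hypothetical); in practice |V| can be in the millions.
vocab = ["king", "queen", "apple", "orange"]
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector of a word."""
    v = np.zeros(len(vocab))
    v[word2id[word]] = 1.0
    return v

# Any two distinct words are orthogonal (dot product 0), so one-hot vectors
# carry no information about how similar two words are.
print(one_hot("king") @ one_hot("queen"))   # 0.0
```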

1.2 Co-occurrence Matrix

Motivation: we can get a lot of value by representing a word by means of its neighbors.
There are two options when using a co-occurrence matrix: full documents vs. windows.
  • A word-document co-occurrence matrix gives general topics, leading to “Latent Semantic Analysis”.
  • The window around each word captures both syntactic and semantic information.
Suppose we have $|V|$ words in the vocabulary and $M$ documents. For each word $w_i$, we can count its frequency of appearance in each document and obtain a vector $x_i \in \mathbb{R}^M$ to denote the frequency distribution.
Stacking all the $x_i$ together, we get a matrix $X \in \mathbb{R}^{|V| \times M}$.
Make a singular value decomposition
$$X = U \Sigma V^{\top}$$
and use the first $k$ columns of $U$ (or $U\Sigma$) as a representation of the vocabulary in a $k$-dimensional space.
The disadvantage of this method is its memory consumption: we need the whole matrix $X$, which is global information. Even so, this method is much better than the one-hot representation.
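A minimal NumPy sketch of this pipeline (the toy corpus, the resulting vocabulary, and the choice of $k$ are illustrative, not from the lecture):

```python
import numpy as np

# Toy corpus: each "document" is a list of tokens (purely illustrative).
docs = [["i", "like", "deep", "learning"],
        ["i", "like", "nlp"],
        ["i", "enjoy", "flying"]]
vocab = sorted({w for d in docs for w in d})
word2id = {w: i for i, w in enumerate(vocab)}

# Word-document count matrix X, shape (|V|, M).
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for w in doc:
        X[word2id[w], j] += 1

# SVD X = U diag(S) V^T; keep the first k columns as the word features.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]   # each row is a k-dimensional word representation
print(word_vectors.shape)          # (|V|, 2)
```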
Interpretation of SVD
The $i$-th row of $X$ can be interpreted as the $i$-th word’s coordinates in the high-dimensional space $\mathbb{R}^M$. For example, think of $X$ as a data matrix of firms, where the $i$-th row lists that firm’s characteristics; the high dimensionality corresponds to the fact that we observe many characteristics. To reduce dimensionality, we take linear combinations of the characteristics to generate a sequence of new characteristics (coordinates). The first linear combination maximizes the variance (maintaining as much information as possible):
$$\max_{\|v_1\| = 1} \; \|X v_1\|^2 = v_1^{\top} X^{\top} X \, v_1 .$$
Then $v_1$ is the eigenvector of $X^{\top} X$ with the largest eigenvalue, and the new coordinate is $z_1 = X v_1$.
Similarly, we can construct the second characteristic (coordinate) using a linear combination $z_2 = X v_2$, such that
$$\max_{\|v_2\| = 1} \; v_2^{\top} X^{\top} X \, v_2 \quad \text{subject to } v_2 \perp v_1 .$$
We obtain $v_2$, the eigenvector of $X^{\top} X$ associated with the second largest eigenvalue.
Following a similar procedure, we can obtain $v_1, \dots, v_k$, the leading eigenvectors of $X^{\top} X$ corresponding to the $k$ leading eigenvalues, and construct the new features
$$z_j = X v_j, \qquad j = 1, \dots, k .$$
This corresponds exactly to the first $k$ columns of $U$ in the singular value decomposition (SVD) $X = U \Sigma V^{\top}$, scaled by the singular values: $z_j = X v_j = \sigma_j u_j$, the $j$-th column of $U \Sigma$.
Thus, each column of $U\Sigma$ corresponds to one (new) feature of the words; the leading columns correspond to features with large variation, and the tail ones have low variation. Besides, the new features are mutually orthogonal, since the columns of $U$ are orthonormal.
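A quick numerical check of this correspondence, using a random matrix as a stand-in for $X$ (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # toy stand-in for the word-document matrix

# Eigen-decomposition of X^T X: columns of V_eig are the directions v_j.
eigvals, V_eig = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, V_eig = eigvals[order], V_eig[:, order]

# SVD of X = U diag(S) V^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(S**2, eigvals))                     # sigma_j^2 are the eigenvalues of X^T X
print(np.allclose(np.abs(X @ V_eig), np.abs(U * S)))  # X v_j = +/- sigma_j u_j
```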
Window-based co-occurrence matrix
In the window-based co-occurrence matrix, each row defines the features of a word: a word is described by the distribution of words that co-occur with it. In the document-based matrix, by contrast, a word is described by its distribution over documents. Similarly, once we obtain the co-occurrence matrix $X$, we can apply the SVD $X = U \Sigma V^{\top}$.
The procedure for constructing the word representation is the same as for the word-document co-occurrence matrix.
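A minimal sketch of building a window-based co-occurrence matrix (the token stream and the window size $m = 2$ are made up for illustration):

```python
import numpy as np

# Toy corpus as a single token stream (purely illustrative).
tokens = ["i", "like", "deep", "learning", "and", "i", "like", "nlp"]
vocab = sorted(set(tokens))
word2id = {w: i for i, w in enumerate(vocab)}
m = 2   # window size

# Symmetric window-based co-occurrence counts, shape (|V|, |V|).
X = np.zeros((len(vocab), len(vocab)))
for t, w in enumerate(tokens):
    for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
        if j != t:
            X[word2id[w], word2id[tokens[j]]] += 1

# The same SVD step as before gives k-dimensional word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]
```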

1.3 Dense Vector

The idea is to represent words by low-dimensional vectors and store “most” of the important information in a fixed, small number of dimensions: a dense vector.
  • Usually around 25~1000 dimensions
  • It’s easy to perform tasks like classification, generation, etc. based on this representation.
Methods to learn the word vectors include a neural probabilistic language model and a more recent, simpler and faster model (word2vec).

2. Word2Vec

Main idea of Word2Vec: instead of capturing co-occurrence counts directly (which is costly in time and computation since we need to process the whole dataset), we parameterize each word and then estimate the parameters (the word embeddings).
Notations:
  • Each word $w$ is represented by two vectors:
    • $v_w$: the coordinates when the word is used as a center word
    • $u_w$: the coordinates when the word is used as a surrounding (context) word
    • $v_w$ and $u_w$ can be estimated by minimizing a suitable loss function
    • Suppose we want to embed each word in $d$ dimensions, so $v_w, u_w \in \mathbb{R}^d$
  • $V \in \mathbb{R}^{|V| \times d}$:
    • vertically stacks the center-word representations; each row represents a word’s center-word representation.
    • A column represents a feature of the words.
  • $U \in \mathbb{R}^{d \times |V|}$:
    • horizontally stacks the context-word representations; each column represents a word’s context-word representation.
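In code, the two parameter matrices could be initialized as follows (a sketch only; the vocabulary size, dimension $d$, and initialization scale are arbitrary choices, not prescribed by the notes):

```python
import numpy as np

vocab_size, d = 10_000, 100   # |V| and the embedding dimension d (illustrative values)
rng = np.random.default_rng(0)

V = rng.normal(scale=0.01, size=(vocab_size, d))   # row i: center-word vector of word i
U = rng.normal(scale=0.01, size=(d, vocab_size))   # column i: context-word vector of word i

# After training, a common choice is to average the two representations of each word:
word_vectors = (V + U.T) / 2   # row i represents word i
```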

2.1 CBOW

CBOW Architecture
Continuous bag of words: use the surrounding words (context) to predict a center word.
Consider a sample $\{w_{t-m}, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_{t+m}\}$ with window size $m$,
and we try to minimize the negative log likelihood
$$J = -\log P\big(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}\big).$$
Denote the context representation of word $w_{t+j}$ as $u_{t+j}$. The CBOW method assumes that the simple average of the context word vectors is a sufficient statistic:
$$P\big(w_t \mid w_{t-m}, \dots, w_{t+m}\big) = \frac{\exp\!\big(v_{w_t}^{\top} \bar{u}\big)}{\sum_{w \in V} \exp\!\big(v_{w}^{\top} \bar{u}\big)},$$
where $\bar{u} = \frac{1}{2m} \sum_{-m \le j \le m,\, j \ne 0} u_{t+j}$ is the simple average of the surrounding words’ vectors.
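A sketch of the CBOW loss for one training example, assuming the full-softmax formulation above (the function name `cbow_loss` and the example index arguments are hypothetical):

```python
import numpy as np

def cbow_loss(V, U, context_ids, center_id):
    """Negative log likelihood of the center word given its context words.

    V: (|V|, d) center-word vectors (rows); U: (d, |V|) context-word vectors (columns).
    """
    u_bar = U[:, context_ids].mean(axis=1)   # average context vector, shape (d,)
    scores = V @ u_bar                       # v_w^T u_bar for every word w, shape (|V|,)
    scores = scores - scores.max()           # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log softmax over the vocabulary
    return -log_probs[center_id]

# Example with the toy matrices from the sketch above (indices are arbitrary):
# loss = cbow_loss(V, U, context_ids=[3, 7, 11, 42], center_id=5)
```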

2.2 Skip-Gram

Architecture of the Skip-Gram Model
$V \in \mathbb{R}^{|V| \times d}$, $U \in \mathbb{R}^{d \times |V|}$: each row of $V$ is the vector of a word in the vocabulary.
Each row of $V$ corresponds to a word: the “input vector”. Each column of $U$ corresponds to a word: the “output vector”. Clearly, we need to predefine the correspondence between the rows of $V$ and the words in the corpus; to train, the same word order must be used for the columns of $U$. After learning, the $i$-th row of $V$ and the $i$-th column of $U$ can be averaged to represent the $i$-th word.
Objective: given an input (center) word, maximize the probability of the surrounding words. The probabilities of the words at locations $t+j$ ($-m \le j \le m$, $j \ne 0$) are assumed conditionally independent given the center word, i.e.,
$$P\big(w_{t-m}, \dots, w_{t+m} \mid w_t\big) = \prod_{\substack{-m \le j \le m \\ j \ne 0}} P\big(w_{t+j} \mid w_t\big), \qquad P\big(w_{t+j} \mid w_t\big) = \frac{\exp\!\big(u_{w_{t+j}}^{\top} v_{w_t}\big)}{\sum_{w \in V} \exp\!\big(u_{w}^{\top} v_{w_t}\big)}.$$
During training, a word $w_{t+j}$ is given as the desired output word (observed), and the cross-entropy loss is
$$J_{t,j}(\theta) = -\log P\big(w_{t+j} \mid w_t\big).$$
We have the same requirement at the other locations in the window. Summing the losses over all locations, we obtain
$$J_t(\theta) = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\big(w_{t+j} \mid w_t\big),$$
where $m$ is the window size and $\theta$ denotes all parameters.
Averaging over all positions $t$ of the input (center) word,
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P\big(w_{t+j} \mid w_t\big).$$
This is equivalent to maximizing the average log probability. The backpropagation (BP) algorithm and SGD are used to train the model.
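A sketch of the skip-gram loss for one center word and its window, again with the full softmax and no negative sampling (the function name `skipgram_window_loss` and its arguments are hypothetical):

```python
import numpy as np

def skipgram_window_loss(V, U, center_id, context_ids):
    """Sum of the cross-entropy losses -log P(w_{t+j} | w_t) over one window.

    V: (|V|, d) center-word ("input") vectors; U: (d, |V|) context ("output") vectors.
    """
    v_c = V[center_id]                        # center-word vector, shape (d,)
    scores = v_c @ U                          # u_w^T v_c for every word w, shape (|V|,)
    scores = scores - scores.max()            # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log softmax over the vocabulary
    return -log_probs[context_ids].sum()      # sum over the observed context words

# Averaging this loss over all center positions t gives J(theta),
# which is then minimized with backpropagation and SGD.
```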
