Word Embeddings

In the previous sections we learned how to use recurrent neural networks to deal with sequences. Yet we still face a problem that we need to solve before we can train those models on textual data.

Info

A neural network takes only numerical values as input, while text is represented as a sequence of characters or words. In order to make text compatible with neural networks, it needs to be transformed into a numerical representation. In other words: text needs to be vectorized.

In order to get a good intuition for the vectorization process, let's work through a dummy example. Throughout this section we will assume that our dataset consists of these two sentences.

Charles III is the king of the United Kingdom.

Queen Elizabeth II ruled for 70 years.

In the first step of the transformation process we need to tokenize our sentences. During the tokenization process we divide the sentence into its atomic parts, so-called tokens. While we could theoretically divide a sentence into individual characters, usually a sentence is divided into individual words (or subwords). Tokenization can be a daunting task, so we will stick to the basics here.

Charles | III | is | the | king | of | the | United | Kingdom | .
Queen | Elizabeth | II | ruled | for | 70 | years | .

During tokenization, the words are often also standardized by stripping punctuation and turning letters into their lowercase counterparts.

charles | iii | is | the | king | of | the | united | kingdom
queen | elizabeth | ii | ruled | for | 70 | years
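
To make this concrete, a minimal tokenizer along these lines might look as follows. The whitespace split, the lower-casing and the regular expression used for stripping punctuation are just illustrative choices, not a canonical implementation.

import re

def tokenize(sentence):
    # split on whitespace, lower-case each word and strip punctuation
    tokens = []
    for word in sentence.split():
        token = re.sub(r"[^\w]", "", word.lower())
        if token:  # drop tokens that consisted only of punctuation
            tokens.append(token)
    return tokens

print(tokenize("Charles III is the king of the United Kingdom."))
['charles', 'iii', 'is', 'the', 'king', 'of', 'the', 'united', 'kingdom']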

Once we have tokenized all words, we can create a vocabulary. A vocabulary is the set of all available tokens. For the sentences above we will end up with the following vocabulary.

<unk>
<pad>
charles
is
the
king
of
united
kingdom
queen
elizabeth
ruled
for
70
years

You have probably noticed that in addition to the tokens we have derived from our dataset, we have also introduced <pad> and <unk>. For the most part our sentences are going to have different lengths, but if we want to use batches of samples, we need to standardize the length of the sequences. For that purpose we use padding, which means that we fill the shorter sentences with <pad> tokens. The token for unknown words, <unk>, is used for words that are outside of the vocabulary. This happens for example if the vocabulary that is built from the training dataset does not contain some words from the testing dataset. Additionally, we often limit the size of the vocabulary in order to save computational power. In our example we assume that roman numerals are extremely rare and replace them with the <unk> token.

charles | <unk> | is | the | king | of | the | united | kingdom
queen | elizabeth | <unk> | ruled | for | 70 | years | <pad> | <pad>
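
A rough sketch of how the vocabulary, the <unk> replacement and the padding could be produced in plain Python is shown below. The helper names (tokenized, encode) and the hard-coded set of "rare" roman numerals are assumptions made purely for illustration.

from collections import Counter

tokenized = [
    ["charles", "iii", "is", "the", "king", "of", "the", "united", "kingdom"],
    ["queen", "elizabeth", "ii", "ruled", "for", "70", "years"],
]

# count token frequencies and drop the tokens we pretend are too rare
counts = Counter(token for sentence in tokenized for token in sentence)
rare = {"iii", "ii"}
vocab = ["<unk>", "<pad>"] + [token for token in counts if token not in rare]

def encode(sentence, max_len):
    # replace out-of-vocabulary tokens by <unk> and pad to max_len
    tokens = [token if token in vocab else "<unk>" for token in sentence]
    return tokens + ["<pad>"] * (max_len - len(tokens))

max_len = max(len(sentence) for sentence in tokenized)
for sentence in tokenized:
    print(encode(sentence, max_len))
['charles', '<unk>', 'is', 'the', 'king', 'of', 'the', 'united', 'kingdom']
['queen', 'elizabeth', '<unk>', 'ruled', 'for', '70', 'years', '<pad>', '<pad>']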

In the next step each token in the vocabulary gets assigned an index.

<unk>       0
<pad>       1
charles     2
is          3
the         4
king        5
of          6
united      7
kingdom     8
queen       9
elizabeth   10
ruled       11
for         12
70          13
years       14

Next we replace all tokens in the sentences by the corresponding index.

2 0 3 4 5 6 4 7 8
9 10 0 11 12 13 14 1 1
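
Continuing the sketch from above (and reusing the vocab, encode, tokenized and max_len names introduced there), the index assignment and the replacement of tokens by their indices could be expressed like this.

# assign every token in the vocabulary an index ...
token_to_index = {token: index for index, token in enumerate(vocab)}

# ... and replace each token in the padded sentences by its index
encoded = [[token_to_index[token] for token in encode(sentence, max_len)]
           for sentence in tokenized]
print(encoded)
[[2, 0, 3, 4, 5, 6, 4, 7, 8], [9, 10, 0, 11, 12, 13, 14, 1, 1]]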

Theoretically we have already accomplished the task of turning words into numerical values, but using indices as input into the neural network is problematic, because those indices imply that there is a ranking between the words. The word with the index 2 would somehow be "higher" than the word with the index 1. Instead we create so-called one-hot vectors. These vectors have as many dimensions as there are tokens in the vocabulary. For the most part the vector consists of zeros, but at the index that corresponds to the word in the vocabulary the value is 1. As our vocabulary has a size of 15, we have access to 15 one-hot vectors.

0   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1   [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2   [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3   [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
4   [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
5   [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
6   [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
7   [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
8   [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
9   [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
10  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
11  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
12  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
13  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
14  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Our first sentence for example would correspond to a sequence of the following one-hot vectors.

2   [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0   [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
3   [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
4   [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
5   [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
6   [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
4   [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
7   [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
8   [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
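
In PyTorch such one-hot vectors do not have to be built by hand; for example, torch.nn.functional.one_hot turns a tensor of indices into exactly this representation. This is shown here only to illustrate the concept, since we will not feed one-hot vectors into the model directly.

import torch
import torch.nn.functional as F

indices = torch.tensor([2, 0, 3, 4, 5, 6, 4, 7, 8])
one_hot = F.one_hot(indices, num_classes=15)
print(one_hot.shape)
torch.Size([9, 15])
print(one_hot[0])
tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])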

While we have managed to turn our sentences into vectors, this is not the final step. One-hot vectors are problematic, because the dimensionality of the vectors grows with the size of the vocabulary. We might deal with a vocabulary of 30,000 words, which will produce vectors of size 30,000. If we input those vectors directly into a recurrent neural network, the computation will become intractable.

Instead we first turn the one-hot representation into a dense representation of lower dimensionality. Those vectors are called embeddings.

For example, those embeddings might look like the following vectors. We turn 15-dimensional sparse vectors into 4-dimensional dense vectors.

0   [0.99, 0.12, 0.03, 0.52]
1   [0.87, 0.38, 0.28, 0.46]
2   [0.53, 0.86, 0.37, 0.80]
3   [0.47, 0.63, 0.16, 0.22]
4   [0.19, 0.63, 0.52, 0.97]
5   [0.42, 0.78, 0.75, 0.61]
6   [0.62, 0.94, 0.74, 0.48]
7   [0.70, 0.80, 0.55, 0.72]
8   [0.35, 0.09, 0.94, 0.70]
9   [0.79, 0.33, 0.60, 0.41]
10  [0.48, 0.99, 0.78, 0.03]
11  [0.98, 0.89, 0.00, 0.26]
12  [0.34, 0.94, 0.57, 0.78]
13  [0.93, 0.55, 0.49, 0.31]
14  [0.43, 0.23, 0.47, 0.14]

Theoretically such a word embedding matrix can be trained using a fully connected layer. Assuming that we have a 5-dimensional one-hot vector and that we want to turn it into a 2-dimensional word embedding, we define the word embedding matrix as trainable weights of the corresponding size.

$$\mathbf{E} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \\ w_{41} & w_{42} \\ w_{51} & w_{52} \end{bmatrix}$$

When we want to obtain the embedding for a corresponding one-hot vector, we multiply the two.

$$\begin{bmatrix} 0 & 0 & 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \\ w_{41} & w_{42} \\ w_{51} & w_{52} \end{bmatrix} = \begin{bmatrix} w_{31} & w_{32} \end{bmatrix}$$

The multiplication will select the correct row and result in a 2-dimensional vector that can be used as an input into our sequence model. This operation will be tracked by the autograd package and those weights will be updated over time, optimizing the embedding representation.
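
A small sketch of this idea in PyTorch, with a randomly initialized 5 x 2 weight matrix standing in for the trainable embedding matrix:

import torch

# a trainable 5 x 2 embedding matrix, as in the example above
weights = torch.rand(5, 2, requires_grad=True)

# the one-hot vector for the word with index 2
one_hot = torch.tensor([0., 0., 1., 0., 0.])

# the multiplication selects the third row of the matrix
embedding_vector = one_hot @ weights
print(torch.allclose(embedding_vector, weights[2]))
True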

In practice all major frameworks have a dedicated embedding layer that performs this operation via a lookup. Instead of actually using a matrix multiplication, this layer simply selects the row of the embedding matrix that corresponds to the index of the word. This is just a more efficient approach, but the result of the computation is the same.
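
The sketch below (assuming a tiny vocabulary of size 5 and an embedding dimension of 2) compares the lookup performed by nn.Embedding with the explicit one-hot matrix multiplication; both yield the same vector.

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(num_embeddings=5, embedding_dim=2)

index = torch.tensor([2])
one_hot = F.one_hot(index, num_classes=5).float()

via_lookup = embedding(index)                # lookup of row 2
via_matmul = one_hot @ embedding.weight      # one-hot matrix multiplication
print(torch.allclose(via_lookup, via_matmul))
True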

The nn.Embedding layer from PyTorch has two positional arguments: the first corresponds to the size of the vocabulary and the second corresponds to the dimension of the embedding vector.

import torch
import torch.nn as nn

vocabulary_size = 10
embedding_dim = 4
batch_size = 5
seq_len = 3

We assume that we have 5 sentences, each consisting of 3 words.

sequence = torch.randint(low=0, 
                         high=vocabulary_size, 
                         size=(batch_size, seq_len))
print(sequence.shape)
torch.Size([5, 3])

The embedding maps directly from one of the ten indices to the 4-dimensional embedding, so there is no need to create one-hot encodings in PyTorch.

embedding = nn.Embedding(num_embeddings=vocabulary_size, 
                         embedding_dim=embedding_dim)
print(embedding.weight.shape)
torch.Size([10, 4])
embeddings = embedding(sequence)
print(embeddings.shape)
torch.Size([5, 3, 4])
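
The resulting tensor of shape (batch_size, seq_len, embedding_dim) can then be passed on to a recurrent layer. As a small illustration (the hidden size of 8 is an arbitrary choice, and we assume batch_first=True so the batch dimension comes first):

rnn = nn.RNN(input_size=embedding_dim, hidden_size=8, batch_first=True)
output, hidden = rnn(embeddings)
print(output.shape)
torch.Size([5, 3, 8])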