Bahdanau Attention
In order to understand modern attention architectures, it makes sense to study the historical context in which they were developed and the problems the new designs tried to solve. For that purpose, let's revisit the encoder-decoder architecture from the last chapter and figure out in what regard this design might be problematic.
Let's imagine we are trying to solve a translation task with an encoder-decoder architecture. The encoder takes the sequence in the original language as input and returns a single vector, marked as h_4. In other words, the whole meaning of the original sentence is compressed into a single vector.
The decoder uses this hidden vector and the previously generated words as input and generates the translation one word at a time. As the hidden vector moves through the decoder it gets modified and the original meaning of the sentence gets more and more diluted. By the time this vector arrives at the end of the decoder, hardly anything is left of the original input and the translation quality suffers.
In order to tackle the above problem, Dzmitry Bahdanau and his colleagues developed the so-called Bahdanau attention[1]. The authors had a simple yet powerful idea: take all outputs from the encoder as inputs into the decoder at each step of the decoding process, thus reducing the information bottleneck that results from relying on a single vector. Obviously, not all encoder outputs are equally relevant at each step. So at each step of the decoding (translation) process the decoder pays attention to certain parts of the encoder outputs and weighs each accordingly. The weighted sum is eventually used as the input into the decoder network. In this section we will refer to this variable as the context c_i.
There might be different strategies to use the context as an input to the decoder. In our implementation we simply concatenate the previous decoder output with the context and use that as the input.
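Below is a minimal PyTorch-style sketch of this concatenation step. It is not the original implementation; the tensor shapes and variable names are illustrative assumptions:

```python
import torch

# Illustrative shapes: batch size 1, embedding dim 256, hidden dim 512
y_prev = torch.randn(1, 256)  # embedding of the previously generated word
c_i = torch.randn(1, 512)     # context vector for the current decoding step

# Concatenate along the feature dimension; the result is fed to the decoder RNN
decoder_input = torch.cat([y_prev, c_i], dim=1)  # shape: (1, 768)
```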
In order to calculate the context we have to take a series of steps. In the very first step we calculate the so-called energy, e_{ij} = a(s_{i-1}, h_j). At each decoding/translation step we measure the energy between the previously generated hidden state of the decoder s_{i-1} and each of the encoder outputs. The energy measures the strength of the connection between an encoder output h_j and the previous decoder hidden state s_{i-1}. Given that we have 4 encoder outputs, if we want to generate the energies needed for the s_2 state, we calculate the following energies: e_{21} = a(s_{1}, h_1), e_{22} = a(s_{1}, h_2), e_{23} = a(s_{1}, h_3) and e_{24} = a(s_{1}, h_4). The higher the energy, the higher the attention to that particular encoder output is going to be. The function a is implemented as a small neural network that is trained jointly with the other parts of the whole encoder-decoder architecture. Calculating the actual attention weight \alpha_{ij} for the decoder step i towards the encoder output j is just a matter of feeding the energies into the softmax function: \alpha_{ij} = \exp(e_{ij}) / \sum_k \exp(e_{ik}).
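As a sketch of how a might look in code, here is the additive form from the paper, e_{ij} = v^T \tanh(W s_{i-1} + U h_j), written in PyTorch. The class name, layer sizes and variable names are our own choices, not the authors' code:

```python
import torch
import torch.nn as nn

class Alignment(nn.Module):
    """Additive alignment model a(s_{i-1}, h_j): a small feed-forward network."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim, bias=False)  # projects s_{i-1}
        self.U = nn.Linear(enc_dim, attn_dim, bias=False)  # projects each h_j
        self.v = nn.Linear(attn_dim, 1, bias=False)        # collapses to a scalar energy

    def forward(self, s_prev, enc_outputs):
        # s_prev: (batch, dec_dim); enc_outputs: (batch, seq_len, enc_dim)
        # Broadcast s_prev over the sequence dimension and score every h_j at once
        energies = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(enc_outputs)))
        return energies.squeeze(-1)  # (batch, seq_len), one energy e_{ij} per h_j

align = Alignment(dec_dim=512, enc_dim=512, attn_dim=256)
s_prev = torch.randn(1, 512)          # s_{i-1}
enc_outputs = torch.randn(1, 4, 512)  # h_1 .. h_4
e = align(s_prev, enc_outputs)        # energies e_{i1} .. e_{i4}
alpha = torch.softmax(e, dim=-1)      # attention weights, summing to 1
```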
Finally we use the attention weights to calculate the weighted sum of encoder outputs, the context vector c_i = \sum_j \alpha_{ij} h_j, which is used as the input into the decoder together with the previously generated word token.
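Continuing the sketch above, the context is just a batched matrix product of the attention weights with the encoder outputs:

```python
# alpha: (batch, seq_len) attention weights; enc_outputs: (batch, seq_len, enc_dim)
# Weight each encoder output h_j by alpha_{ij} and sum over the sequence
c_i = torch.bmm(alpha.unsqueeze(1), enc_outputs).squeeze(1)  # shape: (batch, enc_dim)
```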