Recurrent Neural Networks
Let's start this section by comparing and contrasting a plain vanilla feedforward neural network with a recurrent neural network.
Let's assume for the moment that we are dealing with a single neuron that receives a single input. This input could for example be the current temperature, and our prediction is the temperature for the next day. The feedforward neural network processes the input and generates the output. Once the input has left the neuron, it is forgotten: the neuron has no memory.
When the model is dealing with sequences, it should probably remember at least some parts of the previous inputs. The meaning of a sentence, for example, depends on the understanding of the whole sentence and not a single word. A similar argument can be made for the prediction of the temperature. It would probably be useful for the model to remember the temperature of the previous couple of days. We could try to circumvent the problem by adding additional neurons. Two neurons, for example, could be used to represent the temperature from the previous day and the day before that. The output of the two neurons would be passed to the next layer.
The above approach does not completely solve the problem though. Many sequences have a variable length. The length of a sentence that we would like to translate, for example, can vary dramatically. We need a more flexible system. A recurrent neural network offers a way out.
A recurrent neural network (often abbreviated as RNN) processes a sequence one piece at a time. At each time step the neuron takes a part of the sequence and its own output from the previous time step as input. In the very first time step there is no output from a previous step, so it is common to use 0 instead.
Below, for example, we are dealing with a sequence of size 4. This could for example be temperature measurements from the 4 previous days. Once the sequence is exhausted, the output is sent to the next unit, for example the next recurrent layer.
When you start to study recurrent neural networks, you might encounter a specific visual notation for RNNs, similar to the one below. This notation represents a recurrent neural network as a self-referential unit.
As the neural network has to remember the output from the previous run, you can say that it possesses a type of memory. Such a unit is therefore often called a memory cell or simply a cell.
So far we have used only two numbers as inputs into an RNN: the current sequence value and the previous output. In reality the cell works with vectors, just like a feedforward neural network. Below, for example, the unit takes four inputs: two come from the current part of the sequence and two from the previous output.
We can unroll the recurrent neural network through time. Taking the example with a four-part sequence from before, the unrolled network will look as follows.
While the unrolled network looks like it consists of four units, you shouldn't forget that we are dealing with the same layer. That means that each of the boxes in the middle has the same weights and the same bias. At this point we also make a distinction between the outputs \mathbf{y}_t and the hidden values \mathbf{h}_t. For the time being there is no difference between the hidden values and the outputs, but we will see shortly that there might be.
We use two sets of weights to calculate the hidden value \mathbf{h}_t: the weights \mathbf{W}_h to process the previous hidden values and the weights \mathbf{W}_x to process the sequence. The hidden value is therefore calculated as \mathbf{h}_t = f(\mathbf{h}_{t-1}\mathbf{W}_h^T + \mathbf{x}_t \mathbf{W}_x^T + b). The activation function that is used most commonly with recurrent neural networks is tanh. Because we reuse the very same weights for the whole sequence, weights larger than 1 would lead to exploding gradients, which is why a saturating activation function is preferred. On the other hand, a long sequence like a sentence or a book, which can consist of hundreds of steps, will cause vanishing gradients. We will look into ways of dealing with those problems in the next sections.
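To make the recurrence more concrete, here is a minimal sketch that implements the formula above by hand for a single sequence. All dimensions and tensor values are made up purely for illustration and are not part of the example that follows.
import torch

# a minimal sketch of the recurrence h_t = tanh(h_{t-1} W_h^T + x_t W_x^T + b)
input_size, hidden_size, seq_len = 6, 3, 4

# the same weights and bias are reused at every time step
W_x = torch.randn(hidden_size, input_size)
W_h = torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)

x = torch.randn(seq_len, input_size)  # one sequence, no batch dimension
h = torch.zeros(hidden_size)          # the initial hidden state is set to zeros

for t in range(seq_len):
    h = torch.tanh(h @ W_h.T + x[t] @ W_x.T + b)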
We will not go over the whole process of backpropagation for recurrent neural networks, called backpropagation through time. Still, we will give you an intuition for how you might approach calculating gradients for an RNN. In essence, backpropagation for fully connected neural networks and RNNs is not different. We can use automatic differentiation the same way we did in the previous chapters. When you unroll a recurrent neural network, each part of the sequence is processed by the same weights, and the gradients for those weights are accumulated in the process. Once the whole sequence is exhausted, we can run backpropagation and apply gradient descent.
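The sketch below hints at this idea with made-up dimensions: the loss is computed only after the whole sequence has been processed, and a single call to backward() accumulates the gradient contributions of every time step into the same shared weights.
import torch

# a rough sketch of backpropagation through time with shared, trainable weights
input_size, hidden_size, seq_len = 6, 3, 4
W_x = torch.randn(hidden_size, input_size, requires_grad=True)
W_h = torch.randn(hidden_size, hidden_size, requires_grad=True)
b = torch.zeros(hidden_size, requires_grad=True)

x = torch.randn(seq_len, input_size)
h = torch.zeros(hidden_size)
for t in range(seq_len):
    h = torch.tanh(h @ W_h.T + x[t] @ W_x.T + b)

# a dummy loss on the final hidden state; backward() walks back through
# all time steps and accumulates the gradients into W_x, W_h and b
loss = h.sum()
loss.backward()
print(W_h.grad.shape)  # torch.Size([3, 3])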
Often we will want to create several recurrent layers. In that case the hidden outputs of one layer are used as the inputs to the next layer.
This time around there is a distinction between the outputs and the hidden values. We regard the outputs as the hidden values from the very last layer.
In PyTorch we can use either the nn.RNN module or the nn.RNNCell module. Both can be used to achieve the same goal, but nn.RNN unrolls the neural network automatically, while nn.RNNCell needs to be applied to each part of the sequence manually. Often it is more convenient to simply use nn.RNN, but some more complex architectures will require us to use nn.RNNCell.
# number of samples in our dataset
batch_size=4
# sequence length represents for example the number of words in a sentence
sequence_length=5
# dimensionality of each input in the sequence
# so each value in the sequence is a vector of length 6
input_size=6
# the output dimension of each RNN layer
hidden_size=3
# number of recurrent layers in the network
num_layers=2
A recurrent neural network in PyTorch expects an input of shape (sequence length, batch size, input size) by default. If you set the parameter batch_first to True, then you must provide an input of shape (batch size, sequence length, input size). For now we will use the default behaviour, but in some future examples it will be convenient to set this parameter to True.
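As a quick illustration of the difference, the sketch below (reusing the dimensions defined above, but not used in the rest of the example) creates a second network with batch_first=True. Only the layout of the input and output tensors changes; the initial hidden state keeps the same shape in both cases.
# a quick sketch of the batch_first=True variant
rnn_bf = nn.RNN(input_size=input_size,
                hidden_size=hidden_size,
                num_layers=num_layers,
                batch_first=True)
seq_bf = torch.randn(batch_size, sequence_length, input_size)
# the initial hidden state keeps the layout (num_layers, batch_size, hidden_size)
h_0_bf = torch.zeros(num_layers, batch_size, hidden_size)
with torch.inference_mode():
    out_bf, h_n_bf = rnn_bf(seq_bf, h_0_bf)
print(out_bf.shape)  # torch.Size([4, 5, 3])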
We create a module and generate two tensors: the first is our dummy sequence and the second is the initial value for the hidden state.
rnn = nn.RNN(input_size=input_size,
             hidden_size=hidden_size,
             num_layers=num_layers,
             nonlinearity='tanh')
sequence = torch.randn(sequence_length, batch_size, input_size)
h_0 = torch.zeros(num_layers, batch_size, hidden_size)
The recurrent network generates two outputs. The output tensor corresponds to the \mathbf{y} values from the diagrams above. We get an output vector of dimension 3 for each of the 5 values in the sequence and each of the 4 samples in the batch, therefore the output dimension is (5, 4, 3). The h_n tensor contains the last hidden values for all layers. This corresponds to the \mathbf{h}_n values in the diagram above. Given that we have 2 layers, 4 samples in the batch and hidden units of dimension 3, the dimensionality is (2, 4, 3).
with torch.inference_mode():
    output, h_n = rnn(sequence, h_0)
print(output.shape, h_n.shape)
torch.Size([5, 4, 3]) torch.Size([2, 4, 3])
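As a small sanity check (not part of the original walkthrough), for this unidirectional network the last time step of output should match the final hidden state of the top layer stored in h_n:
# output[-1] holds the last time step of the top layer,
# h_n[-1] holds the final hidden state of the top layer
print(torch.allclose(output[-1], h_n[-1]))  # True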
When we want to have more control over the learning process, we might need to resort to nn.RNNCell. Each such cell represents a recurrent layer, so if you want to use more layers, you have to create more cells.
cell = nn.RNNCell(input_size=input_size,
                  hidden_size=hidden_size,
                  nonlinearity='tanh')
sequence = torch.randn(sequence_length, batch_size, input_size)
h_t = torch.zeros(batch_size, hidden_size)
This time we loop over the sequence manually, always using the last hidden state as the input in the next iteration.
with torch.inference_mode():
    for t in range(sequence_length):
        h_t = cell(sequence[t], h_t)
print(h_t.shape)
torch.Size([4, 3])
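To mirror the two-layer nn.RNN from before with cells, one possible sketch is to stack two nn.RNNCell modules and feed the hidden state of the first cell into the second at every time step. This is just an illustration of the idea; the exact bookkeeping may differ in real architectures.
# a sketch of two stacked recurrent layers built from nn.RNNCell
cell_1 = nn.RNNCell(input_size=input_size, hidden_size=hidden_size)
cell_2 = nn.RNNCell(input_size=hidden_size, hidden_size=hidden_size)

h_1 = torch.zeros(batch_size, hidden_size)
h_2 = torch.zeros(batch_size, hidden_size)

with torch.inference_mode():
    for t in range(sequence_length):
        # the hidden state of the first layer is the input to the second layer
        h_1 = cell_1(sequence[t], h_1)
        h_2 = cell_2(h_1, h_2)
print(h_2.shape)  # torch.Size([4, 3])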