Minimizing MSE
Single Training Sample
Let us remind ourselves that our goal is to minimize the mean squared error by tweaking the weight vector \mathbf{w} and the bias scalar b using gradient descent.
MSE = \dfrac{1}{n}\sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)})^2
\hat{y}^{(i)} = \mathbf{x}^{(i)} \mathbf{w}^T + b
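To make the formulas concrete, here is a minimal sketch that evaluates the predictions and the mean squared error with NumPy. The three samples and the parameter values are made up purely for illustration.

import numpy as np

# three made-up samples with two features each
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [2.0, 2.0]])
y = np.array([5.0, 6.0, 6.0])

w = np.array([1.0, 1.5])  # weight vector
b = 0.5                   # bias scalar

y_hat = X @ w + b              # predictions for all samples at once
mse = np.mean((y - y_hat)**2)  # average of the squared errors
print(mse)                     # 0.5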
The computation of gradients in the deep learning world relies heavily on the chain rule. If we have a composite function z(y(x)), we can calculate the derivative of z with respect to x by applying the chain rule.
\dfrac{dz}{dx} = \dfrac{dz}{dy} \dfrac{dy}{dx} = \dfrac{dz}{\xcancel{dy}} \dfrac{\xcancel{dy}}{dx} = \dfrac{dz}{dx}

The mean squared error (y - [xw + b])^2 is a great example of a composite function. We start by calculating the scaled feature s, where s = w \times x. We use s as an input to calculate the linear regression prediction, \hat{y} = s + b. Using \hat{y} as input, we can calculate the error e as the difference between the true target y and the prediction, e = y - \hat{y}. Finally we calculate the mean squared error for one single training sample, mse = e^2.
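Written out in code, the decomposition looks as follows. As a sketch we use the same values that appear in the computational graph example below (a feature of 5, a weight of 1 and a bias of 1), together with an assumed target of 20.

x, w, b, y = 5.0, 1.0, 1.0, 20.0  # single training sample

s = w * x      # scaled feature
y_hat = s + b  # linear regression prediction
e = y - y_hat  # error
mse = e**2     # squared error for a single sample
print(s, y_hat, e, mse)  # 5.0 6.0 14.0 196.0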
If we utilize the chain rule, we can calculate the derivatives of the mean squared error with respect to the bias b and the weight w by multiplying the intermediary derivatives.
For didactic reasons we are focusing on linear regression with a single feature, but for the most part the calculation with more than one feature would be the same.

\dfrac{\partial mse}{\partial w_j} = \dfrac{\partial mse}{\partial e} \times \dfrac{\partial e}{\partial \hat{y}} \times \dfrac{\partial \hat{y}}{\partial s} \times \dfrac{\partial s}{\partial w_j}

In that case we have to calculate as many partial derivatives as there are features. Calculating those intermediary derivatives is relatively straightforward using basic rules of calculus.
Using the chain rule we can easily calculate the derivatives with respect to the weight and bias.
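For our single-feature example the intermediary derivatives are easy to write down, and their product yields the gradients directly.

\dfrac{\partial mse}{\partial e} = 2e \qquad \dfrac{\partial e}{\partial \hat{y}} = -1 \qquad \dfrac{\partial \hat{y}}{\partial s} = \dfrac{\partial \hat{y}}{\partial b} = 1 \qquad \dfrac{\partial s}{\partial w} = x

\dfrac{\partial mse}{\partial w} = 2e \times (-1) \times 1 \times x = -2(y - \hat{y})x

\dfrac{\partial mse}{\partial b} = 2e \times (-1) \times 1 = -2(y - \hat{y})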
Once we have the gradients, the gradient descent algorithm works as expected.
\mathbf{w}_{t+1} \coloneqq \mathbf{w}_t - \alpha \nabla_{\mathbf{w}} MSE

b_{t+1} \coloneqq b_t - \alpha \dfrac{\partial MSE}{\partial b}
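Putting the update rules and the derivatives from above together, a minimal sketch of gradient descent for the single-feature, single-sample case might look like this. The sample is the same toy example as before, and the learning rate is chosen arbitrarily.

x, y = 5.0, 20.0  # single training sample
w, b = 1.0, 1.0   # initial parameters
alpha = 0.01      # learning rate

for _ in range(100):
    y_hat = w * x + b      # forward: prediction
    e = y - y_hat          # error
    grad_w = -2.0 * e * x  # d(mse)/dw via the chain rule
    grad_b = -2.0 * e      # d(mse)/db via the chain rule
    w -= alpha * grad_w    # gradient descent updates
    b -= alpha * grad_b

print((y - (w * x + b))**2)  # the mse approaches 0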
Computational Graph
We have mentioned before that the chain rule plays an integral role in deep learning, but what we have covered so far has probably not made it clear why it is so essential.
We can construct a so-called computational graph and use the chain rule for automatic differentiation.
So let's construct the computational graph for the mean squared error step by step and see what automatic differentiation looks like.
A computational graph basically tracks all atomic calculations and their results in a tree-like structure. We start the calculation of MSE by creating the weight node w and the feature node x, but at that point those two variables are not connected to each other yet. Once we calculate w \times x, we connect the two and the result is a new node that lies on a higher level.
The yellow box represents the operation (like an addition or a multiplication), the blue boxes contain the values of the nodes, while the red boxes contain the derivatives. We assume a feature of 5 and a weight of 1 at the beginning, so the resulting node has the value of 5. At this point in time we have not calculated any derivatives yet, so all derivatives amount to 0 for now.
Next we create the bias node b with the initial value of 1 and add the node to the previously calculated node. Once again those nodes are reflected in the computational graph.
We keep creating new nodes and connecting them to previously generated nodes, and the graph keeps growing. We do that until we reach the final node, in our case the mean squared error.
Once we have created our computational graph, we can start calculating the gradients by applying the chain rule. This time around we start with the last node and go all the way down to the nodes we would like to adjust: the weight and the bias.
Remember that we want to calculate the derivatives of MSE with respect to the weight and the bias. To achieve that, we need to calculate the intermediary derivatives and apply the chain rule along the way. First we calculate the derivative of MSE with respect to itself. Basically we are asking ourselves: how much does the MSE change when we increase the MSE by 1? As expected, the result is just 1.
This might seem like an unnecessary step, but the chain rule in the computational graph relies on multiplying intermediary derivatives by the derivative of the node above, and if we left the value at 0, the algorithm would not work.
Next we go a level below and calculate the local derivative of MSE with respect to the error. The derivative of e^2 is just 2 \times e, which amounts to 28. We apply the chain rule and multiply 28 by the derivative value of the node above (which is 1) and end up with 28.00.
We keep going down and calculate the next intermediary derivatives. This time around we face two nodes: the target node and the prediction node, where Error = Target - Prediction. The intermediary derivative of the error node with respect to the target is just 1; the intermediary derivative of the error with respect to the prediction node is -1. Multiplying the intermediary derivatives by the derivative from the node above yields 28 and -28 respectively. These are the derivatives of the mean squared error with respect to the target and the prediction.
If we continue doing these calculations we will eventually end up with the below graph.
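To double-check the numbers, we can work through the whole chain by hand, using the values from above (a feature of 5, a weight of 1, a bias of 1 and the target of 20 that is implied by the error of 14).

\hat{y} = 1 \times 5 + 1 = 6 \qquad e = 20 - 6 = 14 \qquad mse = 14^2 = 196

\dfrac{\partial mse}{\partial \hat{y}} = 28 \times (-1) = -28 \qquad \dfrac{\partial mse}{\partial b} = -28 \times 1 = -28 \qquad \dfrac{\partial mse}{\partial w} = -28 \times 1 \times 5 = -140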
Once we have the gradients, we can apply the gradient descent algorithm. The below example iterates between constructing the graph, calculating the gradients and applying gradient descent, eventually leading to a mean squared error of 0.
The best part about the calculations above is that we do not have to track the nodes manually or calculate the gradients ourselves. The connections between the nodes happen automatically: when we apply any operation to these node objects, the computational graph is adjusted to reflect it. The same goes for the calculation of the gradients, thus the name autodiff. Because the local node operations are usually very common ones, like additions and multiplications, their intermediary derivatives are well known. This behaviour is implemented in all deep learning packages, so we do not need to think about these things explicitly.
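To make this less abstract, below is a deliberately minimal sketch of how such node objects might be implemented in Python. The Node class and its methods are our own illustration, not the API of any particular package, and they only support the handful of operations that MSE needs.

class Node:
    # a scalar value that remembers how it was computed
    def __init__(self, value, children=(), local_grads=()):
        self.value = value
        self.grad = 0.0                  # derivative of the final node w.r.t. this node
        self._children = children        # the nodes this node was computed from
        self._local_grads = local_grads  # intermediary derivatives w.r.t. each child

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        return Node(self.value - other.value, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other), (other.value, self.value))

    def square(self):
        return Node(self.value**2, (self,), (2.0 * self.value,))

    def backward(self, upstream=1.0):
        # chain rule: local derivative times the derivative of the node above;
        # gradients are accumulated, so shared nodes sum their incoming signals
        self.grad += upstream
        for child, local in zip(self._children, self._local_grads):
            child.backward(upstream * local)

# forward pass: the graph is built implicitly by the operations
x, w, b, y = Node(5.0), Node(1.0), Node(1.0), Node(20.0)
mse = (y - (x * w + b)).square()

# backward pass: propagate the gradients from mse down to w and b
mse.backward()
print(mse.value, w.grad, b.grad)  # 196.0 -140.0 -28.0

Note that the recursive backward call is only adequate here because our graph is a tree; a full implementation would visit the nodes in topological order so that every node propagates its gradient exactly once.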
There are several more advantages to utilizing a computational graph. For one, it allows us to easily calculate the gradients of arbitrary expressions, even if the expressions are extremely complex. As you can imagine, neural networks are such expressions, but if you only think in terms of local node gradients and the gradients that come from the node above, the calculations are rather simple to reason about. Secondly, we can often reuse gradient calculations, as they serve as inputs in multiple locations. For example, the prediction node derivative is passed down to both the bias and the scaled value, so we only need to calculate it once. This is especially important for neural networks, as it allows us to distribute gradients efficiently, saving a lot of computation time.
At this point we should mention that in deep learning the graph construction phase and the gradient calculation phase have distinct names that you will hear over and over again. The so-called forward pass is essentially the graph construction phase. In the backward pass we propagate the gradients from the mean squared error to the weights and the bias using automatic differentiation. This algorithm of finding the gradients of individual nodes by utilizing the chain rule and a computational graph is also called backpropagation. Backpropagation is the bread and butter of modern deep learning.
Multiple Training Samples
As it turns out, making the jump from one to several samples is not that complicated. Let's for example assume that we have two samples. This creates two branches in our computational graph.
Branch nr. 2, which amounts to 196.0, is essentially the same path that we calculated above. For the other branch we have to do the same calculations, but using the features from the other sample. To finish the calculation of the mean squared error, we sum up the two branches and calculate the average. When we initiate the backward pass, the gradients are propagated into the individual branches. You should remember that the same weight and the same bias exist in the different branches. That means that those parameters receive different gradient signals, and in that case the gradients are accumulated through a sum. This is the same as saying: "the derivative of a sum is the sum of derivatives". You should also observe that the mean squared error scales each of the gradient signals by the number of samples that we use for training. If we have two samples, each of the gradients is divided by 2 (or multiplied by 0.5).
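Expressed as a formula, this accumulation and scaling looks as follows for the weight (and analogously for the bias): the derivative of the mean is the mean of the branch derivatives.

\dfrac{\partial MSE}{\partial w} = \dfrac{\partial}{\partial w} \dfrac{1}{n} \sum_{i=1}^n \big(e^{(i)}\big)^2 = \dfrac{1}{n} \sum_{i=1}^n \dfrac{\partial \big(e^{(i)}\big)^2}{\partial w}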
So the calculations remain very similar: construct the graph and distribute the gradients to the weight and bias using automatic differentiation. When we use the procedures described above it does not make a huge difference whether we use 1, 4 or 100 samples.
To convince ourselves that automatic differentiation actually works, below we present the example from the last section, that we solve by implementing a custom autodiff package in JavaScript.
Autograd
As mentioned before, in modern deep learning we do not iterate over individual samples to construct a graph, but work with tensors to parallelize the computations. When we utilize any of the modern deep learning packages and use the provided tensor objects, we get parallelisation and automatic differentiation out of the box. We do not need to explicitly construct a graph and make sure that all nodes are connected. PyTorch, for example, has a built-in automatic differentiation library called autograd, so let's see how we can utilize the package.
import torch
import sklearn.datasets as datasets
We start by creating the features \mathbf{X} and the labels \mathbf{y}.
X, y = datasets.make_regression(n_samples=100, n_features=2, n_informative=2, noise=0.01)
We transform the generated numpy arrays into PyTorch Tensors. For the labels Tensor we use the unsqueeze(dim) method. This adds an additional dimension, transforming the labels from a (100,) into a (100, 1) dimensional Tensor. This makes sure that the predictions that are generated in the forward pass and the actual labels have identical dimensions.
X = torch.from_numpy(X).to(torch.float32)
y = torch.from_numpy(y).to(torch.float32).unsqueeze(1)
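If the effect of unsqueeze(dim) is hard to picture, a quick standalone check makes it visible (the tiny tensor is made up for illustration).

t = torch.tensor([1.0, 2.0, 3.0])
print(t.shape)               # torch.Size([3])
print(t.unsqueeze(1).shape)  # torch.Size([3, 1])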
We define the init_weights() function, which initializes the weights and the bias randomly using the standard normal distribution. This time around we set the requires_grad property to True in order to track the gradients. We didn't do that for the features and the label tensors, as those are fixed and should not be adjusted.
def init_weights():
    w = torch.randn(1, 2, requires_grad=True)
    b = torch.randn(1, 1, requires_grad=True)
    return w, b
w, b = init_weights()
For the sake of making our explanations easier, let us introduce the print_all() function. This function makes use of two important properties that each Tensor object possesses: the data and the grad property. Those properties are probably self-explanatory: data contains the actual values of a tensor, while grad contains the gradients with respect to each value in the data list.
def print_all():
    print(f'Weight: {w.data}, Grad: {w.grad}')
    print(f'Bias: {b.data}, Grad: {b.grad}')
print_all()
Weight: tensor([[-0.6779, 0.4228]]), Grad: None
Bias: tensor([[0.2107]]), Grad: None
When we print the data and the grad right after initializing the tensors, the objects possess randomized values, but the gradients amount to None.
def forward(w, b):
    y_hat = X @ w.T + b  # predictions for all 100 samples
    return ((y - y_hat)**2).sum() / 100.0  # mean squared error
mse = forward(w, b)
Even when we calculate the mean squared error by running through the forward pass, the gradients remain empty.
print_all()
Weight: tensor([[-0.6779, 0.4228]]), Grad: None
Bias: tensor([[0.2107]]), Grad: None
To actually run the backward pass, we have to call the backward()
method on the loss function. The gradients are always based on the tensor that
initiated the backward pass. So if we run the backward pass on the mean squared
error tensor, the gradients tell us how we should shift the weights and the bias
to reduce the loss. This is exactly what we are looking for.
mse.backward()
print_all()
Weight: tensor([[-0.6779, 0.4228]]), Grad: tensor([[-102.5140, -98.1595]])
Bias: tensor([[0.2107]]), Grad: tensor([[4.4512]])
If we run the forward and the backward passes again, you will notice that the weight and the bias gradients are twice as large. Each time we calculate the gradients, they are accumulated; the old gradient values are not erased, as one might assume.
mse = forward(w, b)
mse.backward()
print_all()
Weight: tensor([[-0.6779, 0.4228]]), Grad: tensor([[-205.0279, -196.3191]])
Bias: tensor([[0.2107]]), Grad: tensor([[8.9024]])
Each time we are done with a gradient descent step, we should clear the gradients. We can do that by using the zero_() method, which zeroes out the gradients in place.
w.grad.zero_()
print(w.grad)
tensor([[0., 0.]])
Below we show the full implementation of gradient descent. Most of the implementation was already discussed before, but the context manager torch.inference_mode() might be new to you. It tells PyTorch not to include the enclosed operations in the computational graph. The actual gradient descent step is not part of the forward pass and should therefore not be tracked.
lr = 0.1
w, b = init_weights()
for _ in range(10):
    # forward pass
    mse = forward(w, b)
    print(f'Mean squared error: {mse.data}')
    # backward pass
    mse.backward()
    # gradient descent
    with torch.inference_mode():
        w.data.sub_(w.grad * lr)
        b.data.sub_(b.grad * lr)
        w.grad.zero_()
        b.grad.zero_()
Mean squared error: 6125.7783203125
Mean squared error: 4322.3662109375
Mean squared error: 3054.9150390625
Mean squared error: 2162.31787109375
Mean squared error: 1532.5537109375
Mean squared error: 1087.494873046875
Mean squared error: 772.5029907226562
Mean squared error: 549.2703857421875
Mean squared error: 390.8788757324219
Mean squared error: 278.37469482421875
We iterate over the forward pass, the backward pass and the gradient descent step for 10 iterations, and the mean squared error decreases dramatically. This iterative process is called the training loop in deep learning lingo. We will encounter those loops over and over again.