Optimizers

In deep learning the specific variant of the gradient descent algorithm is called an optimizer. So far we have only looked at the plain vanilla optimizer called SGD, short for stochastic gradient descent.

optimizer = optim.SGD(model.parameters(), lr=0.01)
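For reference, a single training step with this optimizer usually follows the pattern below. This is only a sketch: the names model, criterion, inputs and targets are placeholders that are assumed to be defined elsewhere.

# reset the gradients that were accumulated during the previous step
optimizer.zero_grad()

# forward pass and loss calculation
outputs = model(inputs)
loss = criterion(outputs, targets)

# backpropagation fills the .grad attribute of every parameter
loss.backward()

# the optimizer uses those gradients to adjust the weights
optimizer.step()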

With each batch we use the backpropagation algorithm to calculate the gradient vector $\nabla_{\mathbf{w}} L$. The gradient descent optimizer directly subtracts the gradient, scaled by the learning rate $\alpha$, from the weight vector $\mathbf{w}$ without any further adjustments.

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla_{\mathbf{w}} L(\mathbf{w}_t)$$
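To make the update rule concrete, here is a minimal sketch with made-up toy tensors (the names w, grad and alpha are purely illustrative and not part of any PyTorch API):

import torch

w = torch.tensor([0.5, -1.2, 2.0])      # weight vector
grad = torch.tensor([0.1, -0.4, 0.3])   # gradient computed by backpropagation
alpha = 0.01                            # learning rate

# vanilla gradient descent: subtract the scaled gradient from the weights
w = w - alpha * grad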

As you can probably guess, this is not the only available approach, and it is far from the fastest. Other optimizers have been developed over time that generally converge much faster.

Momentum

The plain vanilla gradient descent algorithm lacks any form of memory. This optimizer only takes the gradient direction from the current batch into consideration and disregards any past gradient calculations.

When we use stochastic gradient descent with momentum on the other hand, we keep a moving average of the past gradient directions and combine that average with the current gradient to adjust the weights.

$$\mathbf{m}_t = \beta \mathbf{m}_{t-1} + (1 - \beta) \nabla_{\mathbf{w}} L(\mathbf{w}_t)$$

At each timestep $t$ we calculate the momentum vector $\mathbf{m}_t$ as a weighted average of the previous momentum $\mathbf{m}_{t-1}$ and the current gradient $\nabla_{\mathbf{w}} L(\mathbf{w}_t)$, where $\beta$ is usually around 0.9. As the initial momentum vector $\mathbf{m}_0$ is essentially empty, deep learning frameworks like PyTorch initialize the vector by setting the momentum to the actual gradient vector.

$$\mathbf{m}_1 = \nabla_{\mathbf{w}} L(\mathbf{w}_1)$$

When we apply gradient descent, we do not use the gradient vector $\nabla_{\mathbf{w}} L(\mathbf{w}_t)$ directly to adjust the weights of the neural network, but use momentum instead.

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \mathbf{m}_t$$
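The three equations above translate into a few lines of toy code. This is only a sketch with made-up tensors; PyTorch's built-in momentum implementation follows the same idea but differs in small bookkeeping details, such as how the current gradient is weighted.

import torch

w = torch.tensor([0.5, -1.2, 2.0])      # weight vector
grad = torch.tensor([0.1, -0.4, 0.3])   # current gradient
m = grad.clone()                        # momentum, initialized to the first gradient
alpha, beta = 0.01, 0.9

# weighted average of the previous momentum and the current gradient
m = beta * m + (1 - beta) * grad

# the weights are adjusted with the momentum vector, not the raw gradient
w = w - alpha * m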

In PyTorch we can use gradient descent with momentum by passing an additional argument to the SGD object.

optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

But why is momentum actually useful? Below we see the same example with the local minimum that we studied the first time we encountered gradient descent. The example showed that gradient descent will get stuck in a local minimum. Gradient descent with momentum on the other hand has a chance to escape the local minimum.

[Figure: gradient descent with momentum escaping the local minimum]

Even when we are dealing with a direct path towards the minimum, without any saddle points or local minima, the momentum optimizer will build up speed and converge faster. Below we compare the convergence speed of simple stochastic gradient descent and gradient descent with momentum. The momentum based approach arrives at the optimum faster.

[Figure: convergence comparison of Vanilla Gradient Descent and Gradient Descent With Momentum]

RMSProp

Adaptive optimizers, like RMSProp[1], do not adjust the speed per se, but determine a better direction for gradient descent. If we are dealing with an elongated, bowl-shaped loss function for example, the gradients will not be symmetrical. That means that we will approach the optimal value not in a direct line, but rather in a zig-zagging manner.

[Figure: zig-zagging descent path on an elongated loss surface]

We would like to move more in the x direction and less in the y direction, which would result in a straight line towards the optimum. Theoretically we could offset the zig-zag by using an individual learning rate for each of the weights, but given that there are millions of weights in modern deep learning models, this approach is not feasible. Adaptive optimizers scale each gradient in such a way that we approach the optimum in a much straighter line. These optimizers allow us to use a single learning rate for the whole neural network.

Similar to momentum, RMSProp (root mean squared prop) calculates a moving average, but instead of tracking the gradient, we track the element-wise squared gradient.

$$\mathbf{s}_t = \beta \mathbf{s}_{t-1} + (1 - \beta) \big(\nabla_{\mathbf{w}} L(\mathbf{w}_t)\big)^2$$

The root of this vector $\mathbf{s}_t$ is used to scale the gradient. This makes the gradient components similar in magnitude (which creates a straighter line), while still following the general gradient direction.

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\alpha}{\sqrt{\mathbf{s}_t} + \epsilon} \nabla_{\mathbf{w}} L(\mathbf{w}_t)$$

The $\epsilon$ variable is a very small positive number that is used to avoid division by 0.
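Written out as a toy sketch with made-up tensors, the two RMSProp equations look like this (PyTorch's RMSprop class implements the same idea, with additional options that are not shown here):

import torch

w = torch.tensor([0.5, -1.2, 2.0])      # weight vector
grad = torch.tensor([0.1, -0.4, 0.3])   # current gradient
s = torch.zeros(3)                      # moving average of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8

# track the element-wise squared gradient instead of the gradient itself
s = beta * s + (1 - beta) * grad**2

# scale each gradient component by the root of the moving average
w = w - alpha * grad / (torch.sqrt(s) + eps)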

Below we compare vanilla gradient descent, gradient descent with momentum and RMSProp on a loss function with an elongated form. While simple gradient descent and momentum gradient descent approach the optimum in a curved manner, RMSProp takes a basically straight route. Also notice that momentum can overshoot due to the speed it has gained and needs some time to reverse direction.

[Figure: optimization paths of Vanilla Gradient Descent, Gradient Descent With Momentum and RMSProp on an elongated loss surface]

The API for all optimizers in PyTorch is identical, so we can simply replace the SGD object with the RMSprop object and we are good to go.

optimizer = optim.RMSprop(model.parameters(), lr=0.01)

Adam

Adam[1] is the combination of momentum and adaptive learning. If you look at the equations below, you will not find any new concepts: we calculate moving averages of the gradients and of the squared gradients. The RMSProp-style scaling is not applied directly to the gradient vector; instead we scale the momentum vector and use the result to adjust the weights.

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \nabla_{\mathbf{w}} L(\mathbf{w}_t)$$

$$\mathbf{s}_t = \beta_2 \mathbf{s}_{t-1} + (1 - \beta_2) \big(\nabla_{\mathbf{w}} L(\mathbf{w}_t)\big)^2$$

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\alpha}{\sqrt{\mathbf{s}_t} + \epsilon} \mathbf{m}_t$$
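As a toy sketch with made-up tensors, the Adam update combines the two moving averages from above. Note that the published algorithm additionally bias-corrects m and s during the first steps, which this sketch omits for brevity.

import torch

w = torch.tensor([0.5, -1.2, 2.0])      # weight vector
grad = torch.tensor([0.1, -0.4, 0.3])   # current gradient
m = torch.zeros(3)                      # moving average of gradients (momentum)
s = torch.zeros(3)                      # moving average of squared gradients (RMSProp)
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

# momentum-style moving average of the gradient
m = beta1 * m + (1 - beta1) * grad

# RMSProp-style moving average of the squared gradient
s = beta2 * s + (1 - beta2) * grad**2

# scale the momentum vector, not the raw gradient, to adjust the weights
w = w - alpha * m / (torch.sqrt(s) + eps)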

Adam (and its derivatives) is probably the most widely used optimizer at this point in time. If you don't have any specific reason to use a different optimizer, use Adam.

We can implement the adam optimizer in PyTorch the following way.

optimizer = optim.Adam(model.parameters(), lr=0.01)

Notes

  1. RMSProp was developed by Geoffrey Hinton for a deep learning course on the Coursera platform. You can access the original materials at https://www.cs.toronto.edu/~hinton/. Lecture 6 is the relevant one.

References

  1. Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. (2014).