Optimizers
In deep learning the specific gradient descent algorithm is called an optimizer. So far we have only really looked at the plain vanilla gradient descent optimizer called SGD, short for stochastic gradient descent.
optimizer = optim.SGD(model.parameters(), lr=0.01)
With each batch we use the backpropagation algorithm to calculate the gradient vector \mathbf{\nabla}_w. The gradient descent optimizer directly subtracts the gradient, scaled by the learning rate \alpha, from the weight vector \mathbf{w} without any further adjustments.
\mathbf{w}_{t+1} := \mathbf{w}_t - \alpha \mathbf{\nabla}_w
As you can probably guess, this is neither the only nor the fastest approach available. Other optimizers have been developed over time that generally converge a lot faster.
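Before we move on to those, here is a toy sketch of a single plain SGD update done by hand; the weight tensor and the loss are made up for illustration, but conceptually this is what optimizer.step() does for plain SGD.
import torch

# toy weight tensor and stand-in loss, just to produce a gradient
w = torch.randn(3, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()          # backpropagation fills w.grad

lr = 0.01
with torch.no_grad():
    w -= lr * w.grad     # subtract the gradient scaled by the learning rate
w.grad.zero_()           # reset the gradient before the next batch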
Momentum
The plain vanilla gradient descent algorithm lacks any form of memory. This optimizer only takes the gradient direction from the current batch into consideration and disregards any past gradient calculations.
When we use stochastic gradient descent with momentum on the other hand, we keep a moving average of the past gradient directions and use that average, instead of the raw gradient of the current batch alone, to adjust the weights.
\mathbf{m}_t = \beta \mathbf{m}_{t-1} + (1 - \beta) \mathbf{\nabla}_w
At each timestep t we calculate the momentum vector \mathbf{m}_t as a weighted average of the previous momentum \mathbf{m}_{t-1} and the current gradient \mathbf{\nabla}_w, where \beta is usually around 0.9. As the initial momentum vector \mathbf{m}_0 is essentially empty, deep learning frameworks like PyTorch initialize the vector by setting the momentum to the actual gradient vector.
\mathbf{m}_0 = \mathbf{\nabla}_w
When we apply gradient descent, we do not use the gradient vector \mathbf{\nabla}_w directly to adjust the weights of the neural network, but use momentum instead.
\mathbf{w}_{t+1} := \mathbf{w}_t - \alpha \mathbf{m}_t
In PyTorch we can use gradient descent with momentum by passing an additional argument to the SGD object.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
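To tie the code back to the equations, the following toy sketch implements the momentum update exactly as written above; PyTorch's built-in implementation differs in minor details, so treat this as an illustration of the formulas, not the library internals.
import torch

# toy example: minimize f(w) = ||w||^2 using the momentum equations above
w = torch.tensor([3.0, -2.0], requires_grad=True)
beta, lr = 0.9, 0.1
m = None                 # momentum buffer

for _ in range(20):
    loss = (w ** 2).sum()
    loss.backward()
    grad = w.grad.clone()
    m = grad if m is None else beta * m + (1 - beta) * grad  # moving average
    with torch.no_grad():
        w -= lr * m      # the update uses momentum, not the raw gradient
    w.grad.zero_()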
But why is momentum actually useful? Below we see the same example with the local minimum that we studied the first time we encountered gradient descent. The example showed that gradient descent will get stuck in a local minimum. Gradient descent with momentum on the other hand has a chance to escape the local minimum.
Even when we are dealing with a direct path towards the minimum, without any saddle points or local minima, the momentum optimizer will build up acceleration and converge faster towards the minimum. Below we compare the convergence speed of plain stochastic gradient descent and momentum for x^2 + y^2. The momentum based approach arrives at the optimum faster.
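If you want to reproduce a comparison like this yourself, the following sketch minimizes x^2 + y^2 with plain SGD and with momentum and prints how far each run ends up from the optimum; the starting point, learning rate and number of steps are arbitrary choices for illustration.
import torch
from torch import optim

def run(momentum, steps=50, lr=0.02):
    p = torch.tensor([4.0, 4.0], requires_grad=True)   # same start for both runs
    opt = optim.SGD([p], lr=lr, momentum=momentum)
    for _ in range(steps):
        opt.zero_grad()
        loss = (p ** 2).sum()    # f(x, y) = x^2 + y^2
        loss.backward()
        opt.step()
    return p.norm().item()       # distance from the minimum at (0, 0)

print("plain SGD:", run(momentum=0.0))
print("momentum :", run(momentum=0.9))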
RMSProp
Adaptive optimizers, like RMSProp[1], do not adjust speed per se, but determine a better direction for gradient descent. If we are dealing with an elongated, bowl shaped loss function for example, the gradients will not be symmetrical: they are much larger along some dimensions than along others. That means that we will approach the optimal value not in a direct line, but rather in a zigzagging manner.
We would like to move more in the x direction and less in the y direction, which would result in a straight line towards the optimum. Theoretically we could offset the zigzag by using an individual learning rate for each of the weights, but given that there are millions of weights in modern deep learning models, this approach is not feasible. Adaptive optimizers scale each gradient in such a way that we approach the optimum in a much straighter line. These optimizers allow us to use a single learning rate for the whole neural network.
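To see why a single learning rate struggles, take a made up elongated loss like f(x, y) = x^2 + 10y^2 with gradient \nabla f = (2x, 20y). At the point (1, 1) the gradient component along y is ten times larger than the one along x, so any single learning rate either crawls along x or overshoots along y, which is exactly the zigzag described above.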
Similar to momentum, RMSProp (root mean squared propagation) calculates a moving average, but instead of tracking the gradient, we track the squared gradient.
\mathbf{d}_t = \beta_2 \mathbf{d}_{t-1} + (1 - \beta_2) \mathbf{\nabla}_w^2
The square root of the vector \mathbf{d}_t is used to scale the gradient. This causes the per-weight updates to become similar in magnitude (which creates a straighter line), while the direction still comes from the current gradient.
\mathbf{w}_{t+1} := \mathbf{w}_t - \alpha \dfrac{\mathbf{\nabla}_w}{\sqrt{\mathbf{d}_t} + \epsilon}
The \epsilon variable is a very small positive number that is used in order to avoid division by zero.
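As a toy sketch of these two equations (again an illustration of the formulas, not PyTorch's internals):
import torch

# toy example: the same quadratic loss, now with the RMSProp scaling from above
w = torch.tensor([3.0, -2.0], requires_grad=True)
beta2, lr, eps = 0.9, 0.1, 1e-8
d = torch.zeros_like(w)  # moving average of squared gradients

for _ in range(20):
    loss = (w ** 2).sum()
    loss.backward()
    grad = w.grad.clone()
    d = beta2 * d + (1 - beta2) * grad ** 2   # track squared gradients
    with torch.no_grad():
        w -= lr * grad / (d.sqrt() + eps)     # per-weight scaled step
    w.grad.zero_()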
Below we compare vanilla gradient descent, gradient descent with momentum and RMSProp on a loss function with an elongated form. While simple gradient descent and momentum approach the optimum in a curved manner, RMSProp takes a basically straight route. Also notice that momentum can overshoot due to the speed it has gained and needs some time to reverse direction.
The API for all optimizers in PyTorch is identical, so we can simply replace the SGD object with the RMSprop object and we are good to go.
optimizer = optim.RMSprop(model.parameters(), lr=0.01)
Adam
Adam[1] is the combination of momentum and adaptive learning rates. If you look at the equations below, you will not find any new concepts. We calculate moving averages of the gradients and the squared gradients. The RMSProp style scaling is not applied directly to the gradient vector; instead we scale the momentum vector and use the result to adjust the weights.
\begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{\nabla}_w \\ \mathbf{d}_t &= \beta_2 \mathbf{d}_{t-1} + (1 - \beta_2) \mathbf{\nabla}_w^2 \\ \mathbf{w}_{t+1} &:= \mathbf{w}_t - \alpha \dfrac{\mathbf{m}_t}{\sqrt{\mathbf{d}_t} + \epsilon} \end{aligned}
Adam (and its derivatives) is probably the most used optimizer at this point in time. If you don't have any specific reason to use a different optimizer, use Adam.
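As a toy sketch of these three equations; note that the original Adam algorithm additionally applies a bias correction to \mathbf{m}_t and \mathbf{d}_t in the early steps, which the simplified form above (and this sketch) leaves out.
import torch

# toy example: the simplified Adam equations from above, without bias correction
w = torch.tensor([3.0, -2.0], requires_grad=True)
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
m = torch.zeros_like(w)   # moving average of gradients
d = torch.zeros_like(w)   # moving average of squared gradients

for _ in range(20):
    loss = (w ** 2).sum()
    loss.backward()
    grad = w.grad.clone()
    m = beta1 * m + (1 - beta1) * grad        # momentum term
    d = beta2 * d + (1 - beta2) * grad ** 2   # adaptive scaling term
    with torch.no_grad():
        w -= lr * m / (d.sqrt() + eps)        # scaled momentum update
    w.grad.zero_()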
We can use the Adam optimizer in PyTorch the following way.
optimizer = optim.Adam(model.parameters(), lr=0.01)
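For completeness, here is a minimal, self-contained training loop that shows where the optimizer fits into each batch; the linear model and the random data are placeholders for illustration.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# placeholder model and synthetic data, just to make the loop runnable
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=16)

for features, targets in dataloader:
    optimizer.zero_grad()                     # clear gradients from the previous batch
    loss = loss_fn(model(features), targets)
    loss.backward()                           # backpropagation computes the gradients
    optimizer.step()                          # the optimizer applies its update rule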