Gradient Descent
Before we discuss how we can find the optimal weights and the optimal bias in a linear regression setting, let us take a step back and consider how we can find the value of the variable x that minimizes the function f(x) = x^2 .
The equation f(x) = x^2 depicts a parabola. From visual inspection we can determine that f(x) is lowest when x is exactly 0.
In machine learning we rarely have the luxury of being able to visually find the optimal solution. Our function usually depends on thousands or millions of features, and that is not something we can visualize. We need to apply an algorithmic procedure that finds the minimum automatically. We start the algorithm by assigning x a random value. In the example below we picked 55.
Next we calculate the derivative of f(x) with respect to x . Using the rules of basic calculus we derive \dfrac{\mathrm{d}}{\mathrm{d}x}f(x) = 2x . The slope at our starting point is therefore 110. We can draw the tangent line at the starting point to visualize the derivative.
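As a quick sketch, we can verify the analytic derivative 2x in a few lines of Python, checking it against a numerical finite-difference approximation (the step size h is an illustrative choice, not one from the text).

```python
def f(x):
    # the function from the example: f(x) = x^2
    return x ** 2

def df(x):
    # analytic derivative: d/dx x^2 = 2x
    return 2 * x

x = 55
print(df(x))  # slope at the starting point: 110

# sanity check with a central finite-difference approximation
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(round(numeric, 3))  # very close to 110.0
```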
The negative of the derivative points in the direction of steepest descent; simply put, the derivative tells us in what direction we have to change x if we want to reduce f(x) . The gradient descent algorithm utilizes that direction and simply subtracts the derivative \dfrac{\mathrm{d}}{\mathrm{d}x}f(x) from x . Gradient descent is an iterative algorithm. That means that we keep calculating the derivative \dfrac{\mathrm{d}}{\mathrm{d}x}f(x) and updating the variable x until some criterion is met. For example, once the change in x falls below a certain threshold, we can assume that we are very close to the minimum.
While the derivative gives us the direction in which we should take a step, it does not give us the size of the step. For that purpose we use a variable \alpha , also called the learning rate. The learning rate scales the derivative by multiplying the direction with a value that usually lies between 0.1 and 0.001. Larger values of the learning rate could make the algorithm diverge: f(x) would get larger and larger and never get close to the minimum. Values that are too low, on the other hand, would slow down the training process dramatically.
Info
At each time step t of the gradient descent algorithm we update the variable x , until f(x) converges to the minimum.
x_{t+1} \coloneqq x_t - \alpha \dfrac{\mathrm{d}}{\mathrm{d}x}f(x_t)

Below you can play with an interactive example to get some intuition regarding the gradient descent algorithm. Each click on the play button takes a single gradient descent step, based on the parameters that you can change with the sliders.
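The update rule above can be sketched in a few lines of Python. The starting value 55 and learning rate 0.01 come from the example; the stopping threshold is an illustrative choice.

```python
def df(x):
    # derivative of f(x) = x^2
    return 2 * x

x = 55.0       # starting value from the example above
alpha = 0.01   # learning rate

# iterate x_{t+1} = x_t - alpha * f'(x_t) until the change in x
# falls below a small threshold
for step in range(10_000):
    x_new = x - alpha * df(x)
    if abs(x_new - x) < 1e-8:
        break
    x = x_new

print(x)  # very close to the minimum at 0
```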
You can learn several things about gradient descent if you play with the example.
- If you try positive and negative x values, you will observe that the sign of the derivative changes based on the sign of x . That behaviour makes sure that we subtract negative values from x when x is negative and subtract positive values from x when x is positive. No matter where we start, the algorithm always pushes the variable towards the minimum.
- If you try gradient descent with an \alpha of 1.01, you will observe that the algorithm starts to diverge. Picking the correct learning rate is an extremely useful skill and is generally one of the first things to tweak when you want your algorithm to perform better. In fact, \alpha is one of the so-called hyperparameters. A hyperparameter is a parameter that is set by the programmer and that influences the learning of the parameters you are truly interested in (like w and b ).
- You should also notice that the magnitude of the derivative decreases as we get closer and closer to the optimal value, while the slope of the tangent gets flatter and flatter. This natural behaviour makes sure that we take smaller and smaller steps as we approach the optimum. It also means that gradient descent does not find the exact optimal value for x , but an approximate one. In many cases it is sufficient to be close enough to the optimal value.
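The observations above can be reproduced in a small sketch: with a small learning rate the iterate shrinks towards 0, while with \alpha = 1.01 each update multiplies x by -1.02 and the magnitude grows without bound (the step count of 50 is an arbitrary illustrative choice).

```python
def df(x):
    # derivative of f(x) = x^2
    return 2 * x

def run(alpha, x=55.0, steps=50):
    # take a fixed number of gradient descent steps
    for _ in range(steps):
        x = x - alpha * df(x)
    return x

# alpha = 0.01: each step multiplies x by (1 - 2 * 0.01) = 0.98, so it shrinks
print(abs(run(0.01)))

# alpha = 1.01: each step multiplies x by (1 - 2 * 1.01) = -1.02, so it diverges
print(abs(run(1.01)))
```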
While the gradient descent algorithm is the de facto standard in deep learning, it has some limitations. Only when we are dealing with a convex function do we have a guarantee that the algorithm will converge to the global optimum. A convex function, like the parabola above, is a function that is shaped like a "bowl". Such a "bowl"-shaped function allows the variable to move towards the minimum without any barriers.
Below is the graph of the function f(x) = x^3 - 5x^2 + 10 , a non-convex function. We start at the x position with a value of 6. If you apply gradient descent several times (arrow button), you will notice that the ball gets stuck in the local minimum and thus never keeps going in the direction of the global minimum. This is due to the fact that at that local minimum the derivative is 0 and the gradient descent algorithm stops making progress. You can move the slider below the graph and place the ball to the left of 0 and observe that the ball will keep going further and further down.
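A minimal sketch of the trap described above: starting at x = 6 and following the derivative f'(x) = 3x^2 - 10x, the iterate settles at the local minimum x = 10/3 (the learning rate and step count are illustrative choices).

```python
def df(x):
    # derivative of f(x) = x^3 - 5x^2 + 10
    return 3 * x ** 2 - 10 * x

x = 6.0        # starting position from the example
alpha = 0.01   # illustrative learning rate

for _ in range(1000):
    x = x - alpha * df(x)

# the derivative 3x^2 - 10x is zero at x = 10/3, so the
# iterate gets stuck at this local minimum
print(round(x, 3))  # approximately 3.333
```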
This behaviour has several implications that we should discuss. First, the starting position of the variable matters and might have an impact on the performance. Second, the following question arises: "why do deep learning researchers and practitioners use gradient descent, if the neural network function is not convex and there is a chance that the algorithm will get stuck in a local minimum?". Simply put, because it works exceptionally well in practice. Additionally, we rarely use the "traditional" gradient descent algorithm in practice. Over time, researchers discovered that the algorithm can be improved by ideas such as "momentum", which keeps the speed of gradient descent over many iterations and might thus jump over the local minimum. We will cover those ideas later; for now let's focus on the basic algorithm.
Before we move on to the part where we discuss how we can apply this algorithm to linear regression, let us discuss how we can deal with functions that have more than one variable, for example f(x_1, x_2) = x_1^2 + x_2^2 . The approach is actually very similar. Instead of calculating the derivative \dfrac{\mathrm{d}}{\mathrm{d}x}f(x) with respect to x , we need to calculate the partial derivatives with respect to all variables, in our case \dfrac{\mathrm{\partial}}{\mathrm{\partial}x_1}f and \dfrac{\mathrm{\partial}}{\mathrm{\partial}x_2}f . For convenience we put the partial derivatives and the variables into their corresponding vectors.
\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \mathbf{\nabla} = \begin{bmatrix} \dfrac{\mathrm{\partial}}{\mathrm{\partial}x_1}f \\[8pt] \dfrac{\mathrm{\partial}}{\mathrm{\partial}x_2}f \end{bmatrix}

The gradient descent algorithm looks almost the same. The only difference is the substitution of vectors for scalars.
\mathbf{x}_{t+1} \coloneqq \mathbf{x}_t - \alpha \mathbf{\nabla}

The vector represented by \nabla (pronounced nabla) is called the gradient, giving its name to the gradient descent algorithm.
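The vector-valued update can be sketched with NumPy; for f(x_1, x_2) = x_1^2 + x_2^2 the gradient is simply (2x_1, 2x_2). The starting point and learning rate below are illustrative choices.

```python
import numpy as np

def grad(x):
    # gradient of f(x1, x2) = x1^2 + x2^2 is [2*x1, 2*x2]
    return 2 * x

x = np.array([55.0, -30.0])  # arbitrary starting point
alpha = 0.1                  # illustrative learning rate

# the scalar update rule carries over unchanged, only x and the
# gradient are now vectors
for _ in range(200):
    x = x - alpha * grad(x)

print(x)  # both components approach the minimum at [0, 0]
```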