Vanishing and Exploding Gradients
We expect the performance of a neural network to improve when we add more layers to its architecture. A deep neural network has more degrees of freedom to fit the data than a shallow one and should therefore perform much better. At the very least the network should be able to overfit the training data and show decent performance on the training dataset. Yet often the opposite happens: when you naively keep adding more and more layers, performance starts to deteriorate until the network is not able to learn anything at all. This has to do with so-called vanishing or exploding gradients. The vanishing gradient problem in particular plagued the machine learning community for a long time, but by now we have some excellent tools to deal with it.
To focus on the core idea of the problem, we are going to assume that each layer has just one neuron with a single weight and no bias. While this is an unrealistic assumption, the ideas carry over to much more complex neural networks.
The forward pass is straightforward. We alternate between calculating the net input z^{<l>} and the neuron output a^{<l>} until we arrive at the final activation a^{<3>} and the loss L.
In the backward pass we calculate the derivative of the loss with respect to the weights of the different layers by applying the chain rule over and over again. For the first weight w^{<1>} the calculation of the derivative looks as follows.
\dfrac{d}{dw^{<1>}} Loss = \dfrac{dLoss}{da^{<3>}} \boxed{ \dfrac{da^{<3>}}{dz^{<3>}} \dfrac{dz^{<3>}}{da^{<2>}} \dfrac{da^{<2>}}{dz^{<2>}} \dfrac{dz^{<2>}}{da^{<1>}} \dfrac{da^{<1>}}{dz^{<1>}} } \dfrac{dz^{<1>}}{dw^{<1>}}
If you look at the boxed calculations, you should notice that the same two types of derivatives are repeated over and over again: \dfrac{da}{dz} and \dfrac{dz}{da}. We would encounter the same pattern even if we had to deal with 100 layers. If we can figure out the nature of those two derivatives, we can understand what the value of the overall derivative looks like.
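To make the chain rule concrete, here is a minimal numerical sketch of the forward and backward pass for this three-layer, one-neuron-per-layer network. The input, target, weights, and the squared-error loss are hypothetical example values chosen purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values: input, target and three weights.
x, y = 1.0, 0.0
w1, w2, w3 = 0.5, 0.5, 0.5

# Forward pass: alternate between net input z and activation a.
z1 = w1 * x;  a1 = sigmoid(z1)
z2 = w2 * a1; a2 = sigmoid(z2)
z3 = w3 * a2; a3 = sigmoid(z3)
loss = 0.5 * (a3 - y) ** 2          # example squared-error loss

# Backward pass: apply the chain rule factor by factor.
dL_da3  = a3 - y                    # dLoss/da3
da3_dz3 = a3 * (1 - a3)             # sigmoid derivative
dz3_da2 = w3
da2_dz2 = a2 * (1 - a2)
dz2_da1 = w2
da1_dz1 = a1 * (1 - a1)
dz1_dw1 = x

dL_dw1 = dL_da3 * da3_dz3 * dz3_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
print(dL_dw1)
```

Every factor inside the boxed part of the equation shows up as one line of the backward pass, which is exactly why repeatedly multiplying small (or large) factors dominates the result.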
So far we have exclusively dealt with the sigmoid activation function \dfrac{1}{1 + e^{-z}}, whose derivative \dfrac{da^{<l>}}{dz^{<l>}} is a^{<l>}(1-a^{<l>}). When we plot the activation function together with its derivative, we notice that the derivative of the sigmoid approaches 0 when the net input gets very large or very small. At its peak the derivative is exactly 0.25.
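A quick sketch (assuming NumPy) confirms these numbers: the sigmoid derivative a(1-a) peaks at 0.25 for a net input of 0 and quickly decays towards 0 for large positive or negative inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
a = sigmoid(z)
da_dz = a * (1 - a)                 # derivative of the sigmoid

print(da_dz.max())                  # 0.25, reached at z = 0
print(da_dz[0], da_dz[-1])          # practically 0 at the edges
```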
If we assume the best-case scenario, we can replace each \dfrac{da^{<l>}}{dz^{<l>}} by 0.25 and we end up with the following calculation of the derivative.
\dfrac{d}{dw^{<1>}} Loss = \dfrac{dLoss}{da^{<3>}} \boxed{ 0.25 \dfrac{dz^{<3>}}{da^{<2>}} 0.25 \dfrac{dz^{<2>}}{da^{<1>}} 0.25 } \dfrac{dz^{<1>}}{dw^{<1>}}
Each additional layer in the neural network shrinks the derivative by a factor of at least 4. With just 5 layers we are already dealing with a factor of 0.25^5 \approx 0.001, which is very close to 0.
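The following sketch simply multiplies out the best-case factor of 0.25 per layer to show how quickly this term alone drives the gradient towards 0.

```python
# Best case: each sigmoid derivative contributes a factor of 0.25.
for n_layers in (3, 5, 10, 20):
    print(n_layers, 0.25 ** n_layers)
# 3  -> 0.015625
# 5  -> 0.0009765625
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```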
Given that the sigmoid derivative \dfrac{da^{<l>}}{dz^{<l>}} always lies between 0 and 0.25, we have to conclude that the overall derivative \dfrac{dL}{dw^{<1>}} approaches 0 as the number of layers grows. Layers that are close to the output layer are still able to adjust their weights appropriately, but the farther a layer is removed from the loss, the closer its accumulated multiplier, and with it the derivative, gets to 0. The weights of the first layers remain virtually unchanged from their initial values, which prevents the neural network from learning. That is the vanishing gradient problem.
The derivative \dfrac{dz^{<l>}}{da^{<l-1>}} on the other hand is just the corresponding weight w^{<l>}, because z^{<l>} = w^{<l>} a^{<l-1>} in our single-neuron setup.
Assuming for example that w^{<2>} and w^{<3>} are both 0.95, we would deal with the following gradient.
\dfrac{d}{dw^{<1>}} Loss = \dfrac{dLoss}{da^{<3>}} \boxed{ \dfrac{da^{<3>}}{dz^{<3>}} 0.95 \dfrac{da^{<2>}}{dz^{<2>}} 0.95 \dfrac{da^{<1>}}{dz^{<1>}} } \dfrac{dz^{<1>}}{dw^{<1>}}
Here we can make an argument similar to the one we made for the derivative of the sigmoid: when the weights have absolute values below 1, the gradients in the first layers will again approach 0 as the depth grows.
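As a rough illustration that ignores the sigmoid factors, here is what repeatedly multiplying by a weight of 0.95, one factor per layer, does as the network gets deeper.

```python
# One weight factor of 0.95 per layer, sigmoid factors ignored.
for n_layers in (3, 50, 100, 500):
    print(n_layers, 0.95 ** n_layers)
# 3   -> ~0.857
# 50  -> ~0.077
# 100 -> ~0.0059
# 500 -> ~7.3e-12
```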
Unlike the sigmoid derivative, however, the weights have no lower or upper bound. All weights could therefore lie in the range w > 1 or w < -1. If each weight is exactly 2, the gradient will grow exponentially with the number of layers.
\dfrac{d}{dw^{<1>}} Loss = \dfrac{dLoss}{da^{<3>}} \boxed{ \dfrac{da^{<3>}}{dz^{<3>}} 2 \dfrac{da^{<2>}}{dz^{<2>}} 2 \dfrac{da^{<1>}}{dz^{<1>}} } \dfrac{dz^{<1>}}{dw^{<1>}}
That can make the gradients in the first layers enormous, leading to the so-called exploding gradient problem. Gradient descent will most likely start to diverge, and at some point the gradient will overflow the floating-point range and our program will break down.
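A minimal sketch of the weight factors alone (again ignoring the damping sigmoid derivatives, and assuming NumPy) shows how a factor of 2 per layer blows up and eventually overflows a 64-bit float.

```python
import numpy as np

# A weight factor of 2 per layer grows as 2 ** n_layers.
for n_layers in (10, 100, 1000, 1100):
    print(n_layers, np.float64(2.0) ** n_layers)
# 10   -> 1024.0
# 100  -> ~1.3e+30
# 1000 -> ~1.1e+301
# 1100 -> inf (NumPy warns about overflow and returns infinity)
```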
Info
Derivatives of activation functions and weights have a significant impact on whether we can train a deep neural network successfully or not.
The remedies to those problems will for the most part deal with adjustments to the weights and the activation functions. This will be the topic of this chapter.