Batch Normalization
In a previous chapter we discussed feature scaling. Feature scaling only applies to the input layer, so should we also try to scale the intermediate features that are produced by the hidden units? Would that be beneficial for training in any way?
Sergey Ioffe and Christian Szegedy answered this question with a definitive yes[1]. When we add so-called batch normalization to the hidden features, we can speed up the training process significantly, while gaining several additional advantages.
Consider a particular layer $l$ to whose output we would like to apply batch normalization. Using a batch of data we calculate the mean $\mu_j$ and the variance $\sigma_j^2$ for each hidden unit $j$ in the layer.
Given those parameters we can normalize the hidden features, using the same procedure we used for feature scaling.
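Written out for a mini-batch of $m$ samples, where $a_j^{(i)}$ denotes the output of hidden unit $j$ for sample $i$ (this superscript notation is introduced here purely for illustration), the statistics and the normalized value are calculated as follows:

$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} a_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(a_j^{(i)} - \mu_j\right)^2, \qquad \hat{a}_j^{(i)} = \frac{a_j^{(i)} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$$

where $\epsilon$ is a small constant that prevents division by zero.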
The authors argued that this normalization procedure might in theory be detrimental to performance, because it might reduce the expressiveness of the neural network. To counteract that, they introduced an additional step that allows the neural network to reverse the standardization.
The feature-specific parameters $\gamma$ and $\beta$ are learned by the neural network. If the network decides to set $\gamma_j$ to $\sigma_j$ and $\beta_j$ to $\mu_j$, that essentially neutralizes the normalization. So if normalization indeed worsens performance, the neural network has the option to undo it.
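Using the notation from above, this scale-and-shift step produces the final output of the batch norm layer:

$$y_j^{(i)} = \gamma_j \hat{a}_j^{(i)} + \beta_j$$

Setting $\gamma_j = \sigma_j$ and $\beta_j = \mu_j$ recovers the original value up to the small $\epsilon$ term.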
Our formulation above applies batch normalization to the activations. This is closest in spirit to input feature scaling, because you normalize the data that is passed on to the next layer. In practice, though, batch norm is often applied to the net inputs instead, and the result is then forwarded to the activation function.
There is no real consensus on which of the two placements you should use, but this decision in all likelihood will not make or break your project.
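Both placements are easy to express in PyTorch; the dimensions below are arbitrary placeholders.

import torch.nn as nn

# batch norm applied to the net inputs, before the activation
pre_activation = nn.Sequential(
    nn.Linear(70, 70, bias=False),
    nn.BatchNorm1d(70),
    nn.ReLU(),
)

# batch norm applied to the activations, after the non-linearity
post_activation = nn.Sequential(
    nn.Linear(70, 70),
    nn.ReLU(),
    nn.BatchNorm1d(70),
)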
There is an additional caveat. In practice we often remove the bias term $b$ from the calculation of the net input $z = \mathbf{x}\mathbf{w}^T + b$. This is done under the assumption that $\beta$ essentially performs the same operation: both shift the calculation by a constant, and there is hardly a reason to do that twice.
The authors observed several advantages that batch normalization provides. For one, batch norm makes the model less sensitive to the choice of the learning rate, which allows us to increase the learning rate and thereby speed up convergence. Second, the model is more forgiving of poorly chosen initial weights. Third, batch normalization seems to help with the vanishing gradients problem. Overall the authors observed a significant increase in training speed, requiring fewer epochs to reach the desired performance. Finally, batch norm seems to act as a regularizer. When we train the neural network we calculate the mean $\mu_j$ and the standard deviation $\sigma_j$ one batch at a time. This calculation is noisy and the neural network has to learn to tune out that noise in order to achieve a reasonable performance.
During inference the procedure of calculating per-batch statistics would cause problems, because different inference runs would generate different means and standard deviations and therefore different outputs. We want the neural network to be deterministic during inference: the same inputs should always lead to the same outputs. For that reason, during training the batch norm layer additionally tracks a moving average of $\mu$ and $\sigma$ that is used at inference time.
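As a rough sketch of how such a moving average can be tracked (the function name is made up for illustration; the momentum value of 0.1 mirrors PyTorch's default):

import torch

def update_running_stats(running_mean, running_var, batch, momentum=0.1):
    # per-feature statistics of the current batch
    batch_mean = batch.mean(dim=0)
    batch_var = batch.var(dim=0, unbiased=False)
    # exponential moving average, stored for use at inference time
    running_mean = (1 - momentum) * running_mean + momentum * batch_mean
    running_var = (1 - momentum) * running_var + momentum * batch_var
    return running_mean, running_var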
Let us also mention that no one really seems to know why batch norm works. Different hypotheses have been formulated over the years, but there is no clear consensus on the matter. All you have to know is that batch normalization works well and is almost a requirement for training modern deep neural networks. This technique will become one of your main tools when designing modern neural network architectures.
PyTorch has an explicit BatchNorm1d module that can be applied to a flattened tensor, like the flattened MNIST image. The 2d version will become important when we start dealing with 2d images. Below we create a small module that combines a linear mapping, batch normalization and a non-linear activation. Notice that we provide the linear module with the argument bias=False in order to deactivate the bias calculation.
import torch.nn as nn

HIDDEN_FEATURES = 70

class BatchModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            # the bias is redundant, because BatchNorm1d already shifts by beta
            nn.Linear(HIDDEN_FEATURES, HIDDEN_FEATURES, bias=False),
            nn.BatchNorm1d(HIDDEN_FEATURES),
            nn.ReLU(),
        )

    def forward(self, features):
        return self.layers(features)
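A quick sanity check on a random batch shows that the module keeps the feature dimension intact:

import torch

block = BatchModule()
out = block(torch.randn(32, HIDDEN_FEATURES))
print(out.shape)  # torch.Size([32, 70])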
We can reuse the module defined above several times.
# NUM_FEATURES and NUM_LABELS are assumed from the MNIST setup of earlier chapters
NUM_FEATURES = 28 * 28
NUM_LABELS = 10

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(NUM_FEATURES, HIDDEN_FEATURES),
            BatchModule(),
            BatchModule(),
            BatchModule(),
            nn.Linear(HIDDEN_FEATURES, NUM_LABELS),
        )

    def forward(self, features):
        return self.layers(features)
As the batch normalization layer behaves differently during training and evaluation, don't forget to switch between model.train() and model.eval().
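A minimal sketch of that switch, using a random batch in place of real MNIST data:

import torch

model = Model()

# training mode: batch statistics are used and the running averages are updated
model.train()
# ... training loop goes here ...

# evaluation mode: the stored running averages are used, so outputs are deterministic
model.eval()
with torch.no_grad():
    dummy_images = torch.randn(32, 1, 28, 28)
    predictions = model(dummy_images)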