Feature Scaling
You will often face datasets whose features are on very different scales. Such a dataset could, for example, contain the height and the weight of a person as features.
Height (m) | Weight (kg) |
---|---|
1.72 | 85 |
1.92 | 92 |
1.55 | 52 |
1.62 | 61 |
1.70 | 71 |
You have probably already guessed that we used the metric system to express the height and the weight of a person. For height we used meters, which range (roughly) from 0.5 to 2.1 meters. For weight we used kilograms, which range from about 4 kg to 120 kg for people of different ages. If we used these values in training without any rescaling, our neural network might either take a very long time to converge or not converge at all.
We can demonstrate this idea by looking at the example below, where we use the function $f(x, y) = x^2 + y^2$ to construct the contour lines. This is a bowl-shaped function that we observe from above. The different colors represent the different values of $f(x, y)$: the darker the color, the lower the output and the closer we are to the optimum. The lowest value is at the point $(0, 0)$. The contour lines are circular and perfectly symmetrical, because both variables $x$ and $y$ have the same impact on the function output. Due to this symmetry, it does not matter which starting $x$ and $y$ values we pick: gradient descent moves in a straight line towards the minimum.
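To make this concrete, here is a minimal sketch of gradient descent on $f(x, y) = x^2 + y^2$ (the starting point and number of steps are arbitrary, chosen only for illustration). Because the partial derivatives are $2x$ and $2y$, each update scales both coordinates by the same factor, so the ratio $y/x$ stays constant and the path is a straight line.

```python
import numpy as np

# Gradient descent on the symmetric bowl f(x, y) = x^2 + y^2.
# The gradient is (2x, 2y), so one update with learning rate alpha
# multiplies both coordinates by the same factor (1 - 2 * alpha):
# the point moves straight towards the minimum at (0, 0).
alpha = 0.1
point = np.array([3.0, -2.0])  # arbitrary starting values

for step in range(5):
    gradient = 2 * point
    point = point - alpha * gradient
    # the ratio y/x never changes -> the path is a straight line
    print(step, point, point[1] / point[0])
```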
What happens if the variables have an asymmetric impact on the output? Let us consider the function $f(x, y) = x^2 + 9y^2$. We get non-symmetrical contour lines and a zigzagging effect. Gradient descent does not move the $x$ and $y$ values in a straight line, but oscillates in the $y$ direction. When we move in the $x$ direction, we move by $2x \cdot \alpha$; when we move in the $y$ direction, we move by $18y \cdot \alpha$, so there is naturally a higher chance to overshoot in the $y$ direction.
In some cases the values can oscillate indefinitely and never converge to the optimum. This is the case for $f(x, y) = x^2 + 10y^2$ with a learning rate of 0.1.
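The sketch below reproduces this behaviour numerically (the starting point is again arbitrary). With a learning rate of 0.1 the update for $y$ becomes $y - 0.1 \cdot 20y = -y$, so $y$ just flips its sign forever, while $x$ shrinks by a factor of 0.8 per step.

```python
# Gradient descent on f(x, y) = x^2 + 10y^2 with learning rate 0.1.
# The gradient is (2x, 20y):
#   x <- x - 0.1 * 2x  = 0.8 * x   (converges towards 0)
#   y <- y - 0.1 * 20y = -y        (oscillates indefinitely)
alpha = 0.1
x, y = 1.0, 1.0  # arbitrary starting values

for step in range(6):
    x = x - alpha * 2 * x
    y = y - alpha * 20 * y
    print(f"step {step}: x = {x:.3f}, y = {y:+.1f}")
```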
There are a couple of things we can do to reduce the chance of oscillating. We could for example use a lower learning rate for both variables, but that might slow down the learning process significantly. Or we could use a different learning rate for each feature, but tweaking many thousands of learning rates is infeasible. In practice we scale the input features by normalizing or standardizing them. Both techniques bring the features onto the same scale. It usually does not matter much which of the two procedures you employ; just keep in mind to scale your inputs before you start training.
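As a practical sketch (assuming scikit-learn is available; the train/test split below is hypothetical and not part of the original example), both procedures exist as ready-made transformers. A common convention is to compute the scaling statistics on the training data only and reuse them for any new data.

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical split of the height/weight rows from the table above.
X_train = [[1.72, 85.0], [1.92, 92.0], [1.55, 52.0], [1.62, 61.0]]
X_test = [[1.70, 71.0]]

scaler = MinMaxScaler()  # or StandardScaler() for z-score standardization
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from the training data
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics for new data
```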
Normalization
Normalization, also called min-max scaling, transforms the features into the 0 to 1 range.
$$x^{(i)}_j = \dfrac{x_j^{(i)} - \min(x_j)}{\max(x_j) - \min(x_j)}$$

The largest value of a feature is assigned a value of 1, the lowest value is assigned a value of 0, and the remaining values are scaled to lie between 0 and 1.
When we apply normalization to the example above, we end up with the following feature values.
Height | Weight |
---|---|
0.46 | 0.82 |
1.00 | 1.00 |
0.00 | 0.00 |
0.19 | 0.23 |
0.41 | 0.47 |
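The values above can be reproduced with a few lines of NumPy, applying the formula column by column (a minimal sketch; the printed values match the table up to rounding).

```python
import numpy as np

# Min-max normalization of the height/weight table, column by column.
X = np.array([[1.72, 85.0],
              [1.92, 92.0],
              [1.55, 52.0],
              [1.62, 61.0],
              [1.70, 71.0]])

X_min = X.min(axis=0)  # per-feature minimum
X_max = X.max(axis=0)  # per-feature maximum
X_norm = (X - X_min) / (X_max - X_min)
print(np.round(X_norm, 2))  # matches the table above up to rounding
```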
Standardization
The standardization procedure, also called z-score normalization, produces feature values that have a mean $\mu$ of 0 and a standard deviation $\sigma$ of 1.
$$x^{(i)}_j = \dfrac{x_j^{(i)} - \mu_j}{\sigma_j}$$

When we apply standardization to the example above (using the population standard deviation of each feature), we end up with the following features.
Height | Weight |
---|---|
0.14 | 0.87 |
1.75 | 1.34 |
-1.22 | -1.37 |
-0.66 | -0.76 |
-0.02 | -0.08 |
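As with normalization, a minimal NumPy sketch can reproduce these values (using the population standard deviation, which is NumPy's default).

```python
import numpy as np

# Standardization (z-scores) of the height/weight table, column by column.
X = np.array([[1.72, 85.0],
              [1.92, 92.0],
              [1.55, 52.0],
              [1.62, 61.0],
              [1.70, 71.0]])

mu = X.mean(axis=0)    # per-feature mean
sigma = X.std(axis=0)  # per-feature standard deviation (population, ddof=0)
X_std = (X - mu) / sigma
print(np.round(X_std, 2))  # matches the table above up to rounding
```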