Geometric Interpretation

So far we have discussed how a computational graph works and how we can use autodiff packages to solve nonlinear problems. Yet we are still missing a crucial component that will allow us to understand the inner workings of neural networks. We need to answer the following question: what does a solution to a nonlinear problem look like geometrically?

To get to that understanding, we first have to realize that a neural network is a series of transformations. Multiplying the features with a matrix is a linear transformation, adding a bias is a translation, and applying an activation function squishes the data. Each of these transformations accomplishes a different task and can be interpreted visually. Additionally, we can stack several layers in a neural network, thus creating a composition of transformations.

Info

A neural network is a composition of transformations.

To demonstrate the visual interpretation of a transformation we will use the following feature matrix $X$ with 2 features and 4 samples, where each row is one sample.

$$X = \begin{bmatrix} -1 & 1 \\ -1 & -1 \\ 1 & -1 \\ 1 & 1 \end{bmatrix}$$

The four samples form a square in a 2d coordinate system.

Figure: the four samples plotted in the Feature 1 / Feature 2 plane, forming a square with corners (-1, 1), (-1, -1), (1, -1) and (1, 1).
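If you would like to follow along in code, the square can be written down as a small array. The snippet below is a minimal NumPy sketch; the name `X` and the samples-as-rows layout are our own choices for illustration.

```python
import numpy as np

# each row is a sample, each column a feature:
# the four corners (-1, 1), (-1, -1), (1, -1), (1, 1) of the square
X = np.array([
    [-1.0,  1.0],
    [-1.0, -1.0],
    [ 1.0, -1.0],
    [ 1.0,  1.0],
])
print(X.shape)  # (4, 2) -> 4 samples, 2 features
```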

We will start with matrix multiplications.

Info

A matrix multiplication is a linear transformation.

While there is a formal definition of linear transformations, we can use a somewhat looser description as a mental model: in a linear transformation, parallel lines remain parallel and the origin does not move. So the four sides of the square above will remain parallel lines after a linear transformation.

We apply a linear transformation by multiplying the feature matrix $X$ by the weight matrix $W$, our transformation matrix. Depending on the contents of $W$, different types of transformations are produced. For now the weight matrix is going to be a 2x2 matrix: that way we take 2 features per sample as input and generate 2 transformed features per sample as output.

The identity matrix is the easiest matrix to understand.

$$I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

Applying this transformation leaves the feature matrix unchanged.

If we change the values of the identity matrix slightly, we scale the original square. The matrix $\begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$ for example scales the input square in the x direction by a factor of 2.

Figure: the square stretched to twice its width in the x direction.

The matrix $\begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix}$ on the other hand scales the square in the y direction by a factor of 0.5.

Figure: the square compressed to half its height in the y direction.
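In code, the scaling transformations amount to a single matrix multiplication. This is a minimal NumPy sketch that reuses the square from above; the layout and variable names are again our own.

```python
import numpy as np

# the four corners of the square (rows are samples)
X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]])

I = np.array([[1.0, 0.0],
              [0.0, 1.0]])      # identity: leaves the square unchanged
W_x = np.array([[2.0, 0.0],
                [0.0, 1.0]])    # scales the x direction by a factor of 2
W_y = np.array([[1.0, 0.0],
                [0.0, 0.5]])    # scales the y direction by a factor of 0.5

print(X @ I)    # identical to X
print(X @ W_x)  # x coordinates doubled
print(X @ W_y)  # y coordinates halved
```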

So far we have only used the diagonal of the matrix to scale the square. The off-diagonal entries can be used for the so-called shear operation. When we place a nonzero value in one of the off-diagonal positions, for example, the top and the bottom lines are moved to the right and to the left respectively.

Figure: the sheared square, with the top edge shifted to the right and the bottom edge shifted to the left.

A nonzero value in the other off-diagonal position, on the other hand, moves the left and the right lines down and up respectively.

Figure: the sheared square, with the left edge shifted down and the right edge shifted up.

We can combine scaling and shearing to achieve interesting transformations. A rotation matrix of the form $\begin{bmatrix} \cos(\theta) & \sin(\theta) \\ -\sin(\theta) & \cos(\theta) \end{bmatrix}$, for example, rotates the data by the angle $\theta$.

Figure: the rotated square.
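The same pattern covers shearing and rotation. The concrete numbers below, a shear of 0.5 and a rotation by 45 degrees, are arbitrary choices for illustration and not the values used in the plots above.

```python
import numpy as np

X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]])

# shear: the off-diagonal entry mixes y into x (the 0.5 is an arbitrary choice)
W_shear = np.array([[1.0, 0.0],
                    [0.5, 1.0]])

# rotation by 45 degrees (counterclockwise with this row-vector convention)
theta = np.pi / 4
W_rot = np.array([[ np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])

print(X @ W_shear)  # the top edge shifts right, the bottom edge shifts left
print(X @ W_rot)    # the square is rotated around the origin
```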

Next let's look at the visual interpretation of the bias.

Info

Bias addition is a translation.

A bias allows us to translate the data, which means that every point is moved by the same amount. The vector $\begin{bmatrix} 1 & 0 \end{bmatrix}$ for example moves all points in the x direction by 1.

Figure: the square shifted by 1 along the x axis.

The vector $\begin{bmatrix} 0 & 1 \end{bmatrix}$, on the other hand, moves all points by 1 in the y direction.

Figure: the square shifted by 1 along the y axis.
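In code the translation is a simple addition; NumPy broadcasts the bias vector to every row of the feature matrix.

```python
import numpy as np

X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]])

b_x = np.array([1.0, 0.0])  # shifts every point by 1 along the x axis
b_y = np.array([0.0, 1.0])  # shifts every point by 1 along the y axis

print(X + b_x)  # broadcasting adds the bias to every row (sample)
print(X + b_y)
```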

A translation is not a linear transformation. If we apply a linear transformation that induces a rotation, the origin stays in place after the transformation.

Figure: a rotation applied to the points (0, 0), (1, 1) and (-1, 1); the point at the origin does not move.

A translation on the other hand moves the origin.

Figure: a translation applied to the same points; the origin is moved as well.

A neural network combines a linear transformation with a translation. In linear algebra this combination is called an affine transformation.
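A minimal sketch of one affine transformation, combining a weight matrix with a bias; the concrete values are again arbitrary choices for illustration.

```python
import numpy as np

X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]])

W = np.array([[2.0, 0.0],
              [0.0, 0.5]])   # linear part: scale x by 2 and y by 0.5
b = np.array([1.0, 1.0])     # translation part: shift by 1 in both directions

Z = X @ W + b                # the affine transformation of a single linear layer
print(Z)
```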

Let's finally move to activation functions.

Info

A nonlinear activation function squishes the data.

We can imagine the nonlinear transformations as some sort of "squishing", where the activation function limits the data to a certain range. The sigmoid that we have utilized so far pushes the vectors into a 1 by 1 box.

Figure: after applying the sigmoid, all points lie inside the 1 by 1 box.

The ReLU activation function is even wilder. The function turns negative numbers into zeros and leaves positive numbers untouched. With a ReLU parallel lines do not necessarily stay parallel.

Figure: after applying the ReLU, negative coordinates are set to zero while positive coordinates stay unchanged.
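Below is a small sketch of both activation functions applied to the corners of the square; the sigmoid and ReLU are written out by hand to make the "squishing" explicit.

```python
import numpy as np

X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

print(sigmoid(X))  # every coordinate ends up strictly between 0 and 1
print(relu(X))     # negative coordinates become 0, positive ones stay as they are
```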

There is a final remark we would like to make regarding linear transformations. So far we have used a 2x2 weight matrix for our linear transformations. We did this in order to keep the number of dimensions constant: we took in two features and produced two neurons, so we could visualize the results in a 2d plane. If we had used a 2x3 matrix instead, the transformation would have pushed the features into 3d space. In deep learning we change the number of dimensions all the time by changing the number of neurons from layer to layer. Sometimes the network can find a better solution in a different dimension.
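The sketch below demonstrates the jump from 2d to 3d with an arbitrary 2x3 weight matrix; the values have no special meaning.

```python
import numpy as np

X = np.array([[-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [1.0, 1.0]])

# a 2x3 weight matrix: 2 features in, 3 transformed features out
W = np.array([[1.0, 0.5, -1.0],
              [0.0, 2.0,  1.0]])

Z = X @ W
print(Z.shape)  # (4, 3) -> the four samples now live in 3d space
```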

So what exactly does a neural network try to achieve through those transformations? We are going to use a slightly different architecture to solve our circular data problem. The architecture below was not picked randomly; it was chosen to show some of the magic that is hidden under the hood of a neural network.

Figure: the network architecture with 2 input features, a first hidden layer with 4 neurons, a second hidden layer with 2 neurons and a single output neuron.
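The interactive example further down uses its own implementation under the hood, but the architecture can be sketched in a few lines of PyTorch. The use of sigmoid activations throughout is our assumption for illustration.

```python
import torch
from torch import nn

# a sketch of the 2 -> 4 -> 2 -> 1 architecture described above
model = nn.Sequential(
    nn.Linear(2, 4),   # hidden layer 1: 2 features in, 4 neurons out
    nn.Sigmoid(),
    nn.Linear(4, 2),   # hidden layer 2: 4 features in, 2 neurons out
    nn.Sigmoid(),
    nn.Linear(2, 1),   # output layer: behaves like logistic regression
    nn.Sigmoid(),
)

x = torch.randn(8, 2)  # a batch of 8 samples with 2 features each
print(model(x).shape)  # torch.Size([8, 1])
```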

Let us remember that logistic regression is able to deal with classification problems, but only if the data is linearly separable. The last layer of the above neural network looks exactly like logistic regression with two input features. That must mean that the neural network is somehow able to use linear and nonlinear transformations to extract features that are linearly separable.

The example below shows how the neural network learns those transformations. On one side you can see the original inputs with the learned decision boundary; on the other side are the two extracted features that are used as inputs to the output layer. When the neural network has learned to separate the two circles, the two features from the last hidden layer have become linearly separable. Start the example and observe the learning process. At the beginning the hidden features are clustered together, but after a while you will notice that you could separate the differently colored circles with a single line.

Interactive example: the left panel shows the original features (Feature 1, Feature 2) with the learned decision boundary, the right panel shows the two hidden features (Hidden Feature 1, Hidden Feature 2) from the last hidden layer; the cross-entropy loss is displayed above the plots.

It is not always clear how the neural network finds those transformations, but we can use the example above to get some intuition for the process. If you look at the original circular data again you might notice something peculiar. Imagine the data is actually located in 3d space and you are looking at it from above. Now imagine that the blue and the red dots are located at different heights along the z-axis. Wouldn't that mean that you could construct a 2d plane in 3d space to linearly separate the data? Yes it would. The first hidden layer of our neural network transforms the 2d data into 4d data. Afterwards we move the processed features back into 2d space.
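To look at the hidden features yourself, you can run only the part of the network up to the second hidden layer. A minimal sketch, continuing the hypothetical PyTorch model from above:

```python
import torch
from torch import nn

# the same sketch of the 2 -> 4 -> 2 -> 1 network as above
model = nn.Sequential(
    nn.Linear(2, 4), nn.Sigmoid(),
    nn.Linear(4, 2), nn.Sigmoid(),   # the two hidden features live here
    nn.Linear(2, 1), nn.Sigmoid(),
)

x = torch.randn(8, 2)

# running only the first four modules yields the two hidden features
# that the output layer has to separate with a straight line
hidden_features = model[:4](x)
print(hidden_features.shape)  # torch.Size([8, 2])
```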

Modern neural networks have hundreds or thousands of dimensions and hidden layers, and we cannot visualize the hidden features to get a better feel for what the neural network does. But generally speaking we can state the following.

Info

Affine transformations move, scale and rotate the data and carry it between different dimensions. Activation functions squish or restrict the data to deal with nonlinearity. The last hidden layer contains features that can be linearly separated to solve a particular problem.

Try to keep this intuition in mind while you move forward with your studies. It is easy to forget.