Convolutional Neural Networks

Let's start this section with the following question.

"Why is a fully connected neural network not the ideal tool for computer vision tasks?"

Let's assume for a second that we look at an image of a digit. If you look at the digit directly, you will have no problem recognizing the number. But when you interact with the example and flatten the image (as we did with MNIST so far), the task gets a lot harder. Yet that is exactly the problem our fully connected neural network has to face.

The loss of all spatial information also makes our model quite sensitive to different types of transformations: translation, rotation, scaling, color and lighting. The image below is shifted slightly to the right and to the top. When you compare the two flattened images you will notice that there is hardly any overlap in pixel values, even though we are dealing with an almost identical image.

Even if we didn't lose any spatial information, the combination of a fully connected neural network and images would be problematic. The neural network below processes a flattened greyscale image of size 4x4 pixels. You can hardly argue that this is an image at all, yet the 16 inputs and the ten neurons in the hidden layer already require 160 weights and 10 biases, and the output neuron requires 11 more parameters.

[Interactive diagram: the 16 flattened input pixels, a hidden layer of 10 neurons and a single output neuron of the fully connected network.]

Real-life images are vastly larger than that, and we would require a neural network with hundreds or thousands of neurons and several hidden layers to solve even a simple task. Even for an image of size 100x100 and a hidden layer of 100 neurons we are dealing with 1,000,000 weights. Training fully connected neural networks on images can become extremely inefficient.
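
To make the scaling problem concrete, here is a minimal sketch that counts the parameters of a single fully connected layer. The helper fc_params is our own; the layer sizes are the examples from the text.

# Parameter count of a single fully connected layer:
# every input pixel is connected to every neuron, plus one bias per neuron.
def fc_params(num_pixels: int, num_neurons: int) -> int:
    return num_pixels * num_neurons + num_neurons

print(fc_params(4 * 4, 10))       # 170 -> the hidden layer of the tiny 4x4 example above
print(fc_params(28 * 28, 100))    # 78,500 -> an MNIST-sized input
print(fc_params(100 * 100, 100))  # 1,000,100 -> already more than a million parameters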

Given those problems, we need a new neural network architecture. An architecture that is able to deal with images and video without destroying spatial information and that requires fewer learnable parameters at the same time. The neural network that alleviates our problems is called a convolutional neural network, often abbreviated as CNN or ConvNet.

Convolutional Layer

Let's take it one step at a time and think about how we could design such a neural network. We start with a basic assumption.

Info

Pixels in a small region of an image are highly correlated.

Look at the image below. If you pick any pixel of that image, then with a very high probability the neighbouring pixels that surround that location are going to be part of the same object and will exhibit similar color values. Pixels that are part of the sky are surrounded by other sky pixels and mountain pixels are surrounded by other mountain pixels.

Sky, mountains and sea
Source: Alex Holzreiter, Unsplash

In order to somehow leverage the spatial correlation that is contained in a local patch of pixels we could construct a neural network that limits the receptive field of each neuron.

Info

The receptive field of a neuron describes the area of an image that a neuron has access to.

In a convolutional layer a neuron gets assigned a small patch of the image. Below for example the first neuron in the first hidden layer would focus only on the top left corner of the input image.

In a fully connected neural network a neuron has to be connected to all input pixels (hence the name fully connected). If we limit the receptive field to a local patch of 2x2 pixels, that reduces the number of weights for a single neuron from 28*28 = 784 (MNIST dimensions) to just four. This is called sparse connectivity.

Each neuron is calculated using a different patch of pixels, and you can imagine those calculations being conducted by sliding a window over the input image. The output neurons are placed in a way that keeps the spatial structure of the image. For example, the neuron that has the upper left corner in its receptive field is located in the upper left corner of the hidden layer. The neuron that attends to the patch to the right of the upper left corner is placed to the right of the aforementioned neuron. When the receptive field moves a row down, the neurons that attend to that receptive field also move a row down. This results in a new two-dimensional image. You can start the interactive example below and observe how the receptive field moves and how the neurons are placed in a 2D grid. Notice also that the output image shrinks. This is expected, because a 2x2 patch is required to construct a single neuron.

You have a lot of control over the behaviour of the receptive field. You can, for example, control its size. Above we used a window of size 2x2, but 3x3 is also a common size.

The stride is another hyperparameter you will be interested in. The stride controls the number of steps by which the receptive field is moved. Above the field was moved 1 step to the right and 1 step down, which corresponds to a stride of 1. In the example below we use a stride of 2. A larger stride obviously makes the output image smaller.

As you have probably noticed, the output image is always smaller than the input image. If you want to keep the dimensionality between the input and output images consistent, you can pad the input image. Basically that means that you add artificial pixels by surrounding the input image with zeros.

[Interactive example: the input image surrounded by a border of zero-valued pixels.]
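
The combined effect of the kernel size, stride and padding on the output resolution can be summarized in a single formula. The helper conv_output_size below is our own sketch of that calculation; the formula itself matches the one used by PyTorch's Conv2d.

# Output resolution of a convolution along one dimension:
# floor((input + 2 * padding - kernel_size) / stride) + 1
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(4, kernel_size=2))             # 3 -> the 4x4 example shrinks to 3x3
print(conv_output_size(4, kernel_size=2, stride=2))   # 2 -> a stride of 2 shrinks it further
print(conv_output_size(4, kernel_size=3, padding=1))  # 4 -> padding keeps the resolution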

When it comes to the actual calculation of the neuron values, we are dealing with a procedure that is almost identical to the one we used in the previous chapters. Let's assume we want to calculate the activation value for the patch in the upper left corner.

The patch $\mathbf{X}$ contains the four pixel values $x_1, x_2, x_3, x_4$ and is of size $2 \times 2$, therefore we need exactly 4 weights $w_1, w_2, w_3, w_4$.

This collection of weights that is applied to a limited receptive field is called a filter or a kernel.

Similar to a fully connected neural network, we calculate a weighted sum, add a bias and apply a non-linear activation function to get the value of a neuron in the next layer.

$a = f\Big(\sum_{i=1}^{4} w_i x_i + b\Big)$

What is unique about convolutional neural networks is the weight sharing among all neurons. When we slide the window of the receptive field, we do not replace the weights and biases, but always keep the same identical filter $\mathbf{W}$ and bias $b$. Weight sharing allows a filter to be translation invariant, which means that a filter learns to detect particular features (like edges) of an image independent of where those features are located.

The image that is produced by a filter is called a feature map. Essentially a convolutional operation uses a filter to map an input image to an output image that highlights the features that are encoded in the filter.
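
To make the sliding window and the weight sharing explicit, here is a minimal from-scratch sketch of a single-filter convolution on a 2D image. The function convolve2d is our own and is written with PyTorch tensors for convenience; it is meant for illustration, not as an efficient implementation.

import torch

def convolve2d(image, kernel, bias=0.0, stride=1):
    # Slide the same kernel (shared weights) over every patch of the image
    # and compute a weighted sum plus bias for each position.
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = torch.zeros(out_h, out_w)
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row * stride : row * stride + k_h,
                          col * stride : col * stride + k_w]
            feature_map[row, col] = (patch * kernel).sum() + bias
    return feature_map

Applying a non-linearity like ReLU to the returned feature map would complete the calculation described above.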

In the example below the input image and the kernel have pixel values of -1, 0 or 1. The convolution produces positive values where positive image values line up with positive kernel values, or negative values line up with negative ones. In our case the filter and the image only overlap sufficiently on the right edge. Remember that we are most likely going to apply a ReLU non-linearity afterwards, which means that the negative numbers are going to be set to 0.

Input image (5x5):
 0  1  0  0  1
 1  1  0  0  1
 0 -1  0  0  1
 0  0 -1 -1  1
 0  1  0 -1 -1

Filter (3x3):
-1  0  1
-1  1  1
 0  0  1

Feature map (3x3):
 0 -2  3
-3 -1  3
-1 -2  1

Different filters generate different types of overlaps and thereby focus on different features of an image. Using the same image, but a different filter, produces a feature map that highlights the upper edge.

Input image (5x5): same as above

Filter (3x3):
 1  0 -1
 1  1 -1
-1  0  1

Feature map (3x3):
 2  3 -1
-1 -1  0
 1 -3 -5
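
As a sanity check, the two feature maps above can be reproduced with PyTorch's functional convolution. This is just a verification sketch with the numbers from the example hard-coded and the variable names chosen by us.

import torch
import torch.nn.functional as F

image = torch.tensor([[ 0,  1,  0,  0,  1],
                      [ 1,  1,  0,  0,  1],
                      [ 0, -1,  0,  0,  1],
                      [ 0,  0, -1, -1,  1],
                      [ 0,  1,  0, -1, -1]], dtype=torch.float32)

right_edge_filter = torch.tensor([[-1,  0,  1],
                                  [-1,  1,  1],
                                  [ 0,  0,  1]], dtype=torch.float32)

upper_edge_filter = torch.tensor([[ 1,  0, -1],
                                  [ 1,  1, -1],
                                  [-1,  0,  1]], dtype=torch.float32)

# conv2d expects tensors of shape (batch, channels, height, width)
for kernel in (right_edge_filter, upper_edge_filter):
    feature_map = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))
    print(feature_map.squeeze())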

Usually we want a convolutional layer to calculate several feature maps. For that purpose a convolutional layer learns several filters, each with different weights and bias. The result of a convolutional layer is therefore not a single 2D image, but a 3D cube.

Similarly, we will not always deal with 1-channel greyscale images. Instead we will either deal with colored images or with three-dimensional feature maps that come from a previous convolutional layer. When we are dealing with several channels as inputs, our filters gain a channel dimension as well. That means that each neuron attends to a three-dimensional receptive field. Below, for example, the receptive field is 3x3x3, which in turn requires a filter with 27 weights.
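
You can confirm how the channel dimension enters the filter shape by inspecting the weights of a PyTorch convolutional layer. The layer sizes below are chosen purely for illustration.

import torch.nn as nn

# 16 filters, each spanning 3 input channels and a 3x3 spatial window
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]) -> 16 filters with 27 weights each
print(conv.bias.shape)    # torch.Size([16]) -> one bias per filter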

Pooling Layer

While a convolutional layer is more efficient than a fully connected layer due to sparse connectivity and weight sharing, you can still get into trouble when you are dealing with images of high resolution. The requirements on your computational resources can grow out of proportion. The pooling layer is intended to alleviate this problem by downsampling the image. That means that we use a pooling layer to reduce the resolution of an image.

The convolutional layer already downsamples an image to some extent: if you don't use padding when you apply the convolution, your image is going to shrink, especially if you use a stride above 1. The pooling layer downsamples in a different manner, while requiring no additional weights at all. That makes the pooling operation extremely efficient.

Similar to a convolutional layer, a pooling layer has a receptive field and a stride. Usually the size of the receptive field and the stride are identical: if the receptive field is 2x2, the stride is 2 as well. That means each output of the pooling layer attends to a unique patch of the input image and there is never an overlap.

The pooling layer applies a simple operation to each patch in order to downsample the image. The average pooling layer, for example, calculates the average of the receptive field. But the most common pooling layer is probably so-called max pooling. As the name suggests, this pooling operation only keeps the largest value of the receptive field. Below we provide an interactive example of max pooling in order to make the explanation more intuitive.

Input image (4x4):
 9  0  3  5
 2  1  7  2
 0  0  1  3
 1  2  6  0

Max pooled output (2x2):
 9  7
 2  6
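
The same result can be reproduced with nn.MaxPool2d. This is a short sketch with the example values hard-coded.

import torch
import torch.nn as nn

image = torch.tensor([[9., 0., 3., 5.],
                      [2., 1., 7., 2.],
                      [0., 0., 1., 3.],
                      [1., 2., 6., 0.]])

# kernel_size=2 implies a stride of 2, so the 2x2 patches never overlap
pool = nn.MaxPool2d(kernel_size=2)
print(pool(image.view(1, 1, 4, 4)).squeeze())
# tensor([[9., 7.],
#         [2., 6.]])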

There is one downside to downsampling though. While you make your images more manageable by reducing the resolution, you also lose some spatial information. The max pooling operation, for example, only keeps one of the four values and it is impossible to determine at a later stage in which location that value was stored. Pooling is often used for image classification and generally works great, but if you cannot afford to lose spatial information, you should avoid the layer.

Hierarchy of Features

A neural network architecture that is based on convolutional layers often follows a very familiar procedure. First we take an image with a low number of channels and apply a convolutional layer to it. That produces a stack of feature maps, let's say 16. We can regard the number of produced feature maps as a channel dimension, so that we are now faced with an image of dimension (16, W, H). As we know how to apply a convolutional layer to an image with many channels, we can stack several convolutional layers. The number of channels grows (usually as powers of 2: 16, 32, 64, 128 ...) as we move forward in the convolutional neural network, while the width and height dimensions shrink either naturally by avoiding padding or through pooling layers. Once the number of feature maps has grown sufficiently and the width and height of the images have shrunk dramatically, we can flatten all the feature maps and use a fully connected neural network in a familiar manner.

The remarkable success of ConvNets is usually attributed to this stacking of convolutional layers and the growing number of feature maps. In the first layer the receptive field is limited to a small area, therefore the network learns local features. As the number of layers grows, the subsequent layers start to learn features of features. Because of that, subsequent layers attend to a larger area of the original image. If the first neuron in the first layer attends to four pixels in the upper left corner, the first neuron in the second layer will attend to features built on 16 pixels of the original image (assuming a stride of 2). This hierarchical structure of feature detectors allows the network to find higher and higher level features, going for example from edges and colors to distinct shapes to actual objects. By the time we arrive at the last convolutional layer, we usually have more than 100 feature maps, each theoretically containing some higher level feature. Those features would be able to answer questions like: "Is there a nose?" or "Is there a tail?" or "Are there whiskers?". That is why the first part of a convolutional neural network is often called a feature extractor. The last fully connected layers leverage those features to predict the class of an image.

Below we present a convolutional neural network implemented in PyTorch. The convolutional layer and the pooling layer are implemented in nn.Conv2d() and nn.MaxPool2d() respectively. We separate the feature extractor and the classifier into individual nn.Sequential() modules, but theoretically you could structure the model any way you desire.

import torch
from torch import nn

class Model(nn.Module):

    def __init__(self):
        super().__init__()
        # three blocks of convolution -> ReLU -> max pooling
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=2, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=2, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=2, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )
                
        # flatten the 64 feature maps of size 2x2 into 256 values and classify
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 100),
            nn.ReLU(),
            nn.Linear(100, 10)
        )
                
    def forward(self, features):
        features = self.feature_extractor(features)
        logits = self.classifier(features)
        return logits

If you ask yourself where the number 256 in the first linear layer comes from: this is the number of values that remain after the last max pooling operation. There is an explicit formula to calculate the size of your feature maps and you can read about it in the PyTorch documentation, but it is usually much more convenient to create a dummy input, pass it through your feature extractor and deduce the number of features.

X = torch.randn(32, 1, 28, 28)
model = Model()
with torch.inference_mode():
    print(model.feature_extractor(X).shape)
torch.Size([32, 64, 2, 2])

Above, for example, we assume that we are dealing with the MNIST dataset. Each image is of shape (1, 28, 28) and the batch size is 32. After the input is processed by the feature extractor, we end up with a tensor of shape (32, 64, 2, 2), which means that we have a batch of 32 images consisting of 64 channels, each of size 2x2. When we multiply 64x2x2 we end up with the number 256.
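
If you prefer to trace the shrinking spatial dimensions by hand, the following sketch applies the same output-size formula as before to every layer of the feature extractor. The loop is our own helper logic, not part of PyTorch.

size = 28                          # MNIST height/width
for _ in range(3):                 # three conv/pool blocks
    size = (size - 2) // 1 + 1     # Conv2d with kernel_size=2 and no padding
    size = (size - 2) // 2 + 1     # MaxPool2d with kernel_size=2 (stride 2)
    print(size)
# prints 13, 6, 2 -> the final feature maps are 2x2, so 64 * 2 * 2 = 256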