Vision Transformer

We have mentioned before that the transformer has become the Swiss army knife of the deep learning community. Its general-purpose architecture can be applied to many modalities. In this section we discuss how layers of transformer encoders can be used for computer vision, in the so-called vision transformer [1].

For the sake of explanation let's assume that we are dealing with images of size 4x4 pixels.

The naive approach would be to let every pixel attend to every other pixel in the image, but the cost of computing attention grows quadratically with the number of pixels and quickly explodes as the image gets larger. Instead we divide the image into square patches. For our small example we could create patches of size 2x2.
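A minimal sketch of this patching step, assuming PyTorch and a single-channel 4x4 toy image (the values and sizes are illustrative, not from the paper):

```python
import torch

# Toy example: one single-channel 4x4 image, split into non-overlapping 2x2 patches.
image = torch.arange(16.0).reshape(1, 1, 4, 4)   # (batch, channels, height, width)
p = 2                                             # patch size

# Slide a non-overlapping 2x2 window over height and width.
patches = image.unfold(2, p, p).unfold(3, p, p)   # (1, 1, 2, 2, 2, 2): a 2x2 grid of 2x2 patches
print(patches[0, 0, 0, 0])                        # top-left patch:     [[ 0.,  1.], [ 4.,  5.]]
print(patches[0, 0, 1, 1])                        # bottom-right patch: [[10., 11.], [14., 15.]]
```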

Remember that the transformer expects vector embeddings as inputs. To achieve that, we first flatten each patch into a vector.
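In code, patch extraction and flattening can be fused into a single reshape. The sketch below (same toy image and patch size as above, assuming PyTorch) turns the 4x4 image directly into four flattened patch vectors:

```python
import torch

image = torch.arange(16.0).reshape(1, 1, 4, 4)              # (batch, channels, H, W)
p = 2                                                         # patch size

# Rearrange (B, C, H, W) -> (B, num_patches, patch_dim) where patch_dim = C * p * p.
b, c, h, w = image.shape
patch_vectors = (
    image.reshape(b, c, h // p, p, w // p, p)
         .permute(0, 2, 4, 1, 3, 5)                           # bring the two patch-grid axes next to the batch axis
         .reshape(b, (h // p) * (w // p), c * p * p)
)
print(patch_vectors.shape)  # torch.Size([1, 4, 4]): 4 patches, each flattened to a 4-dimensional vector
```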

Next we project the flattened patches linearly by running each vector through the same linear layer, i.e. with shared weights and bias. The result is comparable to the token embeddings in the traditional transformer. We additionally create a separate learnable embedding (the vector on the left in the image below). This embedding is the classification token and will be used at a later stage to classify our images.

Linear Projection
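A rough sketch of this projection step, assuming PyTorch; the embedding size of 8 is an arbitrary toy choice, not a value from the paper:

```python
import torch
import torch.nn as nn

# Four flattened 2x2 patches (4 values each) projected to an embedding size of 8.
num_patches, patch_dim, embed_dim = 4, 4, 8
patch_vectors = torch.randn(1, num_patches, patch_dim)       # (batch, patches, patch_dim)

# One shared linear layer projects every patch vector with the same weights and bias.
projection = nn.Linear(patch_dim, embed_dim)
patch_embeddings = projection(patch_vectors)                  # (1, 4, 8)

# A single learnable classification token is prepended to the sequence.
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens = torch.cat([cls_token.expand(1, -1, -1), patch_embeddings], dim=1)
print(tokens.shape)  # torch.Size([1, 5, 8]): CLS token + 4 patch embeddings
```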

At this point our model cannot distinguish between the different positions of the vectors in the image. Therefore we create positional embeddings and add them to the linear projections.

Add Positional Embedding
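A sketch of this addition, assuming learned positional embeddings and the toy sizes from before (one embedding per token, i.e. the classification token plus four patches):

```python
import torch
import torch.nn as nn

num_tokens, embed_dim = 5, 8
tokens = torch.randn(1, num_tokens, embed_dim)                # output of the linear projection step

# One learnable positional embedding per token; randomly initialised here, learned during training.
pos_embedding = nn.Parameter(torch.randn(1, num_tokens, embed_dim) * 0.02)
tokens = tokens + pos_embedding                               # same shape, now position-aware
print(tokens.shape)  # torch.Size([1, 5, 8])
```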

Finally we pass those embeddings through layers of encoders and allow each embedding to attend to every other embedding. At the output layer we ignore all vectors except the one corresponding to the classification token. This token is the input to a fully connected layer, and we train the whole network jointly on a classification task.

Encoder Layers
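Putting this last step into code, here is a rough sketch using PyTorch's built-in encoder layers; the number of layers, heads and classes are toy assumptions, not the configuration from the paper:

```python
import torch
import torch.nn as nn

embed_dim, num_layers, num_heads, num_classes = 8, 2, 2, 3

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
classifier = nn.Linear(embed_dim, num_classes)

tokens = torch.randn(1, 5, embed_dim)         # CLS token + 4 patch embeddings, positions already added
encoded = encoder(tokens)                      # every token attends to every other token
logits = classifier(encoded[:, 0])             # keep only the CLS token for classification
print(logits.shape)  # torch.Size([1, 3])
```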

References

  1. Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR (2021).