VQ-VAE

The vector quantised variational autoencoder (VQ-VAE) [1] was developed by DeepMind in 2017. While the paper is considered quite old in deep learning terms, the techniques we cover in this section are still relevant today: modern text-to-image generative models rely heavily on VQ-VAE.

The major idea of the paper is to quantise the encoder outputs, which means that the latent variables produced by the encoder are mapped to discrete values. For that purpose we require a so-called codebook.

Figure: The codebook, a table of $K$ embedding vectors, each of dimensionality $D$.

The codebook contains $K$ embeddings $e_i \in \mathbb{R}^D$ of dimensionality $D$. An encoder output $z_e(x)$ is compared to each individual embedding and the index $k$ of the nearest neighbour is used as the quantised value: $k = \arg\min_i \lVert z_e(x) - e_i \rVert_2$, so the decoder receives $z_q(x) = e_k$. Let's look at a stylized example of a convolutional VQ-VAE and try to understand what exactly that means.

Figure: A stylized convolutional VQ-VAE. The encoder output (a 5x5x3 feature map) is compared against a codebook of 6 embeddings; the resulting 5x5 matrix of nearest-neighbour indices is mapped back to the embeddings and fed to the decoder.

Let's assume for simplicity that the encoder produces a 5x5x3 feature map. The number of channels has to match the embedding dimensionality $D$, because we essentially compare each embedding from the codebook to each of the vectors along the channel dimension. So in our case the codebook has vectors of size 3. For each pixel location, the quantised (blue) matrix contains the index of the embedding that is closest to the encoder output at that particular pixel location. The decoder takes that quantised matrix and uses the corresponding embeddings from the codebook as input, instead of the encoder output.
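Below is a minimal PyTorch sketch of this lookup. The codebook size of 6 and the 5x5x3 encoder output mirror the stylized example above; the tensor names (`codebook`, `z_e`, `z_q`) are illustrative assumptions, not names from the paper.

```python
import torch

K, D = 6, 3
codebook = torch.randn(K, D)      # learnable embeddings e_1 ... e_K
z_e = torch.randn(1, D, 5, 5)     # encoder output: batch x D x 5 x 5

# Flatten spatial locations so every pixel becomes one D-dimensional vector.
flat = z_e.permute(0, 2, 3, 1).reshape(-1, D)     # (25, D)

# Squared Euclidean distance of every vector to every codebook entry.
distances = torch.cdist(flat, codebook) ** 2      # (25, K)
indices = distances.argmin(dim=1)                 # nearest-neighbour index per pixel location

# The decoder receives the corresponding embeddings instead of the encoder output.
z_q = codebook[indices].reshape(1, 5, 5, D).permute(0, 3, 1, 2)
```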

This lookup approach poses a problem: the nearest-neighbour selection is not differentiable, so we can not backpropagate from the decoder input to the encoder output. Instead we assume that the gradients at the decoder input and the encoder output are approximately the same and simply copy them from the decoder input to the encoder output (the straight-through estimator).
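Continuing with the names from the sketch above, the straight-through gradient copy is commonly written as follows (a sketch, not the paper's reference code):

```python
# Forward pass: the decoder sees the quantised values z_q.
# Backward pass: (z_q - z_e).detach() has no gradient, so the gradient
# at the decoder input flows straight into z_e, i.e. into the encoder.
z_q_st = z_e + (z_q - z_e).detach()
decoder_input = z_q_st
```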

The loss function consists of three parts. First we measure the usual reconstruction loss between the input $x$ and the decoder output $\hat{x}$. Second we measure the VQ loss as the difference between the frozen (stop-gradient) encoder outputs and the embeddings; this moves the embeddings closer to the encoder outputs. Lastly we measure the commitment loss as the difference between the frozen embeddings and the encoder outputs. That makes sure that the encoder commits to particular embeddings and that its outputs do not grow arbitrarily large.

$$L = \log p(x \mid z_q(x)) + \lVert \text{sg}[z_e(x)] - e \rVert_2^2 + \beta \lVert z_e(x) - \text{sg}[e] \rVert_2^2$$

Here $\text{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ weights the commitment loss.
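As a sketch, the three terms can be computed as below, using mean squared error for the reconstruction term and `detach()` as the stop-gradient. The `decoder` module, the input `x`, and the tensors from the previous snippets are assumptions for illustration.

```python
import torch.nn.functional as F

beta = 0.25                                   # commitment weight, the value used in the paper

x_hat = decoder(z_q_st)                       # reconstruction from the quantised latents

recon_loss = F.mse_loss(x_hat, x)             # reconstruction term
vq_loss = F.mse_loss(z_q, z_e.detach())       # move embeddings towards frozen encoder outputs
commit_loss = F.mse_loss(z_e, z_q.detach())   # keep encoder outputs close to frozen embeddings

loss = recon_loss + vq_loss + beta * commit_loss
```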

Autoregressive Prior - PixelCNN

VQ-VAE2

References

  1. van den Oord, Aaron and Vinyals, Oriol and Kavukcuoglu, Koray. Neural Discrete Representation Learning. Advances in Neural Information Processing Systems. Vol. 30. (2017).