GPT

GPT, short for generative pre-training, is a family of models developed by researchers at OpenAI. GPT is a decoder-based transformer model without any encoder interaction: we simply stack decoder layers on top of each other.

Figure: The GPT architecture is a stack of decoder layers.

Because the GPT architecture has no encoder, there is no need for cross-attention. This simplifies the decoder layer to just two sublayers: masked multi-head attention and a position-wise feed-forward neural network.

Figure: A single GPT decoder layer: masked multi-head attention with Add & Norm, followed by a position-wise feed-forward network with Add & Norm.
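
To make this structure concrete, here is a minimal sketch of such a decoder layer in PyTorch. The dimensions (d_model, n_heads, d_ff) and the class name are illustrative assumptions, not values from any particular GPT release.

from transformers import pipeline  # not needed here; see the pipeline example further below
import torch
import torch.nn as nn

class GPTDecoderLayer(nn.Module):
    """One GPT-style decoder layer: masked self-attention and a position-wise
    feed-forward network, each followed by a residual connection and LayerNorm."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, causal_mask):
        # Masked multi-head self-attention, then Add & Norm
        attn_out, _ = self.attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward network, then Add & Norm
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

layer = GPTDecoderLayer()
x = torch.randn(1, 6, 768)                                # (batch, seq_len, d_model)
mask = torch.triu(torch.ones(6, 6, dtype=torch.bool), 1)  # hide future positions
print(layer(x, mask).shape)                               # torch.Size([1, 6, 768])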

The training objective of GPT is quite simple: given some tokens from a sentence, predict the next token. For example, if GPT is given the three words "what is your", the stack of decoders should predict words like "name", "age" or "weight".

Figure: Given the input "what is your", the model predicts the next token "name".
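
In practice this objective is usually implemented as a cross-entropy loss in which the prediction at position i is compared with the token at position i+1. The following is a rough sketch with made-up tensor shapes and token ids, assuming PyTorch:

import torch
import torch.nn.functional as F

# Toy example: logits from the decoder stack for a sequence of 4 tokens
# over a vocabulary of 10 words, plus the corresponding token ids.
logits = torch.randn(1, 4, 10)            # (batch, sequence, vocab)
token_ids = torch.tensor([[2, 7, 5, 9]])  # "what is your name" as ids (made up)

# Shift by one: position i predicts token i+1, so the last position has no target.
pred = logits[:, :-1, :].reshape(-1, 10)
target = token_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)
print(loss)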

Unlike BERT, GPT is unidirectional, so we need to mask out future words. A token must never attend to future tokens, as that would contaminate the training process by letting it attend to the very tokens it is trying to predict. So if we want to predict the fourth token from the third embedding produced by the layers of decoders, the third embedding may attend to all previous tokens, including itself, but not to the token it tries to predict.

Figure: Causal attention in the GPT family: each token T_i may attend only to tokens T_1 through T_i.
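
One common way to implement this masking (a sketch, not necessarily how OpenAI's code does it) is to fill every attention score above the diagonal with negative infinity before the softmax, so future positions receive zero weight:

import torch

seq_len = 6
scores = torch.randn(seq_len, seq_len)  # raw attention scores between the 6 tokens

# Positions above the diagonal correspond to future tokens; mask them out.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights)  # row i has non-zero weights only for tokens 1..i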

Based on the pre-training task, it is obvious that GPT can be used for text generation. You provide the model with a starting text and it generates word after word to finish your writing, feeding the previously generated words back in as input in a recursive manner.
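
Conceptually the generation loop looks like the sketch below, here with greedy decoding and a hypothetical `model` that returns next-token logits; real systems typically sample from the distribution instead of taking the argmax:

import torch

def generate(model, token_ids, n_new_tokens):
    """Greedy autoregressive generation: run the sequence through the model,
    append the most likely next token, and repeat with the extended sequence."""
    for _ in range(n_new_tokens):
        logits = model(token_ids)                # (batch, seq_len, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids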

GPT is not a single model, but a family of models. By now we have GPT-1[1], GPT-2[2], GPT-3[3] and GPT-4. With each new iteration OpenAI increased the size of the models and of the datasets they were trained on. It became clear that you could scale transformer-like models up through size and data and the performance would keep improving. Unfortunately for the deep-learning community, starting with GPT-2 OpenAI decided not to release the pre-trained models to the public, citing security concerns and profit considerations. While the weights of GPT-2 were eventually released, the public has no direct access to GPT-3 or GPT-4; you can interact with the newest GPT models only through the OpenAI API. Luckily there are organizations, like EleutherAI, that attempt to replicate the OpenAI GPT-3/GPT-4 models. At the time of writing their largest model, GPT-NeoX-20B, consists of 20 billion parameters, but they plan to train even larger models to match the performance of the newest models by OpenAI.

We can use the transformers library by HuggingFace to interact with GPT-2. The easiest way to do so is the text-generation pipeline. A pipeline abstracts away most of the code running in the background: we do not need to take care of the tokenizer or the model ourselves. Just by passing 'text-generation' as the input, HuggingFace downloads the GPT-2 weights and lets us generate text.

from transformers import pipeline

# The "text-generation" pipeline downloads GPT-2 by default and wires up
# the tokenizer and the model for us.
generator = pipeline("text-generation")
prompt = (
    "In the year 2035 humanity will have created human level artificial intelligence."
)
outputs = generator(prompt, max_length=100)
print(outputs[0]["generated_text"])

When we use the prompt "In the year 2035 humanity will have created human level artificial intelligence." we might get the following result.

In the year 2035 humanity will have created human level artificial intelligence. Some experts believe that the first phase of AI could arrive around 2030 and be capable of being applied in a multitude of applications, including transportation, health or the arts.

As humans live longer and more efficiently and interact more quickly with computers, we will probably see a gradual step towards being a multi-systemed society. A society which will have more people involved and will focus on information management, education, marketing, governance,

This is relatively cohesive, but the results will depend on your initial prompt and will change each time you run the code.
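
If you want the output to be reproducible across runs, the transformers library provides a `set_seed` helper that fixes the relevant random number generators. The sketch below also names the model explicitly ("gpt2" is the checkpoint the pipeline uses by default):

from transformers import pipeline, set_seed

set_seed(42)  # fix the Python, NumPy and PyTorch RNGs so sampling is repeatable
generator = pipeline("text-generation", model="gpt2")
outputs = generator(
    "In the year 2035 humanity will have created human level artificial intelligence.",
    max_length=100,
)
print(outputs[0]["generated_text"])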

References

  1. Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. (2018).
  2. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. Language Models are Unsupervised Multitask Learners. (2019).
  3. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language Models are Few-Shot Learners. (2020).