BERT
When the original transformer paper was released in 2017, it was not clear what tremendous impact the architecture would have on deep learning. Slowly but surely, researchers from different labs began to release transformer-based architectures, and as time passed, more and more areas were conquered by transformers. One of the first models to garner a lot of attention was BERT from Google[1]. BERT is a pre-trained language model that allows practitioners to fine-tune it to their specific needs. Nowadays BERT (and its relatives) is the de facto standard tool for transfer learning in natural language processing.
Architecture
BERT is short for Bidirectional Encoder Representations from Transformers. As the name suggests, the model architecture consists solely of a stack of transformer encoders, without a decoder.
The original BERT paper introduced two models. BERT-Base consists of 12 encoder layers, each with 12 attention heads and a hidden size of 768 per token. BERT-Large, on the other hand, uses 24 layers with 16 attention heads and a hidden size of 1024.
Each layer takes a certain number of tokens and outputs the same number of tokens; the number of tokens never changes between the encoder layers.
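We can verify these numbers directly with 🤗 Transformers. The sketch below inspects the configuration of bert-base-uncased and confirms that every encoder layer produces one 768-dimensional vector per input token (the example sentence is just a placeholder):
import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

# the BERT-Base hyperparameters described above
config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)  # 12 12 768

# every layer outputs one hidden vector of size 768 per input token
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("a short example sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
print([h.shape for h in outputs.hidden_states])  # 13 identical shapes: embeddings + 12 encoder layers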
The bidirectional part means that the encoder can pay attention to all tokens contained in the sequence. For that reason BERT is primarily used for tasks that have access to the full sequence at inference time. For example, in a classification task we process the whole sequence in order to classify the sentence.
In the next section we will additionally encounter the GPT family of models, which are created by stacking transformer decoders. A GPT-like decoder generates each output token based only on the current and previous input embeddings/tokens.
Pre-Training
BERT was pre-trained jointly on two different objectives: masked language modeling and next sentence prediction.
In masked language modeling, we randomly replace some parts of the original sequence with a special [MASK] token, and the model has to predict the missing word. Let's look at the sentence below to understand why this task matters for pre-training.
today i went to a [MASK] to get my hair done
While you probably have a couple of options to fill in the masked word, your options are still limited by logic and the rules of the English language. The word "hairdresser" is probably the most likely option, but "salon" or even "friend" are also valid choices. When a model learns to replace [MASK] with a valid word, that shows that, at the very least, the model has learned the basic statistics that govern the English language. And those statistics are useful not only for the task at hand, but also for many other tasks that require natural language understanding.
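We can try this out directly with a pre-trained BERT. The sketch below uses the 🤗 Transformers fill-mask pipeline with bert-base-uncased; the exact ranking of the predictions may differ slightly between model versions.
from transformers import pipeline

# BERT pre-trained on masked language modeling can fill in the [MASK] token
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("today i went to a [MASK] to get my hair done")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))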
In the next sentence prediction task, the model is shown two sequences and has to predict whether the second sequence is the logical continuation of the first.
The two sentences below, for example, seem to have a logical connection, so we expect the model to return true.
Today I went to a salon to get my hair done. The hairdresser did a great job.
The next two sentences, on the other hand, are unrelated and the model should return false.
Today I went to a salon to get my hair done. The library closes at nine in the evening.
Similar to the masked language model, solving this task requires understanding the English language. Once a model is competent at solving it, we can assume that it has learned some important statistics of the language, and those statistics can be used for downstream tasks.
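With 🤗 Transformers we can query the pre-trained next sentence prediction head directly. The sketch below feeds the two sentence pairs from above into BertForNextSentencePrediction; index 0 of the logits stands for "is the next sentence" and index 1 for "is a random sentence".
import torch
from transformers import AutoTokenizer, BertForNextSentencePrediction

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

first = "today i went to a salon to get my hair done"
related = "the hairdresser did a great job"
unrelated = "the library closes at nine in the evening"

for second in [related, unrelated]:
    inputs = tokenizer(first, second, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # probability that the second sentence follows the first
    print(second, logits.softmax(dim=-1)[0, 0].item())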
Let's for a second assume that we are training on two sentences, each consisting of two words (two tokens).
We first prepend the two sentences with the [CLS] token. Once this token has been processed by the stack of encoders, it is used for the binary classification that determines whether the second sentence should follow the first: the next sentence prediction task. We separate the two sentences with the [SEP] token and additionally append this token at the end of the second sentence. Finally, we mask out some of the words for the masked language model. The masked tokens are also processed by the layers of encoders and are used to predict the correct tokens that were masked out. Both losses are added together for the gradient descent step.
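This special-token layout is exactly what the BERT tokenizer produces when we hand it a pair of sentences. A small sketch, where the two two-word sentences are just placeholders:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# encode a sentence pair; the tokenizer inserts [CLS] and [SEP] for us
encoding = tokenizer("my hair", "looks great")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# ['[CLS]', 'my', 'hair', '[SEP]', 'looks', 'great', '[SEP]']
print(encoding["token_type_ids"])  # 0 for the first sentence, 1 for the second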
It is important to mention that both tasks are trained using self-supervised learning: we do not require anyone to collect and label a dataset. We can, for example, use Wikipedia articles, pick out random consecutive sentences, and mask out some of the tokens. This gives us a huge dataset for pre-training.
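To illustrate how the masked inputs can be generated automatically from raw text, here is a small sketch using the DataCollatorForLanguageModeling helper from 🤗 Transformers; 15% is the masking probability used in the BERT paper.
import torch
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# the collator randomly selects ~15% of the tokens and builds the MLM labels for us
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
batch = collator([tokenizer("today i went to a salon to get my hair done")])

print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))  # some tokens may be replaced by [MASK]
print(batch["labels"][0])  # original ids at the selected positions, -100 everywhere else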
Fine-Tuning
When we have a labeled language dataset, we can use BERT for fine-tuning. How the pre-trained BERT model is used for fine-tuning depends on the task at hand.
Let's assume we are dealing with a classification task, like sentiment analysis. The first token, [CLS], is the start token and is generally used for classification tasks. We can take the embedding of this class token from the last encoder layer, feed it into a classification layer, and ignore the rest.
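In code this could look roughly as follows. This is only a sketch: the linear layer is randomly initialized here and would of course be trained during fine-tuning, and the input sentence is just a placeholder.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# the embedding of the first token ([CLS]) from the last encoder layer
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, 768)

# a classification head on top of the [CLS] embedding, e.g. negative vs. positive
classifier = torch.nn.Linear(model.config.hidden_size, 2)
logits = classifier(cls_embedding)
print(logits.shape)  # (1, 2)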
BERT is designed for fine-tuning, so it makes little sense to train the model from scratch. Instead we will use the pre-trained BERT weights to solve the task at hand. Nowadays the most convenient way to use BERT is with the help of the 🤗 Hugging Face ecosystem, which includes models, pre-trained weights, datasets and much more. We will make heavy use of it in the future, and not only for natural language processing.
import numpy as np
import torch
import evaluate
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)
# use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch_size = 64
model_ckpt = "bert-base-uncased"
dataset_name = "sst2"

# the Stanford Sentiment Treebank: sentences labeled as positive or negative
dataset = load_dataset(dataset_name)

# tokenize the whole dataset once, padding and truncating to a common length
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    return tokenizer(batch["sentence"], padding=True, truncation=True)

tokenized_dataset = dataset.map(tokenize, batched=True, batch_size=None)

# pre-trained BERT with a randomly initialized classification head on top
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt).to(device)

# report accuracy during evaluation
metric_name = "accuracy"
metric = evaluate.load(metric_name)
def compute_metrics(pred):
    # pred contains the model logits and the true labels of the evaluation set
    logits, labels = pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
training_args = TrainingArguments(
    output_dir="bert",
    num_train_epochs=1,
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    evaluation_strategy="epoch",  # evaluate at the end of every epoch
)
# fine-tune the pre-trained model on the labeled SST-2 training split
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
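After training finishes, we can run a quick sanity check with the pipeline we imported above. This is only a sketch: since we did not set an explicit id2label mapping, the labels are reported as LABEL_0 and LABEL_1, where LABEL_1 corresponds to positive sentiment in SST-2.
classifier = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
)
print(classifier("this movie was an absolute delight"))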