Data, Modules, Optimizers, Losses

In the last sections we showed a very simple implementation of a neural network using PyTorch. In practice, however, PyTorch provides many more facilities that make neural network training efficient and scalable. This section is dedicated to those facilities.

Data

So far we have looked at very small datasets and were not particularly concerned with how to manage the data, but deep learning depends on lots and lots of data, and we need to be able to store, manage and retrieve it. When we retrieve the data we need to make sure that we don't exceed the capacity of our RAM or VRAM (video RAM). PyTorch lets us structure the data pipeline the way we see fit by providing the Dataset and the DataLoader classes.

from torch.utils.data import Dataset, DataLoader

The Dataset object is the PyTorch representation of data. When we are dealing with real-world data we subclass the Dataset class and override the __getitem__ and __len__ methods. Below we create a dataset that contains a list of numbers, the size of which depends on the size parameter of the __init__ method. The __getitem__ method implements the logic that determines how an individual element of our data is returned, given only its index.

class ListDataset(Dataset):
    def __init__(self, size):
        self.data = list(range(size))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx]

We use the ListDataset to create a list with 100 elements from 0 to 99.

dataset = ListDataset(100)
print(len(dataset))
print(dataset[42])
 100
 42

In practice we could, for example, use the Dataset to load an image for the index received by the __getitem__ method. Below is a dummy implementation of such a Dataset, where open_image is a placeholder for the actual loading logic.

class ImagesDataset(Dataset):
    def __init__(self, images_list):
        # list containing the paths to the images, e.g.
        # ["/images/image0.jpg", "/images/image1.jpg"]
        self.images_list = images_list
    
    def __len__(self):
        return len(self.images_list)
    
    def __getitem__(self, idx):
        file = self.images_list[idx]
        image = open_image(file)
        return image
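
The open_image function above could be implemented with any image library. A minimal sketch, assuming the Pillow and torchvision packages (neither is used elsewhere in this chapter), might look as follows.

from PIL import Image
from torchvision import transforms

# converts a PIL image into a float tensor with values between 0 and 1
to_tensor = transforms.ToTensor()

def open_image(file):
    # read the image from disk and turn it into a tensor
    return to_tensor(Image.open(file).convert("RGB"))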

During the training process we interact directly only with the DataLoader object and not with the Dataset object. The goal of the DataLoader is to return data in batch-sized pieces. Those batches can then be used for training or testing purposes. But what exactly is a batch? The batch size tells us how much of the dataset is used to calculate the gradients before a single gradient descent step is taken.

The approach of using the whole dataset to calculate the gradient is called batch gradient descent. Using the whole dataset has the advantage that we get a good estimate of the gradients, yet batch gradient descent is rarely used in practice. We often have to deal with datasets consisting of thousands of features and millions of samples. It is not possible to load all that data onto the GPU. Even if it were possible, it would take a lot of time to calculate the gradients for all samples just to take a single training step.

In stochastic gradient descent we introduce some stochasticity by shuffling the dataset randomly and using one sample at a time to calculate the gradient and take a gradient descent step, until we have used all samples in the dataset. The advantage of stochastic gradient descent is that we do not have to wait for the gradients of all samples to be calculated, but in the process we lose the parallelization advantage that we get with batch gradient descent. A gradient calculated from a single sample is going to be noisy, but by iterating over the whole dataset the sum of the individual steps still moves the weights and biases towards the optimum. In fact this behaviour is often seen as an advantage, because the imprecise gradient can potentially push the parameters out of a local minimum.

Mini-batch gradient descent combines the advantages of stochastic and batch gradient descent. Instead of using one sample at a time, several samples are used to calculate the gradients. Similar to the learning rate, the mini-batch size is a hyperparameter that needs to be chosen by the developer. Usually the size is a power of 2, for example 32, 64, 128 and so on; you just need to remember that the batch has to fit into the memory of your graphics card. The gradient calculation with mini-batches can be parallelized, because we can distribute the samples over different cores of the CPU/GPU. Additionally, this approach allows the training dataset to be, in principle, as large as we want.
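
The three variants differ only in how many samples end up in a single batch. As a rough sketch (reusing the ListDataset from above), the same DataLoader covers all three cases through the batch_size argument, which we look at in detail next.

# stochastic gradient descent: one sample per update step
sgd_loader = DataLoader(dataset=dataset, batch_size=1, shuffle=True)

# mini-batch gradient descent: for example 32 samples per update step
minibatch_loader = DataLoader(dataset=dataset, batch_size=32, shuffle=True)

# batch gradient descent: the whole dataset for a single update step
batch_loader = DataLoader(dataset=dataset, batch_size=len(dataset), shuffle=True)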

The DataLoader takes several arguments to control the details described above. The dataset argument expects a Dataset object that implements the __len__ and __getitem__ interface. The batch_size parameter determines the size of the mini-batch; the default value is 1, which corresponds to stochastic gradient descent. The shuffle parameter is a boolean value that determines whether the dataset is shuffled at the beginning of the iteration process. The default value is False.

Let's generate a ListDataset with just 5 elements for demonstration purposes.

dataset = ListDataset(5)

We generate a DataLoader that shuffles the dataset object and returns 2 samples at a time.

dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True)

Finally we iterate over the DataLoader and receive one batch at a time. Because the 5 samples are not evenly divisible by the batch size of 2, the last batch contains only a single element.

for batch_num, data in enumerate(dataloader):
    print(f'Batch Nr: {batch_num+1} Data: {data}')
Batch Nr: 1 Data: tensor([4, 0])
Batch Nr: 2 Data: tensor([3, 1])
Batch Nr: 3 Data: tensor([2])

Often we want our batches to always be of equal size. If a batch is too small, the gradient estimate might be too noisy. To avoid that we can use the drop_last argument, which drops the last batch if it contains fewer than batch_size samples. The argument defaults to False.

dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=True)

When we do the same exercise again, we end up with fewer iterations.

for batch_num, data in enumerate(dataloader):
    print(f'Batch Nr: {batch_num+1} Data: {data}')
Batch Nr: 1 Data: tensor([0, 2])
Batch Nr: 2 Data: tensor([3, 4])

Each sample in the dataset is typically used several times in the training process. Each iteration over the whole dataset is called an epoch.

Info

An epoch is one complete pass over the dataset, in which all samples have been iterated over and used for gradient calculations.

If we want to train for several epochs, all we have to do is include an additional outer loop.

for epoch_num in range(2):
    for batch_num, data in enumerate(dataloader):
        print(f'Epoch Nr: {epoch_num + 1} Batch Nr: {batch_num+1} Data: {data}')
Epoch Nr: 1 Batch Nr: 1 Data: tensor([2, 1])
Epoch Nr: 1 Batch Nr: 2 Data: tensor([4, 0])
Epoch Nr: 2 Batch Nr: 1 Data: tensor([2, 4])
Epoch Nr: 2 Batch Nr: 2 Data: tensor([3, 1])

Oftentimes it is useful to fetch the next batch of data in a separate process while we are still calculating the gradients. The num_workers parameter determines the number of worker processes that load the data in parallel. The default is 0, which means that only the main process is used.

dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=True, num_workers=4)

We won't notice the speed difference in such a simple example, but the speedup with large datasets can be noticeable.

The DataLoader class provides more parameters than the ones shown here. We are not going to cover them just yet, because for the most part the usual parameters are sufficient; we will cover the special cases when the need arises. If you are faced with a problem that requires more control, have a look at the PyTorch documentation.
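
To give just one example of the extra control that is available (a sketch we will not need in this chapter): the collate_fn parameter determines how individual samples are combined into a batch. Below the samples are collected into a plain Python list instead of being stacked into a tensor.

def list_collate(samples):
    # samples is the list of individual dataset elements for one batch
    return list(samples)

dataloader = DataLoader(dataset=dataset, batch_size=2, collate_fn=list_collate)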

Training Loop

The training loop that we implemented when we solved our circular problem works just fine, but PyTorch provides much better approaches. Once our neural network architectures get more and more complex, we will be glad to have a more efficient training setup.

import torch
import sklearn.datasets as datasets
from torch.utils.data import DataLoader, Dataset

This time around we explicitly define some parameters as constants, and we use a much larger number of samples and neurons to demonstrate that PyTorch can handle them.

# parameters
DEVICE = ("cuda:0" if torch.cuda.is_available() else "cpu")
NUM_EPOCHS=10
BATCH_SIZE=1024
NUM_SAMPLES=1_000_000
NUM_FEATURES=10
ALPHA = 0.1

#number of hidden units in the first and second hidden layer
HIDDEN_SIZE_1 = 1000
HIDDEN_SIZE_2 = 500

We create a simple classification dataset with sklearn and construct a Dataset object.

class Data(Dataset):
    def __init__(self):
        X, y = datasets.make_classification(
            n_samples=NUM_SAMPLES, 
            n_features=NUM_FEATURES, 
            n_informative=7, 
            n_classes=2, 
        )

        self.X = torch.from_numpy(X).to(torch.float32)
        self.y = torch.from_numpy(y).to(torch.float32).view(-1, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
dataset = Data()
dataloader = DataLoader(dataset=dataset, 
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        drop_last=True,
                        num_workers=4)

This time we will start by looking at the desired end product, the training loop, to understand what we need in order to make our code clean, modular and scalable. Instead of calculating one layer after another, we compute the forward pass with a single call to the model. The model contains all the matrix multiplications and activation functions needed to predict the probability that the features belong to a certain class. The criterion is essentially a loss function; in our case it is the binary cross-entropy. The optimizer loops over all parameters of the model, applies gradient descent when we call optimizer.step() and clears all the gradients when we call optimizer.zero_grad().

def train(dataloader, model, criterion, optimizer):
    for epoch in range(NUM_EPOCHS):
        loss_sum = 0
        batch_nums = 0
        for batch_idx, (features, labels) in enumerate(dataloader):
            # move features and labels to GPU
            features = features.to(DEVICE)
            labels = labels.to(DEVICE)

            # ------ FORWARD PASS --------
            probs = model(features)

            # ------CALCULATE LOSS --------
            loss = criterion(probs, labels)

            # ------BACKPROPAGATION --------
            loss.backward()

            # ------GRADIENT DESCENT --------
            optimizer.step()

            # ------CLEAR GRADIENTS --------
            optimizer.zero_grad()

            # ------TRACK LOSS --------
            batch_nums += 1
            # detach() removes a tensor from a computational graph 
            # and cpu() move the tensor from GPU to CPU 
            loss_sum += loss.detach().cpu()

        print(f'Epoch: {epoch+1} Loss: {loss_sum / batch_nums}')

In order to make our calculations more modular, we will create a Module class. You can think of a module as a building block of a neural network; usually modules are those pieces of a network that we use over and over again. In essence, you create a neural network by defining and stacking modules. As we need to apply affine transformations several times, we put the logic of a linear layer into a separate class and call that class Module. This module initializes a weight matrix and a bias vector. For easier access at a later point we create an attribute parameters, which is just a list holding the weights and biases. We also implement the __call__ method, which contains the logic for the forward pass.

class Module:
    
    def __init__(self, in_features, out_features):
        self.W = torch.normal(mean=0, 
                              std=0.1, 
                              size=(out_features, in_features), 
                              requires_grad=True, 
                              device=DEVICE, 
                              dtype=torch.float32)
        self.b = torch.zeros(1, 
                             out_features, 
                             requires_grad=True, 
                             device=DEVICE, 
                             dtype=torch.float32)
        self.parameters = [self.W, self.b]
                
    def __call__(self, features):
        return features @ self.W.T + self.b

Our model needs an activation function, so we implement a sigmoid function.

def sigmoid(z):
    return 1 / (1 + torch.exp(-z))

The Model class is the abstraction of the neural network. We will need three fully connected layers, so the model initializes three linear modules. In the __call__ method we implement the forward pass of the neural network: when we call model(features), the features are processed layer by layer until the last layer is reached. Additionally we implement the parameters method, which returns the full list of parameters of the model.

class Model:
    
    def __init__(self):
        self.linear_1 = Module(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = Module(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = Module(HIDDEN_SIZE_2, 1)
        
    def __call__(self, features):
        x = self.linear_1(features)
        x = sigmoid(x)
        x = self.linear_2(x)
        x = sigmoid(x)
        x = self.linear_3(x)
        x = sigmoid(x)
        return x
    
    def parameters(self):
        parameters = [*self.linear_1.parameters, 
                      *self.linear_2.parameters,
                      *self.linear_3.parameters]
        return parameters

Below we test the forward pass with random numbers. Applying the forward pass of a predefined model should feel more intuitive than our previous implementations.

features = torch.randn(BATCH_SIZE, NUM_FEATURES).to(DEVICE)
model = Model()
output = model(features)
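
As a quick sanity check, the output should contain exactly one probability per sample in the batch.

print(output.shape)
# torch.Size([1024, 1]), i.e. (BATCH_SIZE, 1)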

The optimizer class is responsible for applying gradient descent and for clearing the gradients. Ours is a simple implementation of stochastic (mini-batch) gradient descent, but PyTorch ships many more optimizers, which we will study in future chapters. Our optimizer class needs the learning rate (alpha) and the parameters of the model. When we call step() we loop over all parameters and apply a gradient descent update; when we call zero_grad() we clear all the gradients. Notice that the optimizer logic works independently of the exact architecture of the model, which makes the code more manageable.

class SGDOptimizer:
    
    def __init__(self, parameters, alpha):
        self.alpha = alpha
        self.parameters = parameters
    
    def step(self):
        with torch.inference_mode():
            for parameter in self.parameters:
                parameter.sub_(self.alpha * parameter.grad)
                
    def zero_grad(self):
        with torch.inference_mode():
            for parameter in self.parameters:
                parameter.grad.zero_()

Finally we implement the loss function. Once again the calculation of the loss is independent of the model and the optimizer. When we change one of the components, we do not introduce any breaking changes: if we replace the cross-entropy with the mean squared error, our training loop keeps working (see the sketch after the next code block).

def bce_loss(outputs, labels):
    loss =  -(labels * torch.log(outputs) + (1 - labels) * torch.log(1 - outputs)).mean()
    return loss
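
To make the interchangeability concrete, here is a minimal sketch of a mean squared error criterion with the same signature. Mean squared error is not a good fit for a classification task, so this is purely to illustrate that swapping the criterion does not break the loop.

def mse_loss(outputs, labels):
    # average squared difference between predictions and labels
    return ((outputs - labels) ** 2).mean()

# a drop-in replacement for bce_loss:
# train(dataloader, model, mse_loss, optimizer)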

Now we have all the components that are required by our training loop.

model = Model()
optimizer = SGDOptimizer(model.parameters(), ALPHA)
criterion = bce_loss
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.44153448939323425
Epoch: 2 Loss: 0.26614147424697876
Epoch: 3 Loss: 0.1991310715675354
Epoch: 4 Loss: 0.16552086174488068
Epoch: 5 Loss: 0.14674726128578186
Epoch: 6 Loss: 0.13339845836162567
Epoch: 7 Loss: 0.12402357161045074
Epoch: 8 Loss: 0.11728055775165558
Epoch: 9 Loss: 0.11224914342164993
Epoch: 10 Loss: 0.1082562804222107

You can probably guess that PyTorch provides the classes and functions we implemented above out of the box. The PyTorch module torch.nn contains most of the classes and functions that we will require.

import torch.nn as nn

When we write custom PyTorch modules we need to subclass nn.Module. We have to wrap all trainable tensors in the nn.parameter.Parameter() class. This tells PyTorch to put those tensors into the parameters list (which is used by the optimizer), and the tensors are automatically tracked for gradient computation. Instead of defining __call__ as we did before, we define the forward method. PyTorch calls forward automatically when we call the module object. You should never call this method directly, because PyTorch performs additional work when the module object is called; so instead of module.forward(features) use module(features).

class Module(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.parameter.Parameter(torch.normal(mean=0, std=0.1, 
                              size=(out_features, in_features)))
        self.b = nn.parameter.Parameter(torch.zeros(1, out_features))

    def forward(self, features):
        return features @ self.W.T + self.b
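
As a quick check that the registration worked: nn.Module collects everything we wrapped in Parameter, so parameters() yields the weight matrix and the bias vector without any manual bookkeeping on our side.

module = Module(in_features=3, out_features=2)
for parameter in module.parameters():
    print(parameter.shape)
# torch.Size([2, 3])
# torch.Size([1, 2])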

The great thing about PyTorch modules is their composability. Modules created earlier can be used inside subsequent modules. Below, for example, we use the Module class defined above inside the Model module. In later chapters we will see how we can create blocks of arbitrary complexity using this simple approach.

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = Module(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = Module(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = Module(HIDDEN_SIZE_2, 1)
        
    def forward(self, features):
        x = self.linear_1(features)
        x = torch.sigmoid(x)
        x = self.linear_2(x)
        x = torch.sigmoid(x)
        x = self.linear_3(x)
        x = torch.sigmoid(x)
        return x

PyTorch naturally also provides loss functions and optimizers out of the box. We will use BCELoss, which calculates the binary cross-entropy loss. Optimizers are located in torch.optim. For now we will use stochastic gradient descent, but there are many more optimizers that we will encounter soon.

model = Model().to(DEVICE)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.4358866512775421
Epoch: 2 Loss: 0.26300883293151855
Epoch: 3 Loss: 0.1951223760843277
Epoch: 4 Loss: 0.16517716646194458
Epoch: 5 Loss: 0.14785249531269073
Epoch: 6 Loss: 0.1351807564496994
Epoch: 7 Loss: 0.12569186091423035
Epoch: 8 Loss: 0.11819736659526825
Epoch: 9 Loss: 0.11242685467004776
Epoch: 10 Loss: 0.10799615830183029
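
As a small preview of those other optimizers (a sketch only, the learning rate is illustrative), swapping the optimizer is once again a one-line change, because every optimizer in torch.optim exposes the same step() and zero_grad() interface.

# Adam is another optimizer from torch.optim; everything else stays the same
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train(dataloader, model, criterion, optimizer)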

PyTorch provides a lot of modules out of the box. An affine/linear transformation is such a common operation that PyTorch offers nn.Linear for it, so you should use nn.Linear instead of implementing your own version from scratch.

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = nn.Linear(HIDDEN_SIZE_2, 1)
    
    def forward(self, features):
        x = self.linear_1(features)
        x = torch.sigmoid(x)
        x = self.linear_2(x)
        x = torch.sigmoid(x)
        x = self.linear_3(x)
        x = torch.sigmoid(x)
        return x
model = Model().to(DEVICE)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.46121323108673096
Epoch: 2 Loss: 0.345653235912323
Epoch: 3 Loss: 0.26799750328063965
Epoch: 4 Loss: 0.20885568857192993
Epoch: 5 Loss: 0.16782595217227936
Epoch: 6 Loss: 0.14582592248916626
Epoch: 7 Loss: 0.1313050240278244
Epoch: 8 Loss: 0.12312141805887222
Epoch: 9 Loss: 0.11707331985235214
Epoch: 10 Loss: 0.11287659406661987

To finish this chapter, let us discuss one additional PyTorch convenience. You might have noticed that all modules and activation functions are called one after another, where the output of one module (or activation) is used as the input to the next. In that case we can pack all modules and activations into an nn.Sequential object. When we call that object, the components are executed in sequential order.

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
                nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1),
                nn.Sigmoid(),
                nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2),
                nn.Sigmoid(),
                nn.Linear(HIDDEN_SIZE_2, 1),
            )
    
    def forward(self, features):
        return self.layers(features)
model = Model().to(DEVICE)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.4605180025100708
Epoch: 2 Loss: 0.3372548818588257
Epoch: 3 Loss: 0.27341559529304504
Epoch: 4 Loss: 0.22028055787086487
Epoch: 5 Loss: 0.17632894217967987
Epoch: 6 Loss: 0.15047569572925568
Epoch: 7 Loss: 0.1337045431137085
Epoch: 8 Loss: 0.12339214235544205
Epoch: 9 Loss: 0.11565018445253372
Epoch: 10 Loss: 0.11087213456630707