Data, Modules, Optimizers, Losses
In the previous sections we showed a very simple implementation of a neural network using PyTorch. In practice, however, PyTorch provides many more features that make neural network training more efficient and scalable. This section is dedicated to those features.
Data
So far we have looked at very small datasets and were not necessarily concerned with how we would manage the data, but deep learning depends on lots and lots of data, and we need to be able to store, manage and retrieve it. When we retrieve the data we need to make sure that we don't exceed the capacity of our RAM or VRAM (video RAM). PyTorch gives us a flexible way to build our data pipeline the way we see fit by providing the Dataset and the DataLoader classes.
from torch.utils.data import Dataset, DataLoader
The Dataset object is the PyTorch representation of data. When we are dealing with real-world data we subclass the Dataset class and overwrite the __getitem__ and the __len__ methods. Below we create a dataset that contains a list of numbers, the size of which depends on the size parameter in the __init__ method. The __getitem__ method implements the logic that determines how an individual element of our data is returned, given only its index.
class ListDataset(Dataset):
    def __init__(self, size):
        self.data = list(range(size))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
We use the ListDataset to create a dataset with 100 elements, ranging from 0 to 99.
dataset = ListDataset(100)
print(len(dataset))
print(dataset[42])
100
42
In practice we could for example use the Dataset to load an image for the index received in the __getitem__ method. Below is a dummy implementation of such a Dataset.
class ImagesDataset(Dataset):
    def __init__(self, images_list):
        # list containing the paths of the images, e.g.
        # ["/images/image0.jpg", "/images/image1.jpg"]
        self.images_list = images_list

    def __len__(self):
        return len(self.images_list)

    def __getitem__(self, idx):
        file = self.images_list[idx]
        # open_image is a placeholder for an actual image-loading routine
        image = open_image(file)
        return image
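The open_image function above is only a stand-in. One possible sketch of such a function, assuming PIL and torchvision are installed, could look like the following; the exact preprocessing always depends on the task.

from PIL import Image
from torchvision import transforms

def open_image(file):
    # load the image from disk and convert it into a float tensor
    image = Image.open(file).convert("RGB")
    return transforms.ToTensor()(image)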
During the training process we only interact directly with the DataLoader object and not with the Dataset object. The goal of the DataLoader is to return data in batch-sized pieces. Those batches can then be used for training or testing purposes. But what exactly is a batch? The batch size tells us how many samples of the dataset are used to calculate the gradients before a single gradient descent step is taken.
The approach of using the whole dataset to calculate the gradient is called batch gradient descent. Using the whole dataset has the advantage that we get a good estimate of the gradients, yet in many cases batch gradient descent is not used in practice. We often have to deal with datasets consisting of thousands of features and millions of samples. It is not possible to load all that data onto the GPU. Even if it were possible, it would take a lot of time to calculate the gradients for all the samples in order to take just a single training step.
In stochastic gradient descent we introduce some stochasticity by shuffling the dataset randomly and using one sample at a time to calculate the gradient and to take a gradient descent step, until we have used all samples in the dataset. The advantage of stochastic gradient descent is that we do not have to wait for the calculation of gradients for all samples, but in the process we lose the advantages of parallelization that we get with batch gradient descent. The gradient calculated from a single sample is going to be noisy, but by iterating over the whole dataset the sum of the individual steps moves the weights and biases towards the optimum. In fact this behaviour is often seen as advantageous, because the imprecise gradient can potentially push the parameters out of a local minimum.
Mini-batch gradient descent combines the advantages of stochastic and batch gradient descent. Instead of using one sample at a time, several samples are used to calculate the gradients. Similar to the learning rate, the mini-batch size is a hyperparameter and needs to be determined by the developer. Usually the size is chosen as a power of 2, for example 32, 64, 128 and so on. You just need to remember that the batch needs to fit into the memory of your graphics card. The calculation of the gradients with mini-batches can be parallelized, because we can distribute the samples on different cores of the CPU/GPU. Additionally it has the advantage that, theoretically, our training dataset can be as large as we want.
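To get a feel for the trade-off, the short sketch below (purely illustrative, not part of the training code) counts how many gradient descent steps a single epoch takes in each of the three regimes, assuming a dataset of 1,000,000 samples.

import math

num_samples = 1_000_000

# batch gradient descent uses the whole dataset per step,
# stochastic gradient descent uses a single sample per step,
# mini-batch gradient descent uses e.g. 128 samples per step
for batch_size in [num_samples, 1, 128]:
    steps_per_epoch = math.ceil(num_samples / batch_size)
    print(f"batch size {batch_size:>9,} -> {steps_per_epoch:>9,} steps per epoch")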
The DataLoader takes several arguments to control the details described above. The dataset argument expects a Dataset object that implements the __len__ and __getitem__ interface. The batch_size parameter determines the size of the mini-batch. The default value is 1, which corresponds to stochastic gradient descent. The shuffle parameter is a boolean value that determines whether the dataset will be shuffled at the beginning of the iteration process. The default value is False.
Let's generate a ListDataset with just 5 elements for demonstration purposes.
dataset = ListDataset(5)
We generate a DataLoader that shuffles the dataset object and returns 2 samples at a time.
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True)
Finally we iterate through the DataLoader and receive one batch at a time. Because the dataset size of 5 is not divisible by the batch size of 2, the last batch contains only a single element.
for batch_num, data in enumerate(dataloader):
    print(f'Batch Nr: {batch_num+1} Data: {data}')
Batch Nr: 1 Data: tensor([4, 0])
Batch Nr: 2 Data: tensor([3, 1])
Batch Nr: 3 Data: tensor([2])
Often we want our batches to always be of equal size. If a batch is too small, the calculation of the gradient might be too noisy. To avoid that we can use the drop_last argument. The drop_last parameter drops the last batch if it contains fewer than batch_size samples. The argument defaults to False.
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=True)
When we do the same exercise again, we end up with fewer iterations.
for batch_num, data in enumerate(dataloader):
    print(f'Batch Nr: {batch_num+1} Data: {data}')
Batch Nr: 1 Data: tensor([0, 2])
Batch Nr: 2 Data: tensor([3, 4])
Each sample in the dataset is typically used several times in the training process. Each iteration over the whole dataset is called an epoch.
Info
An epoch is one full pass over the dataset, in which all samples have been used for gradient calculations.
If we want to use several epochs in a training loop, all we have to do is to include an additional outer loop.
for epoch_num in range(2):
    for batch_num, data in enumerate(dataloader):
        print(f'Epoch Nr: {epoch_num + 1} Batch Nr: {batch_num+1} Data: {data}')
Epoch Nr: 1 Batch Nr: 1 Data: tensor([2, 1])
Epoch Nr: 1 Batch Nr: 2 Data: tensor([4, 0])
Epoch Nr: 2 Batch Nr: 1 Data: tensor([2, 4])
Epoch Nr: 2 Batch Nr: 2 Data: tensor([3, 1])
Oftentimes it is useful to fetch the next batch of data in a separate process while we are still calculating the gradients. The num_workers parameter determines the number of worker processes that load the data in parallel. The default is 0, which means that only the main process is used.
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=True, drop_last=True, num_workers=4)
We won't notice the speed difference in such a simple example, but the speedup with large datasets might be noticeable.
There are more parameters that the DataLoader class provides. We are not going to cover all of them just yet, because for the most part the usual parameters are sufficient; a couple of additional arguments are sketched below. We will cover the special cases when the need arises. If you are faced with a problem that requires more control, have a look at the PyTorch documentation.
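Two examples of such additional arguments are pin_memory, which places the returned batches in page-locked memory and can speed up transfers to the GPU, and persistent_workers, which keeps the worker processes alive between epochs instead of recreating them. The sketch below combines them with the options we already used; whether they actually pay off depends on your hardware and dataset.

dataloader = DataLoader(dataset=dataset,
                        batch_size=2,
                        shuffle=True,
                        num_workers=2,
                        pin_memory=True,
                        persistent_workers=True)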
Training Loop
The training loop that we implemented when we solved our circular problem works just fine, but PyTorch provides much better approaches. Once our neural network architectures get more and more complex, we will be glad that we are able to utilize a more efficient training approach.
import torch
import sklearn.datasets as datasets
from torch.utils.data import DataLoader, Dataset
This time around we explicitly define some parameters as constants. We also use a much higher number of samples and neurons to demonstrate that PyTorch is able to handle those.
# parameters
DEVICE = ("cuda:0" if torch.cuda.is_available() else "cpu")
NUM_EPOCHS = 10
BATCH_SIZE = 1024
NUM_SAMPLES = 1_000_000
NUM_FEATURES = 10
ALPHA = 0.1
# number of hidden units in the first and second hidden layer
HIDDEN_SIZE_1 = 1000
HIDDEN_SIZE_2 = 500
We create a simple classification dataset with sklearn and construct a Dataset object.
class Data(Dataset):
    def __init__(self):
        X, y = datasets.make_classification(
            n_samples=NUM_SAMPLES,
            n_features=NUM_FEATURES,
            n_informative=7,
            n_classes=2,
        )
        self.X = torch.from_numpy(X).to(torch.float32)
        self.y = torch.from_numpy(y).to(torch.float32).view(-1, 1)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
dataset = Data()
dataloader = DataLoader(dataset=dataset,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        drop_last=True,
                        num_workers=4)
This time we will start by looking at the desired end product, the training loop, to understand what we need in order to make our code clean, modular and scalable. Instead of calculating one layer after another, we will compute the forward pass with a single call to the model. The model contains all the matrix multiplications and activation functions needed to predict the probability that the features belong to a certain class. The criterion is essentially a loss function, in our case the binary cross-entropy. The optimizer loops through all parameters of the model, applies gradient descent when we call optimizer.step() and clears all the gradients when we call optimizer.zero_grad().
def train(dataloader, model, criterion, optimizer):
    for epoch in range(NUM_EPOCHS):
        loss_sum = 0
        batch_nums = 0
        for batch_idx, (features, labels) in enumerate(dataloader):
            # move features and labels to the GPU
            features = features.to(DEVICE)
            labels = labels.to(DEVICE)

            # ------ FORWARD PASS --------
            probs = model(features)

            # ------ CALCULATE LOSS --------
            loss = criterion(probs, labels)

            # ------ BACKPROPAGATION --------
            loss.backward()

            # ------ GRADIENT DESCENT --------
            optimizer.step()

            # ------ CLEAR GRADIENTS --------
            optimizer.zero_grad()

            # ------ TRACK LOSS --------
            batch_nums += 1
            # detach() removes the tensor from the computational graph
            # and cpu() moves the tensor from the GPU to the CPU
            loss_sum += loss.detach().cpu()

        print(f'Epoch: {epoch+1} Loss: {loss_sum / batch_nums}')
In order to make our calculations more modular, we will create a Module class. You can think of a module as a building block of a neural network. Usually modules are those pieces of a network that we use over and over again; in essence you create a neural network by defining and stacking modules. As we need to apply affine transformations several times, we put the logic of a linear layer into a separate class, which we call Module. This module initializes a weight matrix and a bias vector. For easier access at a later point we create an attribute parameters, which is just a list holding the weights and biases. We also implement the __call__ method, which contains the logic for the forward pass.
class Module:
    def __init__(self, in_features, out_features):
        self.W = torch.normal(mean=0,
                              std=0.1,
                              size=(out_features, in_features),
                              requires_grad=True,
                              device=DEVICE,
                              dtype=torch.float32)
        self.b = torch.zeros(1,
                             out_features,
                             requires_grad=True,
                             device=DEVICE,
                             dtype=torch.float32)
        self.parameters = [self.W, self.b]

    def __call__(self, features):
        return features @ self.W.T + self.b
Our model needs an activation function, so we implement a sigmoid function.
def sigmoid(z):
    return 1 / (1 + torch.exp(-z))
The Model class is the abstraction of the neural network. We will need three fully connected layers, so the model initializes three linear modules. In the __call__ method we implement the forward pass of the neural network, so when we call model(features), the features are processed by the neural network until the last layer is reached. Additionally we implement the parameters method, which returns the full list of the parameters of the model.
class Model:
    def __init__(self):
        self.linear_1 = Module(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = Module(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = Module(HIDDEN_SIZE_2, 1)

    def __call__(self, features):
        x = self.linear_1(features)
        x = sigmoid(x)
        x = self.linear_2(x)
        x = sigmoid(x)
        x = self.linear_3(x)
        x = sigmoid(x)
        return x

    def parameters(self):
        parameters = [*self.linear_1.parameters,
                      *self.linear_2.parameters,
                      *self.linear_3.parameters]
        return parameters
Below we test the forward pass with random numbers. Calling the forward pass of a predefined model should feel more intuitive than in our previous implementations.
features = torch.randn(BATCH_SIZE, NUM_FEATURES).to(DEVICE)
model = Model()
output = model(features)
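As a quick optional sanity check, we can look at the shape of the output, which given the constants above should be torch.Size([1024, 1]): one probability per sample in the batch.

# the model maps (BATCH_SIZE, NUM_FEATURES) to (BATCH_SIZE, 1)
print(output.shape)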
The optimizer class is responsible for applying gradient descent and for clearing the gradients. Ours is a simple implementation of (mini-batch) stochastic gradient descent, but PyTorch has many more implementations, which we will study in future chapters. Our optimizer class needs the learning rate (alpha) and the parameters of the model. When we call step() we loop over all parameters and apply gradient descent, and when we call zero_grad() we clear all the gradients. Notice that the optimizer logic works independently of the exact architecture of the model, making the code more manageable.
class SGDOptimizer:
    def __init__(self, parameters, alpha):
        self.alpha = alpha
        self.parameters = parameters

    def step(self):
        with torch.inference_mode():
            for parameter in self.parameters:
                parameter.sub_(self.alpha * parameter.grad)

    def zero_grad(self):
        with torch.inference_mode():
            for parameter in self.parameters:
                parameter.grad.zero_()
Finally we implement the loss function. Once again, the calculation of the loss is independent of the model and the optimizer. When we change one of the components, we do not introduce any breaking changes; if we replace the cross-entropy with the mean squared error, our training loop will keep working.
def bce_loss(outputs, labels):
    loss = -(labels * torch.log(outputs) + (1 - labels) * torch.log(1 - outputs)).mean()
    return loss
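As a small, optional sanity check (not part of the training code), we can compare our handwritten loss against PyTorch's built-in binary cross-entropy from torch.nn.functional on a few made-up probabilities; both calls should print the same value.

import torch.nn.functional as F

example_probs = torch.tensor([[0.9], [0.2], [0.7], [0.4]])
example_labels = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# both compute the mean binary cross-entropy
print(bce_loss(example_probs, example_labels))
print(F.binary_cross_entropy(example_probs, example_labels))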
Now we have all the components that are required by our training loop.
model = Model()
optimizer = SGDOptimizer(model.parameters(), ALPHA)
criterion = bce_loss
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.44153448939323425
Epoch: 2 Loss: 0.26614147424697876
Epoch: 3 Loss: 0.1991310715675354
Epoch: 4 Loss: 0.16552086174488068
Epoch: 5 Loss: 0.14674726128578186
Epoch: 6 Loss: 0.13339845836162567
Epoch: 7 Loss: 0.12402357161045074
Epoch: 8 Loss: 0.11728055775165558
Epoch: 9 Loss: 0.11224914342164993
Epoch: 10 Loss: 0.1082562804222107
You can probably guess that PyTorch provides the classes and functions that we implemented above out of the box. The PyTorch module torch.nn contains most of the classes and functions that we will require.
import torch.nn as nn
When we write custom PyTorch modules we need to subclass nn.Module. We need to put all trainable parameters into the nn.parameter.Parameter() class. This tells PyTorch to put those tensors into the parameters list (which is used by the optimizer), and the tensors are automatically tracked for gradient computation. Instead of defining __call__ as we did before, we define the forward method. PyTorch calls forward automatically when we call the module object. You should never call this method directly, as PyTorch does additional work during the forward pass, so instead of using module.forward(features), use module(features).
class Module(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.parameter.Parameter(torch.normal(mean=0, std=0.1,
                                                     size=(out_features, in_features)))
        self.b = nn.parameter.Parameter(torch.zeros(1, out_features))

    def forward(self, features):
        return features @ self.W.T + self.b
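A quick illustrative check that the registration works as described: because W and b are wrapped in nn.parameter.Parameter, the parameters() method that nn.Module provides yields both tensors.

module = Module(in_features=3, out_features=2)
# expected: [torch.Size([2, 3]), torch.Size([1, 2])]
print([parameter.shape for parameter in module.parameters()])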
The great thing about PyTorch modules is their composability. Previously created modules can be used in subsequent modules. Below, for example, we use the above defined Module class in the Model module. In later chapters we will see how we can create blocks of arbitrary complexity using this simple approach.
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = Module(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = Module(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = Module(HIDDEN_SIZE_2, 1)

    def forward(self, features):
        x = self.linear_1(features)
        x = torch.sigmoid(x)
        x = self.linear_2(x)
        x = torch.sigmoid(x)
        x = self.linear_3(x)
        x = torch.sigmoid(x)
        return x
PyTorch obviously provides loss functions and optimizers. We will use BCELoss, which calculates the binary cross-entropy loss. Optimizers are located in torch.optim. For now we will use stochastic gradient descent, but there are many more optimizers that we will encounter soon.
model = Model().to(DEVICE)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.4358866512775421
Epoch: 2 Loss: 0.26300883293151855
Epoch: 3 Loss: 0.1951223760843277
Epoch: 4 Loss: 0.16517716646194458
Epoch: 5 Loss: 0.14785249531269073
Epoch: 6 Loss: 0.1351807564496994
Epoch: 7 Loss: 0.12569186091423035
Epoch: 8 Loss: 0.11819736659526825
Epoch: 9 Loss: 0.11242685467004776
Epoch: 10 Loss: 0.10799615830183029
PyTorch provides a lot of modules out of the box. An affine/linear transformation layer is such a common building block that you should use nn.Linear instead of implementing your own version from scratch.
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear_1 = nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1)
        self.linear_2 = nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2)
        self.linear_3 = nn.Linear(HIDDEN_SIZE_2, 1)

    def forward(self, features):
        x = self.linear_1(features)
        x = torch.sigmoid(x)
        x = self.linear_2(x)
        x = torch.sigmoid(x)
        x = self.linear_3(x)
        x = torch.sigmoid(x)
        return x
model = Model().to(DEVICE)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.46121323108673096
Epoch: 2 Loss: 0.345653235912323
Epoch: 3 Loss: 0.26799750328063965
Epoch: 4 Loss: 0.20885568857192993
Epoch: 5 Loss: 0.16782595217227936
Epoch: 6 Loss: 0.14582592248916626
Epoch: 7 Loss: 0.1313050240278244
Epoch: 8 Loss: 0.12312141805887222
Epoch: 9 Loss: 0.11707331985235214
Epoch: 10 Loss: 0.11287659406661987
To finish this chapter, let us discuss an additional PyTorch convenience. You might have noticed that all modules and activation functions are called one after another, where the output of one module (or activation) is used as the input to the next. In that case we can pack all modules and activations into an nn.Sequential object. When we call that object, the components are executed in sequential order.
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(NUM_FEATURES, HIDDEN_SIZE_1),
            nn.Sigmoid(),
            nn.Linear(HIDDEN_SIZE_1, HIDDEN_SIZE_2),
            nn.Sigmoid(),
            nn.Linear(HIDDEN_SIZE_2, 1),
            # the final sigmoid is needed because BCELoss expects probabilities
            nn.Sigmoid(),
        )

    def forward(self, features):
        return self.layers(features)
model = Model().to(DEVICE)
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=ALPHA)
train(dataloader, model, criterion, optimizer)
Epoch: 1 Loss: 0.4605180025100708
Epoch: 2 Loss: 0.3372548818588257
Epoch: 3 Loss: 0.27341559529304504
Epoch: 4 Loss: 0.22028055787086487
Epoch: 5 Loss: 0.17632894217967987
Epoch: 6 Loss: 0.15047569572925568
Epoch: 7 Loss: 0.1337045431137085
Epoch: 8 Loss: 0.12339214235544205
Epoch: 9 Loss: 0.11565018445253372
Epoch: 10 Loss: 0.11087213456630707