VGG
The VGG[1] ConvNet architecture was developed by the Visual Geometry Group, a computer vision research lab at the University of Oxford. The network is similar in spirit to LeNet-5 and AlexNet, but VGG is a much deeper neural network. Unlike AlexNet, VGG does not apply any large filters; it uses only small 3x3 kernels, and the authors attributed the success of their network to this design choice. VGG took second place in the classification track and first place in the localization track of the 2014 ImageNet challenge.
The VGG paper discusses networks of varying depth, from 11 to 19 layers. We are going to discuss the 16-layer architecture, the so-called VGG16 (architecture D in the paper).
As with many other deep learning architectures, VGG reuses the same module over and over again. The VGG module consists of a convolutional layer with a kernel size of 3x3, a stride of 1, and padding of 1, followed by batch normalization and the ReLU activation function. Be aware that the BatchNorm2d layer was not used in the original VGG paper, but if you omit the normalization step, the network might suffer from vanishing gradients.
After a couple of such modules, we apply a max pooling layer with a kernel size of 2 and a stride of 2.
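To see how this keeps the spatial bookkeeping simple, here is a minimal sketch (not part of the original code) that passes a dummy tensor through a single 3x3 convolution with stride 1 and padding 1, followed by a 2x2 max pooling layer: the convolution leaves the spatial size unchanged, while the pooling halves it.
import torch
from torch import nn

x = torch.randn(1, 3, 224, 224)  # dummy batch: one 3-channel 224x224 image
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(conv(x).shape)        # torch.Size([1, 64, 224, 224]) -> spatial size preserved
print(pool(conv(x)).shape)  # torch.Size([1, 64, 112, 112]) -> spatial size halved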
The full VGG16 architecture looks as follows.
Type | Input Size | Output Size |
---|---|---|
VGG Module | 224x224x3 | 224x224x64 |
VGG Module | 224x224x64 | 224x224x64 |
Max Pooling | 224x224x64 | 112x112x64 |
VGG Module | 112x112x64 | 112x112x128 |
VGG Module | 112x112x128 | 112x112x128 |
Max Pooling | 112x112x128 | 56x56x128 |
VGG Module | 56x56x128 | 56x56x256 |
VGG Module | 56x56x256 | 56x56x256 |
VGG Module | 56x56x256 | 56x56x256 |
Max Pooling | 56x56x256 | 28x28x256 |
VGG Module | 28x28x256 | 28x28x512 |
VGG Module | 28x28x512 | 28x28x512 |
VGG Module | 28x28x512 | 28x28x512 |
Max Pooling | 28x28x512 | 14x14x512 |
VGG Module | 14x14x512 | 14x14x512 |
VGG Module | 14x14x512 | 14x14x512 |
VGG Module | 14x14x512 | 14x14x512 |
Max Pooling | 14x14x512 | 7x7x512 |
Dropout | - | - |
Fully Connected | 25088 | 4096 |
ReLU | - | - |
Dropout | - | - |
Fully Connected | 4096 | 4096 |
ReLU | - | - |
Fully Connected | 4096 | 1000 |
Softmax | - | - |
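As a quick cross-check of the table (a sketch, assuming a recent torchvision version is installed), we can push a dummy 224x224 image through the convolutional part of torchvision's reference vgg16 and confirm that it ends in the 7x7x512 feature map that the first fully connected layer flattens to 25088 values.
import torch
from torchvision.models import vgg16

reference = vgg16(weights=None)     # untrained reference VGG16 from torchvision
x = torch.randn(1, 3, 224, 224)     # dummy ImageNet-sized input
print(reference.features(x).shape)  # torch.Size([1, 512, 7, 7]) -> 7*7*512 = 25088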
Below we implement VGG16 to classify the images in the CIFAR-10 dataset.
import time
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, Subset
from torchvision.datasets.cifar import CIFAR10
from torchvision import transforms as T
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_transform = T.Compose([T.Resize((50, 50)),
                             T.ToTensor(),
                             T.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_val_dataset = CIFAR10(root='../datasets', download=True, train=True, transform=train_transform)
# split dataset into train and validate
indices = list(range(len(train_val_dataset)))
train_idxs, val_idxs = train_test_split(
    indices, test_size=0.1, stratify=train_val_dataset.targets
)
train_dataset = Subset(train_val_dataset, train_idxs)
val_dataset = Subset(train_val_dataset, val_idxs)
# In the paper a batch size of 256 was used
batch_size=128
train_dataloader = DataLoader(
    dataset=train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4,
    drop_last=True,
)
val_dataloader = DataLoader(
    dataset=val_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=4,
    drop_last=False,
)
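As a quick sanity check of the input pipeline (not part of the original code), we can draw a single batch from the training dataloader and inspect its shape.
features, labels = next(iter(train_dataloader))
print(features.shape)  # torch.Size([128, 3, 50, 50]) -> a batch of resized CIFAR-10 images
print(labels.shape)    # torch.Size([128])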
We create a VGG_Block module that we can reuse many times.
class VGG_Block(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.layer = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(num_features=out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return self.layer(x)
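A short shape check (again just a sketch) confirms that a VGG_Block only changes the number of channels and never the spatial resolution.
block = VGG_Block(in_channels=3, out_channels=64)
x = torch.randn(1, 3, 50, 50)  # dummy CIFAR-10 sized input
print(block(x).shape)          # torch.Size([1, 64, 50, 50]) -> same height and width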
VGG has a lot of repeating blocks. It is common practice to store the configuration in a list and to construct the model from that config. Each number represents the number of output channels of a convolutional layer, while 'M' indicates a max pooling layer.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M", 512, 512, 512, "M", 512, 512, 512, "M"]
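Counting the entries also explains the name VGG16: the config describes 13 convolutional layers, and together with the 3 fully connected layers of the classifier we arrive at 16 weight layers; the max pooling layers do not count, because they have no learnable parameters. The snippet below is just a quick check.
num_conv_layers = sum(1 for element in cfg if element != "M")
print(num_conv_layers)      # 13 convolutional layers
print(num_conv_layers + 3)  # plus 3 fully connected layers -> 16 weight layers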
Our model implementation is very close to the table above, but we have to account for the fact that our images are smaller: after the five pooling layers a 50x50 input is reduced to 1x1, so the input of the first linear layer shrinks from 7x7x512 (25088) to 1x1x512 (512).
class Model(nn.Module):
    def __init__(self, cfg, num_classes=10):
        super().__init__()
        self.cfg = cfg
        self.feature_extractor = self._make_feature_extractor()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.5),
            nn.Linear(512, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes)
        )

    def _make_feature_extractor(self):
        layers = []
        in_channels = 3
        for element in self.cfg:
            if element == "M":
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [VGG_Block(in_channels, element)]
                in_channels = element
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.feature_extractor(x)
        x = self.classifier(x)
        return x
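Before we train the model, it is worth verifying the claim about the smaller feature map (a sanity check, not part of the original code): a 50x50 input shrinks to 25, 12, 6, 3 and finally 1 pixel after the five pooling layers, so the flattened feature vector has 1x1x512 = 512 entries and the classifier outputs one logit per CIFAR-10 class.
test_model = Model(cfg)
x = torch.randn(1, 3, 50, 50)                 # dummy CIFAR-10 sized input
print(test_model.feature_extractor(x).shape)  # torch.Size([1, 512, 1, 1])
print(test_model(x).shape)                    # torch.Size([1, 10]) -> one logit per class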
def track_performance(dataloader, model, criterion):
    # switch to evaluation mode
    model.eval()
    num_samples = 0
    num_correct = 0
    loss_sum = 0
    # no need to calculate gradients
    with torch.inference_mode():
        for _, (features, labels) in enumerate(dataloader):
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                features = features.to(device)
                labels = labels.to(device)
                logits = model(features)
                predictions = logits.max(dim=1)[1]
                num_correct += (predictions == labels).sum().item()
                loss = criterion(logits, labels)
                loss_sum += loss.cpu().item()
                num_samples += len(features)
    # we return the average loss and the accuracy
    return loss_sum / num_samples, num_correct / num_samples
def train(
    num_epochs,
    train_dataloader,
    val_dataloader,
    model,
    criterion,
    optimizer,
    scheduler=None,
):
    model.to(device)
    scaler = torch.cuda.amp.GradScaler()
    for epoch in range(num_epochs):
        start_time = time.time()
        for _, (features, labels) in enumerate(train_dataloader):
            model.train()
            features = features.to(device)
            labels = labels.to(device)
            # Empty the gradients
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.float16):
                # Forward Pass
                logits = model(features)
                # Calculate Loss
                loss = criterion(logits, labels)
            # Backward Pass
            scaler.scale(loss).backward()
            # Gradient Descent
            scaler.step(optimizer)
            scaler.update()
        val_loss, val_acc = track_performance(val_dataloader, model, criterion)
        end_time = time.time()
        s = (
            f"Epoch: {epoch+1:>2}/{num_epochs} | "
            f"Epoch Duration: {end_time - start_time:.3f} sec | "
            f"Val Loss: {val_loss:.5f} | "
            f"Val Acc: {val_acc:.3f} |"
        )
        print(s)
        if scheduler:
            scheduler.step(val_loss)
model = Model(cfg)
optimizer = optim.Adam(params=model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
optimizer, factor=0.1, patience=2, verbose=True
)
criterion = nn.CrossEntropyLoss(reduction="sum")
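As a rough sanity check of the model size (a sketch; the exact number depends on the classifier defined above), we can count the trainable parameters before starting the training run.
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{num_params:,} trainable parameters")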
When we train VGG16 on the CIFAR-10 dataset, we reach a validation accuracy of roughly 88%, thereby beating our LeNet-5 and AlexNet implementations.
train(
    num_epochs=30,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
)
Epoch: 1/30 | Epoch Duration: 25.635 sec | Val Loss: 1.81580 | Val Acc: 0.296 |
Epoch: 2/30 | Epoch Duration: 24.916 sec | Val Loss: 1.38543 | Val Acc: 0.463 |
Epoch: 3/30 | Epoch Duration: 25.014 sec | Val Loss: 1.28278 | Val Acc: 0.547 |
Epoch: 4/30 | Epoch Duration: 25.074 sec | Val Loss: 1.19473 | Val Acc: 0.595 |
Epoch: 5/30 | Epoch Duration: 25.043 sec | Val Loss: 0.88059 | Val Acc: 0.689 |
Epoch: 6/30 | Epoch Duration: 25.063 sec | Val Loss: 0.71676 | Val Acc: 0.752 |
Epoch: 7/30 | Epoch Duration: 25.054 sec | Val Loss: 0.69538 | Val Acc: 0.760 |
Epoch: 8/30 | Epoch Duration: 25.065 sec | Val Loss: 0.77932 | Val Acc: 0.738 |
Epoch: 9/30 | Epoch Duration: 25.053 sec | Val Loss: 0.64442 | Val Acc: 0.792 |
Epoch: 10/30 | Epoch Duration: 25.080 sec | Val Loss: 0.55705 | Val Acc: 0.817 |
Epoch: 11/30 | Epoch Duration: 25.084 sec | Val Loss: 0.54697 | Val Acc: 0.821 |
Epoch: 12/30 | Epoch Duration: 25.086 sec | Val Loss: 0.51530 | Val Acc: 0.836 |
Epoch: 13/30 | Epoch Duration: 25.099 sec | Val Loss: 0.52571 | Val Acc: 0.832 |
Epoch: 14/30 | Epoch Duration: 25.081 sec | Val Loss: 0.52763 | Val Acc: 0.834 |
Epoch: 15/30 | Epoch Duration: 25.100 sec | Val Loss: 0.51354 | Val Acc: 0.852 |
Epoch: 16/30 | Epoch Duration: 25.063 sec | Val Loss: 0.49283 | Val Acc: 0.854 |
Epoch: 17/30 | Epoch Duration: 25.072 sec | Val Loss: 0.60646 | Val Acc: 0.839 |
Epoch: 18/30 | Epoch Duration: 25.110 sec | Val Loss: 0.68762 | Val Acc: 0.831 |
Epoch: 19/30 | Epoch Duration: 25.067 sec | Val Loss: 0.55200 | Val Acc: 0.852 |
Epoch 00019: reducing learning rate of group 0 to 1.0000e-04.
Epoch: 20/30 | Epoch Duration: 25.090 sec | Val Loss: 0.52681 | Val Acc: 0.877 |
Epoch: 21/30 | Epoch Duration: 25.084 sec | Val Loss: 0.54211 | Val Acc: 0.880 |
Epoch: 22/30 | Epoch Duration: 25.084 sec | Val Loss: 0.59634 | Val Acc: 0.878 |
Epoch 00022: reducing learning rate of group 0 to 1.0000e-05.
Epoch: 23/30 | Epoch Duration: 25.104 sec | Val Loss: 0.59584 | Val Acc: 0.881 |
Epoch: 24/30 | Epoch Duration: 25.052 sec | Val Loss: 0.60467 | Val Acc: 0.880 |
Epoch: 25/30 | Epoch Duration: 25.068 sec | Val Loss: 0.61155 | Val Acc: 0.880 |
Epoch 00025: reducing learning rate of group 0 to 1.0000e-06.
Epoch: 26/30 | Epoch Duration: 25.117 sec | Val Loss: 0.61680 | Val Acc: 0.879 |
Epoch: 27/30 | Epoch Duration: 25.059 sec | Val Loss: 0.62156 | Val Acc: 0.881 |
Epoch: 28/30 | Epoch Duration: 25.089 sec | Val Loss: 0.61393 | Val Acc: 0.878 |
Epoch 00028: reducing learning rate of group 0 to 1.0000e-07.
Epoch: 29/30 | Epoch Duration: 25.077 sec | Val Loss: 0.62117 | Val Acc: 0.880 |
Epoch: 30/30 | Epoch Duration: 25.075 sec | Val Loss: 0.61320 | Val Acc: 0.880 |