Minimizing Cross-Entropy

We finally have the means to find the weight vector $\mathbf{w}$ and the bias $b$ that minimize the binary cross-entropy loss, so let's see how we can accomplish this goal.

Info

The binary cross-entropy loss that we want to minimize is

$$
L = -\frac{1}{n}\sum_{i=1}^{n}\left[y^{(i)}\log\left(a^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right)\right], \qquad a^{(i)} = \sigma\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right)
$$

The cross-entropy loss is a relatively complex composite function with many moving parts. But if we focus on the atomic components of the function and construct a computational graph, as we did for linear regression, we can apply the chain rule and the calculation of gradients becomes relatively straightforward.

In this section we will try to find the optimal weights and bias that define the decision boundary between the two categories in the plot below.

[Scatter plot: Feature 1 vs. Feature 2, showing the two categories of data points.]

The dataset consists of two features, so we start to build our computational graph by multiplying each of the features by a corresponding weight.

x_1     x_2     y
0       0       0
0.1     0.23    0
...     ...     ...
0.75    0.89    1
1       1       1

We call the two scaled values $s_1$ and $s_2$ respectively.

$$
s_1 = w_1 x_1 \qquad s_2 = w_2 x_2
$$

We then sum the two scaled values and add the bias to get the net input $z$.

$$
z = s_1 + s_2 + b = w_1 x_1 + w_2 x_2 + b
$$

Assuming that the features of the sample are 0.75 and 0.89, the weights are 0.5 and -0.5, and the bias is 1, we get the following computational graph so far.

[Computational graph: the features $x_1 = 0.75$ and $x_2 = 0.89$ are multiplied by the weights $w_1 = 0.5$ and $w_2 = -0.5$, producing $s_1 = 0.38$ and $s_2 = -0.45$, which are summed with the bias $b = 1$ to give the net input $z = 0.93$.]
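The intermediate values in the graph follow directly from the formulas above.

$$
\begin{aligned}
s_1 &= w_1 x_1 = 0.5 \cdot 0.75 = 0.375 \\
s_2 &= w_2 x_2 = -0.5 \cdot 0.89 = -0.445 \\
z &= s_1 + s_2 + b = 0.375 - 0.445 + 1 = 0.93
\end{aligned}
$$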

In the next step the net input is used as the input to the sigmoid function $\sigma$ to get the output $a$.

[Computational graph: the net input $z = 0.93$ is passed through the sigmoid node, producing the output $a = 0.72$.]

Next we use the output of the sigmoid as the input to the cross-entropy loss. When you look at the (single-sample) loss function $L = -\left[y\log(a) + (1-y)\log(1-a)\right]$, you will notice that the loss depends on the label $y$. If the label is 1, the loss collapses to $-\log(a)$, and if the label is 0, the loss collapses to $-\log(1-a)$. The sample that we have been looking at so far corresponds to a label of 1, so our computational graph looks as follows.

[Computational graph: the output $a = 0.72$ is passed through the log node ($\log(0.72) \approx -0.33$) and multiplied by the constant $-1$, producing the loss $L = 0.33$.]
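Because the label of our sample is 1, the last two steps evaluate to

$$
a = \sigma(0.93) \approx 0.72, \qquad L = -\log(0.72) \approx 0.33
$$

which matches the values shown in the graph.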

When we deal with batch or mini-batch gradient descent, we would do the same exercise for many other samples and our computational graph would get additional nodes. But let's keep the computation simple and assume that we are dealing with stochastic gradient descent and would like to calculate the gradients using a single sample. The procedure is the same one we used with linear regression: we start at the top node and keep calculating the intermediate gradients until we reach the weights and the bias, multiplying the local gradients along the way by the gradients flowing from the nodes above.

You should already be familiar with basic differentiation rules. The only difficulty you might face is the derivative of the sigmoid function $\sigma(z)$ with respect to the net input $z$.

$$
\frac{d\sigma(z)}{dz} = \sigma(z)\big(1 - \sigma(z)\big)
$$

The derivative of the sigmoid function is relatively straightforward, but the derivation is somewhat mathematically involved. It is not necessary to know the exact steps by which we arrive at the derivative, but if you are interested, we provide them below.

Info

$$
\begin{aligned}
\frac{d\sigma(z)}{dz} &= \frac{d}{dz}\left(\frac{1}{1+e^{-z}}\right) \\
&= \frac{e^{-z}}{\left(1+e^{-z}\right)^2} \\
&= \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}} \\
&= \frac{1}{1+e^{-z}} \cdot \left(1 - \frac{1}{1+e^{-z}}\right) \\
&= \sigma(z)\big(1-\sigma(z)\big)
\end{aligned}
$$
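Evaluated at the net input of our sample, this local gradient amounts to

$$
\sigma'(0.93) = \sigma(0.93)\big(1 - \sigma(0.93)\big) \approx 0.72 \cdot 0.28 \approx 0.2
$$

which explains why the gradient of $-1.39$ at the sigmoid output shrinks to roughly $-1.39 \cdot 0.2 \approx -0.28$ at the net input in the graph below.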

Calculating the gradients from top to bottom results in the following gradients.

Node                     Value    Gradient
L = -1 * log(a)           0.33       1.00
log(a)                   -0.33      -1.00
a = sigmoid(z)            0.72      -1.39
z = s_1 + s_2 + b         0.93      -0.28
s_1 + s_2                -0.07      -0.28
s_1 = w_1 * x_1           0.38      -0.28
Weight w_1                0.50      -0.21
Feature x_1               0.75      -0.14
s_2 = w_2 * x_2          -0.45      -0.28
Weight w_2               -0.50      -0.25
Feature x_2               0.89       0.14
Bias b                    1.00      -0.28
Constant -1              -1.00      -0.33
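You can double-check these numbers with PyTorch's autograd. Below is a minimal sketch (not part of the original example) that rebuilds the graph for this single sample and lets backward() compute the gradients of the weights and the bias.

import torch

# leaf tensors for the parameters of our single-sample example
w_1 = torch.tensor(0.5, requires_grad=True)
w_2 = torch.tensor(-0.5, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x_1, x_2 = torch.tensor(0.75), torch.tensor(0.89)

# forward pass for a sample with label y = 1
z = w_1 * x_1 + w_2 * x_2 + b
a = torch.sigmoid(z)
loss = -torch.log(a)

# backward pass accumulates dL/dw_1, dL/dw_2 and dL/db
loss.backward()
print(w_1.grad, w_2.grad, b.grad)
# roughly -0.21, -0.25 and -0.28, matching the table above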

When we deal with several samples the computation does not get much more complicated.

$$
L = -\frac{1}{n}\sum_{i=1}^{n}\left[y^{(i)}\log\left(a^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right)\right]
$$

As always, the gradient of a sum is the sum of the gradients, so the weights and the bias accumulate the gradients over the individual samples, which are eventually scaled by $\frac{1}{n}$.
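Multiplying the local gradients along the chain gives the well-known closed-form expressions for these accumulated gradients (stated here for reference; the notation follows the batch loss above):

$$
\frac{\partial L}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n}\left(a^{(i)} - y^{(i)}\right)x_j^{(i)}, \qquad
\frac{\partial L}{\partial b} = \frac{1}{n}\sum_{i=1}^{n}\left(a^{(i)} - y^{(i)}\right)
$$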

In the interactive example below we demonstrate the gradient descent algorithm for logistic regression. This is the same example that you tried to solve manually in a previous chapter. Start the algorithm and observe how the loss decreases over time.

[Interactive demo: a table of the current variable values and a scatter plot of Feature 1 vs. Feature 2 showing the learned decision boundary.]

The gradient descent algorithm learns to separate the data in a matter of seconds.

We can implement logistic regression in PyTorch, using the same techniques that we used with linear regression. Hardly any parts of the code need to change.

import torch
import sklearn.datasets as datasets

# generate a small binary classification dataset with 4 samples and 4 features
X, y = datasets.make_classification(n_samples=4, n_features=4)
X = torch.from_numpy(X).to(torch.float32)
y = torch.from_numpy(y).to(torch.float32).unsqueeze(1)

def init_weights():
    # one weight per feature, plus a single bias
    w = torch.randn(1, 4, requires_grad=True)
    b = torch.randn(1, 1, requires_grad=True)
    return w, b

The only code snippet that is truly different is the forward pass. Here we calculate the cross-entropy loss using some of PyTorch's built-in functions.

def forward(w, b):
    # net input
    z = X @ w.T + b
    # sigmoid output (predicted probabilities)
    sigma = torch.sigmoid(z)
    # binary cross-entropy, averaged over all samples
    loss = y * torch.log(sigma) + (1 - y) * torch.log(1 - sigma)
    loss = -loss.mean()
    return loss

lr = 0.1
w, b = init_weights()
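As a side note, PyTorch also provides a built-in, numerically more stable implementation of this loss. The following sketch is an alternative to the forward pass above, not the version used in this section; forward_builtin is just an illustrative name. It folds the sigmoid and the cross-entropy into a single call.

import torch.nn.functional as F

def forward_builtin(w, b):
    # compute only the net input; the sigmoid is applied inside the loss
    z = X @ w.T + b
    # equivalent to the manual sigmoid + log computation above
    return F.binary_cross_entropy_with_logits(z, y)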

The training loop remains basically the same.

for _ in range(10):
    # forward pass
    cross_entropy = forward(w, b)
    
    print(f'Cross Entropy: {cross_entropy.data}')
    
    # backward pass
    cross_entropy.backward()
    
    # gradient descent
    with torch.inference_mode():
        w.data.sub_(w.grad * lr)
        b.data.sub_(b.grad * lr)
        w.grad.zero_()
        b.grad.zero_()
Cross Entropy: 1.1454823017120361
Cross Entropy: 1.0852206945419312
Cross Entropy: 1.0285975933074951
Cross Entropy: 0.9757254123687744
Cross Entropy: 0.9266657829284668
Cross Entropy: 0.8814209699630737
Cross Entropy: 0.8399296402931213
Cross Entropy: 0.8020696640014648
Cross Entropy: 0.7676653861999512
Cross Entropy: 0.7365001440048218