Minimizing Cross-Entropy
We finally have the means to find the weight vector \mathbf{w} and the bias b that minimize the binary cross-entropy loss, so let's see how we can accomplish this goal.
Info
\text{Cross-Entropy} = L = - \dfrac{1}{n} \sum_i \Big[y^{(i)} \log \sigma(z^{(i)}) + (1 - y^{(i)}) \log(1 - \sigma(z^{(i)})) \Big]
The cross-entropy loss is a relatively complex composite function with many moving parts, but if we focus on the atomic components of the function and construct a computational graph, as we did for linear regression, we can utilize the chain rule, and the calculation of the gradients becomes relatively straightforward.
In this section we will try to find the optimal weights and bias that define the decision boundary between the two categories in the plot below.
The dataset consists of two features, so we start building our computational graph by multiplying each of the features by its corresponding weight.
x_1 | x_2 | y |
---|---|---|
0 | 0 | 0 |
0.1 | 0.23 | 0 |
... | ... | ... |
0.75 | 0.89 | 1 |
1 | 1 | 1 |
We call the two scaled values s_1 and s_2 respectively.
s_1 = w_1 x_1 \\ s_2 = w_2 x_2
We sum the two scaled values and add the bias to get the net input z.
z = s_1 + s_2 + b
Assuming that the features of the sample are 0.75 and 0.89, the weights are 0.5 and -0.5, and the bias is 1, we get the following computational graph so far.
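Plugging these numbers in, the net input for this sample works out as follows.
s_1 = 0.5 \cdot 0.75 = 0.375 \\ s_2 = -0.5 \cdot 0.89 = -0.445 \\ z = 0.375 - 0.445 + 1 = 0.93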
In the next step the net input is used as an input into the sigmoid function \dfrac{1}{1 + e^{-z}} to get the output a.
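For our running example this gives the following output.
a = \sigma(0.93) = \dfrac{1}{1 + e^{-0.93}} \approx 0.72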
Next we use the output of the sigmoid as an input to the cross-entropy loss. When you look at the (single sample) loss function L = -\Big[y \log a + (1 - y) \log(1 - a) \Big] , you will notice that the loss is dependent on the label y. If the label is 1, the loss collapses to -\log (a) and if the label is 0 the loss collapses to -\log(1 - a). The sample that we have been looking at so far corresponds to a label of 1, so our computational graph looks as follows.
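With a label of 1 and an output of roughly 0.72, the loss for this single sample is approximately the following.
L = -\log(a) \approx -\log(0.72) \approx 0.33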
When we deal with batch or mini-batch gradient descent, we would do the same exercise for many other samples and our computational graph would get additional nodes. But let's keep the computation simple and assume that we are dealing with stochastic gradient descent and would like to calculate the gradients using a single sample. The procedure is the same one that we used with linear regression: we start at the top node and keep calculating the intermediary gradients until we reach the weights and the bias. Along the way we multiply the local gradients by the gradients from the nodes above.
You should already be familiar with basic differentiation rules. The only difficulty you might face is the derivative of the sigmoid function \sigma(z) = \dfrac{1}{1 + e^{-z}} with respect to the net input z.
The derivative of the sigmoid function is relatively straightforward, but the derivation process is somewhat mathematically involved. It is not necessary to know the exact steps by which we arrive at the derivative, but if you are interested, we provide them below.
Info
\begin{aligned} \dfrac{\partial}{\partial z}\sigma &= \dfrac{\partial}{\partial z} \dfrac{1}{1 + e^{-z}} \\ &= \dfrac{\partial}{\partial z} ({1 + e^{-z}})^{-1} \\ &= -({1 + e^{-z}})^{-2} \cdot (-e^{-z}) \\ &= \dfrac{ e^{-z}}{({1 + e^{-z}})^{2}} \\ &= \dfrac{1}{{1 + e^{-z}}} \cdot \dfrac{ e^{-z}}{{1 + e^{-z}}} \\ &= \dfrac{1}{{1 + e^{-z}}} \cdot \dfrac{ 1 + e^{-z} - 1}{{1 + e^{-z}}} \\ &= \dfrac{1}{{1 + e^{-z}}} \Big(\dfrac{ 1 + e^{-z}}{1 + e^{-z}} - \dfrac{1}{{1 + e^{-z}}}\Big) \\ &= \dfrac{1}{{1 + e^{-z}}} \Big(1 - \dfrac{1}{{1 + e^{-z}}}\Big) \\ &= \sigma(z) (1 - \sigma(z)) \end{aligned}
Calculating the gradients from top to bottom yields the following results.
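Using the values from our running example (a label of 1 and an output of roughly 0.72), the chain rule produces approximately these gradients.
\begin{aligned} \dfrac{\partial L}{\partial a} &= -\dfrac{1}{a} \approx -1.39 \\ \dfrac{\partial a}{\partial z} &= a(1 - a) \approx 0.20 \\ \dfrac{\partial L}{\partial z} &= \dfrac{\partial L}{\partial a} \dfrac{\partial a}{\partial z} = a - 1 \approx -0.28 \\ \dfrac{\partial L}{\partial w_1} &= \dfrac{\partial L}{\partial z} x_1 \approx -0.28 \cdot 0.75 \approx -0.21 \\ \dfrac{\partial L}{\partial w_2} &= \dfrac{\partial L}{\partial z} x_2 \approx -0.28 \cdot 0.89 \approx -0.25 \\ \dfrac{\partial L}{\partial b} &= \dfrac{\partial L}{\partial z} \approx -0.28 \end{aligned}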
When we deal with several samples, the computation does not get much more complicated.
L = - \dfrac{1}{n} \sum_i \Big[y^{(i)} \log \sigma(z^{(i)}) + (1 - y^{(i)}) \log(1 - \sigma(z^{(i)})) \Big]
As always, the gradient of a sum is the sum of the gradients, so the weights and the bias accumulate the gradients across the individual samples, and the result is eventually scaled by \dfrac{1}{n}.
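Written out explicitly, the batch gradients take the following form, where a^{(i)} = \sigma(z^{(i)}) denotes the output for sample i.
\dfrac{\partial L}{\partial w_j} = \dfrac{1}{n} \sum_i \big(a^{(i)} - y^{(i)}\big) x_j^{(i)} \qquad \dfrac{\partial L}{\partial b} = \dfrac{1}{n} \sum_i \big(a^{(i)} - y^{(i)}\big)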
In the interactive example below we demonstrate the gradient descent algorithm for logistic regression. This is the same example that you tried to solve manually in a previous chapter. Start the algorithm and observe how the loss decreases over time.
Variable | Value |
---|---|
L | 0.00 |
w_1 | -0.63 |
w_2 | 0.60 |
b | 0.11 |
The gradient descent algorithm learns to separate the data in a matter of seconds.
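Plugging the learned values into the net input and setting it to zero gives the (approximate) decision boundary as a line in feature space.
-0.63 x_1 + 0.60 x_2 + 0.11 = 0 \quad \Longrightarrow \quad x_2 \approx 1.05 x_1 - 0.18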
We can implement logistic regression in PyTorch, using the same techniques that we used with linear regression. Hardly any parts of the code need to change.
import torch
import sklearn.datasets as datasets
X, y = datasets.make_classification(n_samples=4, n_features=4)
X = torch.from_numpy(X).to(torch.float32)
y = torch.from_numpy(y).to(torch.float32).unsqueeze(1)
def init_weights():
    # one weight per feature and a single bias, both tracked by autograd
    w = torch.randn(1, 4, requires_grad=True)
    b = torch.randn(1, 1, requires_grad=True)
    return w, b
The only code snippet that is truly different is the forward pass. Here we calculate the cross-entropy loss using some of the built-in PyTorch functions.
def forward(w, b):
    # net input: one logit per sample
    z = X @ w.T + b
    # sigmoid squashes the logits into probabilities
    sigma = torch.sigmoid(z)
    # binary cross-entropy, averaged over the batch
    loss = y * torch.log(sigma) + (1 - y) * torch.log(1 - sigma)
    loss = -loss.mean()
    return loss
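As a side note, PyTorch also provides this loss out of the box. A minimal sketch of an equivalent forward pass, assuming the same X, y, w and b and using the hypothetical helper name forward_builtin, could rely on torch.nn.functional.binary_cross_entropy_with_logits, which fuses the sigmoid and the cross-entropy into one numerically more stable operation.
import torch.nn.functional as F

def forward_builtin(w, b):
    # same net input as in forward()
    z = X @ w.T + b
    # sigmoid + binary cross-entropy in a single, numerically stable call
    return F.binary_cross_entropy_with_logits(z, y)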
lr = 0.1
w, b = init_weights()
The training loop remains basically the same.
for _ in range(10):
    # forward pass
    cross_entropy = forward(w, b)
    print(f'Cross Entropy: {cross_entropy.data}')

    # backward pass
    cross_entropy.backward()

    # gradient descent
    with torch.inference_mode():
        w.data.sub_(w.grad * lr)
        b.data.sub_(b.grad * lr)

        w.grad.zero_()
        b.grad.zero_()
Cross Entropy: 1.1454823017120361
Cross Entropy: 1.0852206945419312
Cross Entropy: 1.0285975933074951
Cross Entropy: 0.9757254123687744
Cross Entropy: 0.9266657829284668
Cross Entropy: 0.8814209699630737
Cross Entropy: 0.8399296402931213
Cross Entropy: 0.8020696640014648
Cross Entropy: 0.7676653861999512
Cross Entropy: 0.7365001440048218
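Once training finishes, we could turn the learned parameters into class labels by thresholding the sigmoid output at 0.5. The helper name predict below is purely illustrative and not part of the code above.
def predict(w, b, X):
    with torch.no_grad():
        # probability that each sample belongs to class 1
        probs = torch.sigmoid(X @ w.T + b)
    # threshold at 0.5 to obtain hard 0/1 predictions
    return (probs >= 0.5).to(torch.float32)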