Cross-Entropy Loss
The mean squared error loss tends to be problematic when used as the loss function for classification tasks[1]. The loss that is usually used in classification tasks is called the cross-entropy (or the negative log likelihood) loss.
In 1948 Claude Shannon published an article called "A Mathematical Theory of Communication"[1]. This paper introduced a theoretical foundation for a field that has become known as information theory.
At first glance it might look like we are about to go on a tangent here, because information theory and the loss function for classification tasks shouldn't have a lot in common. Yet the opposite is the case.
Info
In order to understand the cross-entropy loss it is essential to understand information theory!
We measure information using specific information units. The most common unit of information is the so-called bit[2][3], which takes a value of either 0 or 1. Below for example we use 8 bits to encode and send some information.
While we use 8 bits to send a message, we do not actually know how much of that information is useful. To get an intuition regarding that statement let us look at a simple coin toss example.
Let us first imagine that we are dealing with a fair coin, which means that the probability of getting either heads or tails is exactly 50%.
To send a message regarding the outcome of the fair coin toss we need 1 bit. We could for example define heads as 1 and tails as 0. The recipient of the message can remove the uncertainty regarding the coin toss outcome by simply looking at the value of the bit.
But what if we deal with an unfair coin, where heads comes up with a probability of 1?
We could still send 1 bit, but there would be no useful information contained in the message, because the recipient has no uncertainty regarding the outcome of the coin toss. Sending a bit in such a manner would be a waste of resources.
Let's try to formalize the ideas we described above.
Info
Information is inversely related to probability.
We expect less likely events to provide more information than more likely events. In fact an event with a probability of 50% provides exactly 1 bit of information. Or to put it differently, one useful bit reduces uncertainty by a factor of exactly 2. Two bits of useful information reduce uncertainty by a factor of 4, three bits by a factor of 8, and so on.
Info
We can convert the probability p(x) of an event x into bits of information I using the following equation.
We can use basic math to solve \Big(\dfrac{1}{2}\Big)^I = p(x) for the information in bits I.
Info
\begin{aligned} \Big(\frac{1}{2}\Big)^I &= p(x) \\ 2^I &= \frac{1}{p(x)} \\ I &= \log_2\Big(\frac{1}{p(x)}\Big) \\ I &= \log_2(1) - \log_2(p(x)) \\ I &= 0 - \log_2(p(x)) \end{aligned} \\ \boxed{I = -\log_2(p(x))}
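To make the conversion concrete, below is a minimal sketch in Python using numpy. The helper name information is our own choice for illustration and not part of any library.

```python
import numpy as np

def information(p):
    """Information content in bits of an event with probability p: I = -log2(p)."""
    return -np.log2(p)

# An event with probability 0.5 carries exactly 1 bit,
# less likely events carry more bits.
for p in [0.5, 0.25, 0.125, 0.9]:
    print(f"p = {p:5.3f} -> {information(p):.3f} bits")
```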
We can plot the relationship between the probability p(x) of an event x and the information measured in bits, -\log_2(p(x)).
Info
The lower the probability of an event, the higher the information.
Often we are not only interested in the number of bits provided by a particular event, but in the expected value of information (the expected number of bits) that is contained in the whole probability distribution p. This measure is called entropy.
Info
The entropy H(p) of the probability distribution p is defined as the expected level of information, i.e. the expected number of bits: H(p) = -\sum_x p(x) \log_2(p(x)).
Below you can use an interactive example of a binomial distribution where you can change the probability of heads and tails. When you have a fair coin, the entropy amounts to exactly 1 bit. When the probabilities become uneven, the entropy decreases until it reaches a value of 0.
Intuitively speaking, the entropy is a measure of the uncertainty of a probability distribution. Entropy is highest when all possible events have the same probability, and it is 0 when one of the events has a probability of 1 while all other events have a probability of 0.
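The following short sketch reproduces these two extremes. The entropy helper below is simply an illustrative implementation of the formula above, not a library function.

```python
import numpy as np

def entropy(p):
    """Expected number of bits: H(p) = -sum p(x) * log2(p(x))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # fair coin -> 1.0 bit (maximum uncertainty)
print(entropy([0.9, 0.1]))   # uneven coin -> ~0.469 bits
print(entropy([1.0, 0.0]))   # certain outcome -> 0.0 bits
```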
Now let us return to the fair coin toss example. Using the equation and the example above, we know that the entropy is 1 bit. We should therefore try to send the message with the result of the coin toss using 1 bit of information.
In the example below, on the other hand, we use an inefficient encoding: we always send 2 bits of information when we get heads and 2 bits when we get tails.
The entropy of the probability distribution is just 0.97 bits.
Yet the average message length, also known as the cross-entropy, is 2 bits.
By using 2 bits to encode the message, we implicitly assume a different probability distribution than the one that produced the coin toss. Remember that 2 bits would for example correspond to a distribution with 4 equally likely events, each occurring with a probability of 25%. In a way we can say that the cross-entropy allows us to measure the difference between two distributions. Only when the distribution that produced the event and the distribution we use to encode the message are identical does the cross-entropy reach its minimum value. In that case the cross-entropy and the entropy are identical.
Info
The cross-entropy is defined as the average message length. Given two distributions p(x) and q(x) we can calculate the cross-entropy H(p, q) = -\sum_x p(x) \log_2(q(x)).
In the example below the red distribution is p(x) and the yellow distribution is q(x). When you move the slider to the right, q(x) starts moving towards p(x) and you can observe that the cross-entropy gets lower and lower until its minimal value is reached. In that case the two distributions are identical and the cross-entropy is equal to the entropy.
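The coin example from above can be reproduced with a few lines of Python. The probabilities 0.6 and 0.4 are an assumption that matches the 0.97 bits of entropy quoted earlier, and the distribution q implied by the 2-bit encoding assigns 25% to each of four equally likely symbols.

```python
import numpy as np

def cross_entropy(p, q):
    """Average message length: H(p, q) = -sum p(x) * log2(q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # events with zero probability do not contribute
    return -np.sum(p[mask] * np.log2(q[mask]))

p = np.array([0.6, 0.4])    # assumed distribution that produces the coin tosses
q = np.array([0.25, 0.25])  # implied by the 2-bit code (2 of 4 equally likely symbols)

print(cross_entropy(p, p))  # ~0.971 bits: q equals p, cross-entropy equals entropy
print(cross_entropy(p, q))  # 2.0 bits: the inefficient encoding from the example
```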
Now it is time to come full circle and relate the calculation of the cross-entropy to our initial task: finding a loss function that is suited for classification tasks.
Let us assume that we are dealing with a problem where we have to classify an animal, based on certain features, into one of five categories: cat, dog, pig, bear or monkey. The cross-entropy deals with probability distributions, so we need to put the label into a format that represents a probability distribution. For example if we deal with a sample that depicts a cat, the true probability distribution would be 100% for the category cat and 0% for all other categories. This distribution is encoded as a so-called "one-hot" vector: a vector that contains a 1 for the relevant category and a 0 otherwise. This gives us the true distribution p(x).
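A one-hot vector can be constructed with a couple of lines of Python. The category order below is just the one used in this example.

```python
import numpy as np

categories = ["cat", "dog", "pig", "bear", "monkey"]

def one_hot(label):
    """Turn a class label into the true distribution p(x)."""
    p = np.zeros(len(categories))
    p[categories.index(label)] = 1.0
    return p

print(one_hot("cat"))  # [1. 0. 0. 0. 0.]
```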
The distribution that is produced by the sigmoid or the softmax function on the other hand is just an estimate. We designate this distribution q(x).
Now we have everything we need to calculate the cross-entropy. The closer the one-hot distribution and the distribution produced by the logistic regression or the neural network get, the lower the cross-entropy gets. Because all the weight of the one-hot vector is on a single event, the entropy of p(x) is exactly 0, which means the cross-entropy could theoretically also reach 0. Our goal in a classification task is to minimize the cross-entropy to get the two distributions as close as possible.
Below is an interactive example where the true label corresponds to the category cat. The estimated probabilities are far from the ground truth, which results in a relatively high cross-entropy. When you move the slider, the estimated probabilities start moving towards the ground truth, which pushes the cross-entropy down, until it reaches a value of 0.
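In code the calculation looks as follows. The predicted probabilities q below are made up for illustration; because all other entries of the one-hot vector are 0, only the entry of q at the true category contributes to the sum.

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])       # one-hot ground truth: "cat"
q = np.array([0.30, 0.25, 0.20, 0.15, 0.10])  # hypothetical softmax output

# Cross-entropy between the true and the estimated distribution.
print(-np.sum(p * np.log2(q)))  # -log2(0.30) ~ 1.737 bits

# A better estimate of the true category lowers the loss.
q_better = np.array([0.90, 0.04, 0.03, 0.02, 0.01])
print(-np.sum(p * np.log2(q_better)))  # -log2(0.90) ~ 0.152 bits
```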
In logistic regression we utilize the sigmoid activation function \hat{y} = \dfrac{1}{1 + e^{-(\mathbf{w^Tx}+b)}}, which produces values between 0 and 1. The sigmoid function can be used to differentiate between 2 categories. The sigmoid function produces the probability \hat{y} of belonging to the first category (e.g. cat), therefore 1 - \hat{y} is the probability of belonging to the second category (e.g. dog). If we additionally define that the label y is 1 when the sample is a cat and 0 when the sample is a dog, the cross-entropy reduces to the so-called binary cross-entropy: H(p, q) = -\big[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\big].
Info
When we are dealing with a classification problem we use the cross-entropy as the loss function. We use the binary cross-entropy when we have just 2 categories.
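Below is a minimal sketch of the binary cross-entropy for a single sample, combined with the sigmoid function. The weights, bias and features are made-up numbers, and we use the natural logarithm here, as is common in practice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    """-[y * log(y_hat) + (1 - y) * log(1 - y_hat)] for a single sample."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical weights, bias and features for a single sample labeled "cat" (y = 1).
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([2.0, 1.0])
y = 1

y_hat = sigmoid(w @ x + b)             # estimated probability of "cat"
print(binary_cross_entropy(y, y_hat))  # the closer y_hat is to 1, the lower the loss
```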
When we shift the weights and the bias of the sigmoid function, we can move the probability of belonging to a certain category closer to the ground truth in order to reduce the cross-entropy. In the next section we will demonstrate how we can utilize gradient descent for that purpose. For now you can play with the interactive example below.
In practice we always deal with a dataset, therefore the cross-entropy loss that we are going to optimize is going to be the average over the whole dataset.
Info
H(p, q) = -\dfrac{1}{n}\sum_i \Big[ y^{(i)} \log (\hat{y}^{(i)}) + (1 - y^{(i)}) \log ( 1 - \hat{y}^{(i)}) \Big]
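A minimal numpy version of this averaged loss could look as follows; the labels and predicted probabilities are made up for illustration.

```python
import numpy as np

def bce_loss(y, y_hat):
    """Binary cross-entropy averaged over the whole dataset."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and predicted probabilities for four samples.
y     = np.array([1, 0, 1, 0])
y_hat = np.array([0.8, 0.2, 0.6, 0.4])
print(bce_loss(y, y_hat))  # ~0.37
```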