Sigmoid and Softmax
Let us assume that no classification algorithms have been invented yet and that we want to come up with the first classification algorithm. We are assuming that there should be learnable parameters \mathbf{w} and b and that the output of the model should correspond to the probability of belonging to one of two categories. We will extend our ideas to more categories at a later step.
Linear Regression
Let's see what happens when we simply use linear regression for classification tasks.
Our dataset contains two classes. We assign each of the classes either the label 0 or 1 and we need to train a model that produces values between 0 and 1. These values can be regarded as probabilities of belonging to category 1. If the output is 0.3 for example, the model predicts that we are dealing with category 1 with 30% probability and with category 0 with 70% probability.
We could draw a line just like the one below, and at first glance this seems to be a reasonable approach. Higher values of some feature correspond to a higher probability of belonging to the "blue" category and lower values of the same feature correspond to a lower probability.
While linear regression might work during training, we might get into trouble once we start facing new datapoints, because our model can theoretically produce results that are above 1 or below 0, values that cannot be interpreted as probabilities.
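To make this concrete, here is a minimal sketch with made-up parameters and feature values: a linear model whose outputs look like probabilities on the training data can easily produce impossible values on new, more extreme datapoints.

```python
import numpy as np

# Hypothetical parameters of a line fitted to the training data,
# chosen here purely for illustration.
w, b = 0.1, -0.2

x_train = np.array([3.0, 7.0, 10.0])
print(w * x_train + b)          # [0.1 0.5 0.8] -> these still look like probabilities

x_new = np.array([-5.0, 20.0])  # new, more extreme datapoints
print(w * x_new + b)            # [-0.7  1.8]  -> impossible "probabilities"
```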
Warning
Never use linear regression for classification tasks. There is no built-in mechanism that prevents linear regression from producing nonsensical probability results.
Threshold Activation
In our second attempt to construct a classification algorithm we could use the original threshold activation function that was used in the McCulloch and Pitts neuron.
We could use the threshold of 5, which would mean that each sample with a feature value above 5 is classified into the "blue" category and the rest would be classified as the "red" category.
f(x) = \left\{ \begin{array}{rcl} 0 & for & x \leq 5 \\ 1 & for & x > 5 \\ \end{array} \right.

While this rule perfectly separates the data into the two categories, the threshold function is not differentiable. A non-differentiable function would prevent us from applying gradient descent, which would limit our ability to learn optimal weights and biases.
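As a small illustration, the sketch below implements the threshold rule with the cutoff of 5. The output jumps abruptly from 0 to 1, so the gradient is 0 everywhere except at the jump itself, where it does not exist, leaving gradient descent with nothing useful to follow.

```python
import numpy as np

# The threshold rule from above: 0 for x <= 5, 1 for x > 5.
def threshold(x, cutoff=5.0):
    return np.where(x > cutoff, 1.0, 0.0)

x = np.array([2.0, 4.9, 5.1, 8.0])
print(threshold(x))  # [0. 0. 1. 1.] -> a hard jump, flat everywhere else
```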
Sigmoid
The sigmoid function \sigma(x) = \dfrac{1}{1 + e^{-x}} is an S-shaped function that is commonly used in machine learning to produce probabilities.
The sigmoid does not exhibit the problems that we faced with the two approaches above. \sigma(x) is always bounded between 0 and 1, no matter how large or how negative the inputs are. This allows us to interpret the results as probabilities. The sigmoid is also a softer version of the threshold function: it changes smoothly between the two extremes. The function is therefore differentiable, which allows us to use gradient descent to learn the weights and biases.
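A quick numerical sketch demonstrates both properties: even for extreme inputs the output stays strictly between 0 and 1, and the derivative \sigma(x)(1 - \sigma(x)) is defined everywhere.

```python
import numpy as np

# The sigmoid squashes any real number into the open interval (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-100.0, -2.0, 0.0, 2.0, 100.0])
print(sigmoid(x))                       # values in (0, 1), with sigmoid(0) = 0.5
print(sigmoid(x) * (1.0 - sigmoid(x)))  # the gradient exists for every input
```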
Usually the output of 0.5 (50%) is regarded as the cutoff point. That would mean that outputs above 0.5 are classified as category 1 and outputs below 0.5 as category 0.
In practice we combine linear regression with the sigmoid function, which forms the basis of logistic regression. The output of the linear regression is used as the input into the sigmoid.
Info
Logistic regression uses linear regression z = \mathbf{x} \mathbf{w}^T + b as the input into the sigmoid \hat{y} = \dfrac{1}{1 + e^{-z}}.
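The following sketch expresses the info box above as code; the feature matrix, weights and bias are made-up example values, not learned parameters.

```python
import numpy as np

# Logistic regression forward pass: linear regression followed by the sigmoid.
def predict_proba(X, w, b):
    z = X @ w + b                    # linear part: z = x w^T + b
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid turns z into a probability

X = np.array([[1.0, 2.0],
              [3.0, 0.5]])      # two samples, two features (made-up values)
w = np.array([0.8, -0.5])       # hypothetical weights
b = 0.1                         # hypothetical bias

print(predict_proba(X, w, b))   # one probability per sample, each in (0, 1)
```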
This procedure allows us to learn parameters \mathbf{w} and b which align the true categories \mathbf{y} with the predicted probabilities \mathbf{\hat{y}}. Below is an interactive example that lets you change the weight and the bias. Observe how the probabilities change based on the inputs. Using both sliders you can move and rotate the probabilities as much as you want. Try to find parameters that fit the data.
When we are dealing with a classification problem, we are trying to draw a decision boundary between the different classes in order to separate the data as well as possible. In the example below we have a classification problem with two features and two classes. We utilize logistic regression (the sigmoid function) with two weights w_1, w_2 and the bias b to draw a boundary. The boundary represents the exact cutoff, the 50% probability. On one side of the boundary you have \dfrac{1}{1 + e^{-(x_1w_1 + x_2w_2 + b)}} > 0.5, while on the other side of the boundary you have \dfrac{1}{1 + e^{-(x_1w_1 + x_2w_2 + b)}} < 0.5. By changing the weights and the bias you can rotate and move the decision boundary respectively.
When we apply gradient descent to logistic regression, we are essentially adjusting the weights and the bias in order to shift the decision boundary.
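The sketch below illustrates this with a toy two-feature dataset. It assumes the standard binary cross-entropy loss, whose gradients with respect to \mathbf{w} and b take a particularly simple form; every update nudges the decision boundary between the two clusters.

```python
import numpy as np

# Toy dataset: two Gaussian clusters, labelled 0 and 1 (invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
    grad_w = X.T @ (y_hat - y) / len(y)         # gradient of the loss w.r.t. w
    grad_b = np.mean(y_hat - y)                 # gradient of the loss w.r.t. b
    w -= lr * grad_w                            # each update shifts the
    b -= lr * grad_b                            # boundary x w^T + b = 0

print(w, b)  # the learned boundary separates the two clusters
```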
Softmax
Before we move on to the next section, let us briefly discuss which function can be used when we are faced with more than two categories.
Let us assume that we face a classification problem with d possible categories. Our goal is to calculate the probability of belonging to each of these categories. The softmax function takes a d-dimensional vector \mathbf{z} and returns a vector of the same size that contains the corresponding probabilities.
softmax(\mathbf{z}) = softmax \begin{pmatrix} \begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ \vdots \\ z_d \end{bmatrix} \end{pmatrix} = \begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ \vdots \\ p_d \end{bmatrix}

If we had four categories for example, the result might look as follows.
softmax(\mathbf{z}) = \begin{bmatrix} 0.05 \\ 0.1 \\ 0.8 \\ 0.05 \end{bmatrix}

Given these numbers, we would assume that it is most likely that the features belong to category 3.
The values \mathbf{z} that are used as input into the softmax function are called logits. You can imagine that each of the d logits is the output of a separate linear regression of the form z = \mathbf{x} \mathbf{w}^T + b.
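Put differently, we can stack the d weight vectors into a matrix and compute all logits at once. The sketch below uses made-up numbers for a single sample with two features and three categories.

```python
import numpy as np

# Each row of W acts as a separate linear regression producing one logit.
x = np.array([1.5, -0.5])        # one sample with two features (made-up)
W = np.array([[0.2, 0.7],
              [1.1, -0.3],
              [-0.4, 0.9]])      # one hypothetical weight vector per category
b = np.array([0.1, 0.0, -0.2])   # one bias per category

z = W @ x + b                    # the three logits
print(z)                         # these would then be passed to the softmax
```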
We calculate the probability for the k-th of the d categories using the following softmax equation.
softmax(z_k) = \dfrac{e^{z_k}}{\sum_{j=1}^d e^{z_j}}

Similar to the sigmoid function, the softmax function has several advantageous properties. The equation guarantees, for example, that the sum of the probabilities is exactly 1, thus avoiding any violations of the laws of probability. Additionally, as the name suggests, the function is "soft", which indicates that it is differentiable and can be used with gradient descent.
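Below is a minimal sketch of the softmax function with made-up logits. Subtracting the maximum logit before exponentiating is a common numerical trick that does not change the result but avoids overflow.

```python
import numpy as np

# Softmax: exponentiate the logits and normalize so the outputs sum to 1.
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtract the max logit for numerical stability
    return exp_z / exp_z.sum()

z = np.array([0.5, 1.2, 3.3, 0.5])  # d = 4 hypothetical logits
p = softmax(z)
print(p)           # one probability per category
print(p.sum())     # the probabilities sum to 1
print(p.argmax())  # index 2 -> the third category is the most likely one
```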