Mean Squared Error
While we can intuitively tell how far away our line is from some optimal location, we need a quantitative measure that the linear regression algorithm can use for optimization. In machine learning we use a measure called a loss function (also called an error function or cost function). The lower the value of the loss function, the closer we are to our goal. We can tweak the weights and biases to find a minimum value of the loss function.
To make our illustrations easier, we will use a dataset with only 4 samples, as shown in the illustration below.
We will start by drawing a 45-degree regression line from position (0, 0) to position (60, 60). While this looks "OK", we have no way to compare that particular line with any other line.
The first step is to calculate the error between the actual target y^{(i)} and the predicted value \hat{y}^{(i)} = x^{(i)}w + b for each data point i. We define the difference y^{(i)} - \hat{y}^{(i)} as the error. Visually, we can draw that error as the vertical line that connects the regression line with the true target.
Depending on the location and rotation of the regression line \hat{y} = xw + b and the actual target y^{(i)}, the target might be above or below the regression line, so the error can be either positive or negative. If we tried to sum up all the errors in the dataset, \sum_{i=1}^n (y^{(i)} - \hat{y}^{(i)}), the positive and the negative errors would offset each other. If the errors were symmetrical above and below the line, we could end up with a summed error of 0. This is not what we want. We want positive and negative errors to contribute equally to our loss measure, without offsetting each other, as the short sketch below demonstrates.
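To make the offsetting concrete, here is a minimal sketch in NumPy with four made-up data points (the values are illustrative assumptions, not the exact points from the figure). With the 45-degree line, i.e. w = 1 and b = 0, the positive and negative errors cancel out completely:

```python
import numpy as np

# hypothetical 4-sample dataset (values chosen purely for illustration)
x = np.array([10.0, 20.0, 40.0, 50.0])
y = np.array([25.0, 10.0, 55.0, 30.0])

# 45-degree regression line: w = 1, b = 0
w, b = 1.0, 0.0
y_hat = x * w + b

errors = y - y_hat      # some errors are positive, some negative
print(errors)           # [ 15. -10.  15. -20.]
print(errors.sum())     # 0.0 -> the raw errors offset each other
```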
The loss that is actually used in linear regression is called the mean squared error (MSE). The MSE squares each of the individual errors, (y^{(i)} - \hat{y}^{(i)})^2, to get rid of the negative sign. The average of these squared errors is the mean squared error.
Info
MSE = \dfrac{1}{n}\sum_{i=1}^n \big(y^{(i)} - \hat{y}^{(i)}\big)^2
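Continuing the toy example above (same assumed values), squaring each error removes the sign before we average:

```python
# squared errors are always non-negative
squared_errors = (y - y_hat) ** 2   # [225. 100. 225. 400.]

# the mean of the squared errors is the MSE
mse = squared_errors.mean()         # (225 + 100 + 225 + 400) / 4 = 237.5
print(mse)
```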
Remember that we should try to express all operations in matrix notation in order to make use of parallelization. For the mean squared error that would look as follows.
Info
\mathbf{\hat{y}} = \mathbf{X}\mathbf{w}^T \\ MSE = mean\Big(\big[\mathbf{y} - \mathbf{\hat{y}}\big]^2\Big)
In the first step we calculate many predictions \mathbf{\hat{y}} simultaneously and compute the vector of errors \mathbf{y} - \mathbf{\hat{y}}. Squaring the resulting vector means that each individual element of the vector is multiplied by itself in parallel. Finally, we calculate the mean of the vector. For that purpose, deep learning libraries provide a mean() operation that takes a vector as input and computes the mean of all individual scalars within that vector.
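Below is a sketch of the vectorized computation using PyTorch as one example of such a library (any array framework with a mean() reduction would work). As an assumption on top of the notation above, the bias is folded into the weight vector by appending a column of ones to the design matrix, and the toy values are the same made-up ones as before:

```python
import torch

# hypothetical design matrix: one feature plus a column of ones,
# so the bias can be folded into the weight vector
X = torch.tensor([[10.0, 1.0],
                  [20.0, 1.0],
                  [40.0, 1.0],
                  [50.0, 1.0]])
y = torch.tensor([25.0, 10.0, 55.0, 30.0])
w = torch.tensor([1.0, 0.0])        # weight and bias stacked into one vector

y_hat = X @ w                       # all predictions in a single matrix product
mse = torch.mean((y - y_hat) ** 2)  # element-wise square, then the mean() reduction
print(mse)                          # tensor(237.5000)
```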
We can visualize the mean squared error by drawing actual squares. Each data point has a corresponding square, and the larger the error, the larger the area of that square. Use the example below to move the weight and the bias, and observe how the mean squared error changes based on the parameters.
Different combinations of the weight w and the bias b produce different losses, and our job is to find the combination that minimizes the loss function. Obviously, it makes no sense to search for those parameters manually. The next section is therefore dedicated to a procedure that is commonly used to find the minimum loss.