Linear Model

The term "linear regression" consists of two words that fully describe the type of model we are dealing with: linear and regression. The "regression" part signifies that our model predicts a numeric target variable based on given features, and that we are not dealing with a classification task. The "linear" part suggests that linear regression can only model a linear relationship between features and targets. To clarify what the words "linear relationship" mean, we present two examples below.

In the first scatterplot we could draw a line that goes from the coordinates (-100, -500) to the coordinates (100, 500). While there is some randomness in the data, the line would depict the relationship between the feature and the target relatively well. When we get new data points, we can use the line to predict the target and be relatively confident regarding the outcome.

[Scatterplot: Feature vs. Target, showing an approximately linear relationship]

In contrast, the data in the following scatterplot represents a nonlinear relationship between the feature and the target. Theoretically there is nothing that stops us from using linear regression for the below problem, but there are better alternatives (like neural networks) for nonlinear problems.

[Scatterplot: Feature vs. Target, showing a nonlinear relationship]

From basic math we know that in two-dimensional space we can draw a line using the equation $y = xw + b$, where $x$ is the only feature, $y$ is the target, $w$ is the weight that we use to scale the feature and $b$ is the bias. While we can easily understand that the feature $x$ is the input of our equation and the label $y$ is the output of the equation, we have a harder time imagining what role the weight $w$ and the bias $b$ play in the equation. Below we present two possible interpretations.

When we look at the equation $y = xw + b$ from the arithmetic perspective, we should notice two things. First, the output $y$ equals the bias when the input $x$ is 0: $y = 0 \cdot w + b = b$. The bias in a way encompasses a starting point for the calculation of the output. If for example we tried to model the relationship between age and height, even at birth (age 0) a human would have some average height, which would be encoded in the bias $b$. Second, for each unit of $x$, the output increases by exactly $w$. The equation $height = 5 \cdot age + b$ would indicate that on average a human grows by 5cm for each year of life. At this point you would hopefully interject that this relation is out of touch with reality. For one, the equation does not reflect that a human being grows up to a certain length, or that a child grows at a higher rate than a young adult. At a certain age people even start to shrink. While all these points are valid, we make specific assumptions when we model the world using linear regression.
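The arithmetic interpretation can be sketched in a few lines of Python. The values $w = 5$ (cm gained per year) and $b = 50$ (average height at birth in cm) are purely illustrative assumptions, not values from any real dataset.

```python
# Hypothetical height model: height = age * w + b
# w=5.0 and b=50.0 are illustrative assumptions, not real-world estimates
def predict_height(age, w=5.0, b=50.0):
    return age * w + b

# at age 0 the prediction equals the bias alone
print(predict_height(0))   # 50.0
# each additional year adds exactly w to the output
print(predict_height(10))  # 100.0
```

Note how the bias is the value of the output at input 0, and the weight is the constant change in the output per unit of input.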

Warning

When we use a linear regression model, we assume a linear relationship between the inputs and the output. If you apply linear regression to data that is nonlinear in nature, you might get illogical results.

When on the other hand we look at the equation $y = xw + b$ from the geometric perspective, we should realize that the weight determines the rotation (slope) of the line, while the bias determines the vertical position. Below we present an interactive example to demonstrate the impact of the weight and the bias on the regression line. You can move the two sliders to change the weight and the bias. Observe what we mean when we say rotation and position. Try to position the line such that it fits the data as well as possible.

[Interactive scatterplot: Feature vs. Target, with sliders for the weight and the bias]

We used a weight $w$ of 5 and a bias $b$ of 0, plus some randomness, to generate the data above. When you played with the sliders, you should have come relatively close to those values.
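A sketch of how such data could be generated: we draw features at random, apply the "true" weight of 5 and bias of 0, and add Gaussian noise. The seed, sample size, and noise level are arbitrary choices for illustration. Fitting a line to the noisy data recovers a slope close to 5.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-100, 100, size=200)          # random feature values
y = 5.0 * x + 0.0 + rng.normal(0, 25, size=200)  # w=5, b=0, plus noise

# a straight-line fit to the noisy data recovers a slope close to 5
slope, intercept = np.polyfit(x, y, deg=1)
```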

The weight and the bias are learnable parameters. The linear regression algorithm provides us with a way to find those parameters. You can imagine that the algorithm rotates and moves the line, until the line fits the data. This process is called data or curve fitting.

In practice we rarely deal with a dataset where we only have one feature. In that case our equation looks as follows.

$$y = x_1 w_1 + x_2 w_2 + \dots + x_n w_n + b$$

We can also use a more compact form and write the equation in vector form.

$$y = \mathbf{x} \mathbf{w}^T + b$$
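The vector form is simply a dot product between the feature vector and the weight vector, plus the bias. A minimal sketch with made-up values for three features:

```python
import torch

x = torch.tensor([2.0, -1.0, 0.5])  # one sample with three features
w = torch.tensor([1.5, 0.5, 2.0])   # hypothetical weights
b = 3.0                             # hypothetical bias

# x_1*w_1 + x_2*w_2 + x_3*w_3 + b = 3.0 - 0.5 + 1.0 + 3.0
y = x @ w + b
```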

In a three-dimensional space we calculate a two-dimensional plane that divides the coordinate system into two regions. This procedure is harder to imagine for more than three dimensions, but we still create a plane (a so-called hyperplane) in space. The weights are used to rotate the hyperplane, while the bias moves the plane.

When we use linear regression to make predictions based on features, we draw a "hat" over the $y$ value to indicate that we are dealing with a prediction from a model: $\hat{y}$. The $y$ value on the other hand represents the actual target from the dataset, the so-called ground truth. Usually we want to create predictions not for a single sample $\mathbf{x}$, but for a whole dataset $\mathbf{X}$.

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}$$

$\mathbf{X}$ is an $m \times n$ matrix, where $m$ (rows) is the number of samples and $n$ (columns) is the number of input features. We can multiply the dataset matrix $\mathbf{X}$ with the transposed weight vector $\mathbf{w}^T$ and add the bias $b$ to generate a prediction vector $\hat{\mathbf{y}}$.

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}^T + b$$

The advantage of the above procedure lies not only in the more compact representation; it also has practical implications. Matrix operations in all modern deep learning frameworks can be parallelized. Therefore when you utilize matrix notation in your code, you actually make use of that parallelism and can speed up your code tremendously. Think about it: each row of the dataset can be multiplied with the weight vector independently. By outsourcing the calculations to different CPU or GPU cores, a lot of computation time can be saved.
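We can verify that the single matrix operation computes exactly the same result as a per-row loop. This sketch uses arbitrary random shapes; the matrix form delegates the per-row independence to the framework, which can parallelize it.

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 2)  # 100 samples, 2 features
w = torch.randn(1, 2)    # weight row vector
b = torch.randn(1, 1)    # bias

# matrix form: one operation over the whole dataset
y_mat = X @ w.T + b

# loop form: one dot product per row, computed sequentially
y_loop = torch.stack([X[i] @ w[0] + b[0] for i in range(X.shape[0])])
```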

By this point you might have noticed that there is something fishy about the expression.

$$\hat{\mathbf{y}} = \mathbf{X}\mathbf{w}^T + b$$

On the one side we have a vector that results from $\mathbf{X}\mathbf{w}^T$, on the other side we have a scalar $b$. From a mathematical standpoint, adding a scalar to a vector is technically not allowed. From the programming standpoint this procedure is valid, because NumPy and all deep learning frameworks utilize a technique called broadcasting. We will have a closer look at broadcasting in our practical sessions; for now it is sufficient to know that broadcasting expands scalars, vectors and matrices in order for the calculations to make sense. In our example above, the scalar would be expanded into a vector of the same size as the vector that results from $\mathbf{X}\mathbf{w}^T$. We will often include notation that incorporates broadcasting in order to make the notation more similar to our Python code.
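A minimal sketch of broadcasting in PyTorch, with made-up values: the scalar bias is implicitly expanded to match the shape of the vector before the addition.

```python
import torch

Xw = torch.tensor([[1.0], [2.0], [3.0]])  # stand-in for X @ w.T, shape (3, 1)
b = torch.tensor(0.5)                     # a scalar bias

# b is broadcast (expanded) to shape (3, 1), then added element-wise
out = Xw + b
```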

Now let's see how we can implement this idea of a linear model in PyTorch.

import torch
import sklearn.datasets as datasets

We make use of the make_regression() function from the sklearn library to create a dataset with 100 samples and 2 features.

X, y = datasets.make_regression(n_samples=100, n_features=2, n_informative=2, noise=0.01)

The above function returns NumPy arrays X and y, and we transform those into PyTorch tensors.

X = torch.from_numpy(X).to(torch.float32)
y = torch.from_numpy(y).to(torch.float32)

We initialize the two weights and the bias randomly, using the torch.randn() function. This function returns random values drawn from the standard normal distribution.

w = torch.randn(1, 2)
b = torch.randn(1, 1)

The actual model predictions can be calculated using a one-liner.

y_hat = X @ w.T + b
y_hat.shape
torch.Size([100, 1])
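As an aside, PyTorch also ships a module that bundles the weight matrix and the bias into one object: torch.nn.Linear. A sketch showing that it performs the same computation as our manual expression (the random input is arbitrary):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 2)  # 100 samples, 2 features

# torch.nn.Linear holds a weight of shape (out_features, in_features)
# and a bias, and computes X @ weight.T + bias
linear = torch.nn.Linear(in_features=2, out_features=1)
y_hat = linear(X)

# manual computation with the layer's own parameters gives the same result
y_manual = X @ linear.weight.T + linear.bias
```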

While it is relatively easy to use a linear model in PyTorch, we have still not encountered any method to make the predictions as close to the true labels in the dataset as possible. In the next sections we are going to cover how the learning procedure actually works.