Friday, August 9, 2019

Understanding the Fundamentals of Linear Regression

https://towardsdatascience.com/understanding-the-fundamentals-of-linear-regression-7e64afd614e1

Imagine for a moment that you live in a city with a problem. The gardeners in your city have recently taken to putting garden gnomes in their gardens, and they’re mind-bogglingly ugly. The gardeners have even started having a competition around it, seeing who can place the biggest gnome in their garden. Obviously, this isn’t going to stand.
The mayor knows you’re an aspiring data scientist and approaches you for help. He wants somebody to track the location of all of the gnomes and develop a way to predict their locations. It’s a ridiculous task, but somebody has to do it.
The mayor knew to ask you because you already created a data set identifying the locations of all of the garden gnomes in the city when you were studying dimensionality reduction. And as it turns out, the garden gnomes are all placed in a straight line through the city. This means that you can create tools that tell you the location of a garden gnome if you know where it’s located either North-South, or East-West. The tools will tell you the other location.
This is called Linear Regression. Linear Regression is the practice of statistically calculating a straight line that demonstrates a relationship between two different items. In this case the relationship would be between the location of garden gnomes in the East-West dimension, and the location of garden gnomes in the North-South dimension. The result is a single equation empowering you to calculate one if you know the other.
Note that linear regression is the simplest form of regression there is (making it a good starting point!). There are two characteristics that make it the simplest form of regression. First, it’s only capable of capturing linear relationships. If there’s an exponential trend in your data, better not use it. Logarithmic? Nope, not that either. There are ways to massage the data to use linear regression on those data sets, but it can’t be done straight away. Second, it’s only capable of handling relationships between two variables. If you have more variables than that in your data set, you need to start looking into multiple regression instead.

What is Linear Regression?

Linear regression, at its core, is a way of calculating the relationship between two variables. It assumes that there’s a direct correlation between the two variables, and that this relationship can be represented with a straight line.
These two variables are called the independent variable and the dependent variable, and they are given these names for fairly intuitive reasons. The independent variable is so named because the model assumes that it can behave however it likes, and doesn’t depend on the other variable for any reason. The dependent variable is the opposite; the model assumes that it is a direct result of the independent variable, and its value is highly dependent on the independent variable.
Linear regression creates a linear mathematical relationship between these two variables. It enables you to calculate a prediction for the dependent variable whenever the independent variable is known. To bring this back to our somewhat ludicrous garden gnome example, we could create a regression with the East-West location of the garden gnome as the independent variable and the North-South location as the dependent variable. We could then calculate the North-South location of any gnome in the city so long as we know its East-West location.
Since it’s such a simple form of regression, the governing equation for linear regression is also quite simple. It takes the form:
y = B * x + A
where y is the dependent variable, x is the independent variable, B is the coefficient determining the slope of the line, and A is the coefficient determining its intercept.
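As a quick illustration, here is a minimal Python sketch of that equation in code. The values chosen for A and B below are made up purely for demonstration, not fitted from any data:

    # A minimal sketch of the linear regression equation y = B * x + A.
    # The coefficient values here are illustrative only.
    def predict(x, A=2.0, B=0.5):
        """Return the predicted dependent variable for a given independent variable x."""
        return B * x + A

    # For example, the prediction when the independent variable is 10:
    print(predict(10))  # 0.5 * 10 + 2.0 = 7.0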

How are the coefficients calculated?

Essentially, the coefficients A and B are calculated to minimize the error between the model’s predictions and the actual data in the training set (for those who aren’t familiar with the term “training set,” see How to Choose Between Multiple Models for an explanation of why the original data set must be broken into three different sets, what those sets are called, and what they’re used for).
Keep in mind that error between the data and the predictions is calculated as follows:
Error = Actual - Prediction
Therefore, minimizing the error between the model predictions and the actual data means performing the following steps for each x value in your data set:
  • First, use the linear regression equation, with values for A and B, to calculate predictions for each value of x;
  • Second, calculate the error for each value of x by subtracting the prediction for that x from the actual, known data;
  • Third, sum the error of all of the points to identify the total error from a linear regression equation using those values for A and B.
Those are the basic steps. But keep in mind that some errors would be positive, while others would be negative. These errors would cancel each other out and bring the resulting error closer to 0, despite there being error in both readings. Take for instance two points, one with an error of 5 and the other with an error of -10. While these two points together should be counted as 15 total points of error, the method described above treats them as causing -5 points of error. To overcome this problem, algorithms developing linear regression models use the squared error instead of simply the error. In other words, the formula for calculating the error takes the form:
Error = (Actual - Prediction)²
Since negative values squared will always return positive values, this prevents the errors canceling each other out and making bad models appear accurate.
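To make that concrete, here is a small Python sketch of the squared-error calculation. It assumes x and y are equal-length lists of observed data points, and that A and B are candidate coefficients being evaluated:

    # A sketch of the total squared error for candidate coefficients A and B.
    # x and y are assumed to be equal-length lists of observed data points.
    def total_squared_error(x, y, A, B):
        total = 0.0
        for x_i, y_i in zip(x, y):
            prediction = B * x_i + A   # step 1: predict for each value of x
            error = y_i - prediction   # step 2: actual minus prediction
            total += error ** 2        # step 3: square so positive and negative errors don't cancel
        return total

    # Example with a tiny made-up data set.
    print(total_squared_error([1, 2, 3], [2.1, 3.9, 6.2], A=0.0, B=2.0))  # roughly 0.06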
Since the linear regression model minimizes the squared error, the solution is referred to as the least squares solution. This is the name for the combination of A and B that returns the minimum squared error over the data set. Guessing and checking A and B would be extremely tedious. Using an optimization algorithm is another possibility, but would probably be time consuming and annoying. Fortunately, mathematicians have found an algebraic solution to this problem. The least squares solution can be found using the following two equations:
B = correlation(x, y) * σ(y) / σ(x)
A = mean(y) - B * mean(x)
where σ represents the standard deviation, mean represents the average of the values in the data set, and correlation is a value representing the strength of the correlation between the two variables. If you’re doing this work in the Python package pandas you’ll be able to use the DataFrame.mean() and DataFrame.std() functions to identify the means and standard deviations, and NumPy’s corrcoef function can find the correlation. For those who aren’t familiar with any of those terms, I recommend reading Python for Data Analysis as a way to get started.
The fact that these two equations return the least squares solution isn’t incredibly intuitive, but we can make sense of it pretty quickly. Look at the equation for A. It essentially states that we need a value which returns the average value of y (the dependent variable) when given the average value of x (the independent variable). In other words, it forces the line to run through the center of the data set. Now look at the equation for B. It states that when the independent variable changes by one standard deviation of x, the dependent variable y should change by one standard deviation of y times the correlation between the two variables. This is a very wordy way of saying that the slope scales the spread of y against the spread of x by the strength of their correlation.
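If your data happens to live in a pandas DataFrame, a minimal sketch of these two equations might look like the following. The DataFrame and its column names (“x” and “y”) are invented here just for illustration:

    import pandas as pd

    # A small, made-up data set; in practice this would be your training data.
    df = pd.DataFrame({
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [2.2, 4.1, 5.9, 8.3, 9.8],
    })

    # B = correlation(x, y) * σ(y) / σ(x)
    B = df["x"].corr(df["y"]) * df["y"].std() / df["x"].std()

    # A = mean(y) - B * mean(x)
    A = df["y"].mean() - B * df["x"].mean()

    print(f"slope B = {B:.3f}, intercept A = {A:.3f}")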

Why does this work?

The least squares solution is typically used because of its connection to maximum likelihood estimation. The details are a bit beyond the scope of this article, but a good explanation can be found in Data Science from Scratch. Maximum likelihood estimation is based around identifying the parameter value that is most likely to have created the observed data set. Imagine a data set generated by a process governed by a parameter Z. If you don’t actually know what Z is, then you can search for the value of Z that most likely yields the data set. It’s not saying that you’ve certainly found the right value for Z, but you have found the value of Z that makes the observed data set the most probable.
This sort of calculation can be applied across every data point in the data set, calculating the values of A and B that make the data set most probable. If you run through the math, which is shown directly in Data Science from Scratch, you discover that the least squares solution for A and B also maximizes the likelihood of the data set. Which, as stated above, doesn’t prove that these are the values driving the data set, but does say that they’re the most likely values.

How do I Know The Model Works Well?

As with all models, it is imperative that you test your model to ensure that it’s performing well. This means comparing the model predictions to the actual data in the training, validation, and testing data sets. The preferred method of checking varies depending on the type of model, but for the linear regression model this is typically done by calculating the coefficient of determination, or r² value. The coefficient of determination captures how much of the trend in the data set can be correctly predicted by the linear regression model. It’s a value ranging from 0 to 1, with lower values indicating worse fit and higher values indicating better fit.
The coefficient of determination is calculated from the sum of squared errors divided by the total squared variation of y values from their average value. That calculation yields the fraction of variation in the dependent variable not captured by the model. Thus the coefficient of determination is 1 minus that value. Or, in math terms:
r² = 1 - (Sum of squared errors) / (Total sum of squares)
Total sum of squares = Sum((y_i - mean(y))²)
Sum of squared errors = Sum((Actual - Prediction)²)
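As a rough sketch, the same calculation in Python might look like this, assuming x and y are lists of observed data and the coefficients A and B have already been fitted:

    # A sketch of the coefficient of determination (r²) for a fitted line.
    # x and y are observed data; A and B are the fitted intercept and slope.
    def r_squared(x, y, A, B):
        mean_y = sum(y) / len(y)
        sum_squared_errors = sum((y_i - (B * x_i + A)) ** 2 for x_i, y_i in zip(x, y))
        total_sum_of_squares = sum((y_i - mean_y) ** 2 for y_i in y)
        return 1 - sum_squared_errors / total_sum_of_squares

    # Example with a tiny made-up data set and coefficients.
    print(r_squared([1, 2, 3, 4], [2.0, 4.1, 6.2, 7.9], A=0.0, B=2.0))  # roughly 0.997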

What are the limits of Linear Regression?

Just like all algorithms, there are limits to the performance of Linear Regression.
As we’ve previously noted, the Linear Regression model is only capable of returning straight lines. This makes it wholly unsuited to data sets with any sort of curve in them, such as exponential or logarithmic trends.
Secondly, linear regression only works when there’s a single dependent variable and a single independent variable. If you want to include more than one of either in your data set, you’ll need to use multiple regression (which, fortunately, I will be writing about soon!).
Finally, be very careful not to use a linear regression model to predict values outside of the range of your training data set. There’s no way to know that the same trends hold outside of the training data set, and a very different model may be needed to predict the behavior of the data set outside of those ranges. Because of this uncertainty, extrapolation can lead to some very inaccurate predictions.

Wrapping it up

Linear regression is a simple tool to study the mathematical relationship between two different variables. It can be used on simple data sets with linear relationships between two variables. One variable is treated as the independent variable because the model assumes that changes in the other variable don’t impact it. The other variable is treated as the dependent variable because the model assumes that its values are dependent on the independent variable.
To create a linear regression model, you need to find the terms A and B that provide the least squares solution, or that minimize the sum of the squared error over all points in the data set. This can be done using a few equations, and the method is grounded in maximum likelihood estimation.
As with all algorithms, it’s critical that you check the performance of a linear regression model against the training, validation, and testing data sets. In the case of linear regression, the fit of the model is typically measured using the coefficient of determination. The coefficient of determination, often written as r², represents how much of the trend in the dependent variable can be predicted by the model, and how much of that trend cannot.
There are several limitations to be aware of when using linear regression models. First, never use linear regression if the trend in the data set appears to be curved; no matter how hard you try, a linear model will not fit a curved data set. Second, linear regression is only capable of handling a single dependent variable and a single independent variable. If you have multiple variables you want to consider, you need to use multiple regression instead. And, finally, never use a linear regression model to extrapolate beyond the bounds of the training data set. You don’t know the trend of the data outside the training data set, and extrapolation opens you up to wild prediction errors.
