Imagine
for a moment that you live in a city with a problem. The gardeners in
your city have recently taken to putting garden gnomes in their gardens,
and they’re mind-bogglingly ugly. The gardeners have even started
having a competition around it, seeing who can place the biggest gnome
in their garden. Obviously, this isn’t going to stand.
The mayor knows you’re an aspiring data scientist and approaches you for
help. He wants somebody to track the location of all of the gnomes, and
develop a way to predict their locations. It’s a ridiculous task, but
somebody has to do it.
The
mayor knew to ask you because you already created a data set
identifying the locations of all of the garden gnomes in the city when you were studying dimensionality reduction.
And as it turns out, the garden gnomes are all placed in a straight
line through the city. This means that you can create a tool that, given either the North-South or the East-West location of a garden gnome, will tell you the other.
This
is called Linear Regression. Linear Regression is the practice of
statistically calculating a straight line that demonstrates a
relationship between two different items. In this case the relationship
would be between the location of garden gnomes in the East-West
dimension, and the location of garden gnomes in the North-South
dimension. The result is a single equation empowering you to calculate
one if you know the other.
Note
that linear regression is the simplest form of regression there is
(Making it a good starting point!). There are two characteristics that
make it the simplest form of regression. First, it’s only capable of
capturing linear relationships. If there’s an exponential trend in your
data, better not use it. Logarithmic? Nope, not that either. There are
ways to massage the data to use linear regression on those data sets,
but it can’t be done straight away. Secondly, it’s only capable of
handling relationships between two variables. If you have more variables
than that in your data set, you need to start looking into multiple regression instead.
What is Linear Regression?
Linear
regression, at its core, is a way of calculating the relationship
between two variables. It assumes that there’s a direct correlation
between the two variables, and that this relationship can be represented
with a straight line.
These two variables are called the independent variable and the dependent variable, and they are given these names for fairly intuitive reasons. The independent variable
is so named because the model assumes that it can behave however it
likes, and doesn’t depend on the other variable for any reason. The dependent variable is the opposite; the model assumes that it is a direct result of the independent variable, so its value depends directly on the independent variable.
Linear
regression creates a linear mathematical relationship between these
two variables. It enables you to predict the dependent variable whenever the independent variable
is known. To bring this back to our somewhat ludicrous garden gnome
example, we could create a regression with the East-West location of the
garden gnome as the independent variable and the North-South location as the dependent variable. We could then calculate the North-South location of any gnome in the city so long as we know its East-West location.
Since
it’s such a simple form of regression, the governing equation for
linear regression is also quite simple. It takes the form:
y = B * x + A
where y is the dependent variable, x is the independent variable, B is the coefficient determining the slope of the line, and A is the coefficient determining the intercept.
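To make that concrete, here’s a minimal sketch of the equation as a python function; the coefficient values in the example are invented purely for illustration:

# The regression equation y = B * x + A as a prediction function.
def predict(x, A, B):
    """Return the predicted dependent value for an independent value x."""
    return B * x + A

# Example with invented coefficients: a slope of 2 and an intercept of 1
# predict a dependent value of 7 when the independent value is 3.
print(predict(3, A=1, B=2))  # prints 7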
How are the coefficients calculated?
Essentially,
the coefficients A and B are calculated to minimize the error between
the model’s predictions and the actual data in the training set (For
those who aren’t familiar with the term “training set” see How to Choose Between Multiple Models
for an explanation of why the original data set must be broken into
three different sets, what those sets are called, and what they’re used
for).
Keep in mind that the error between the data and the predictions is calculated as follows:
Error = Actual - Prediction
Therefore,
minimizing the error between the model predictions and the actual data
means performing the following steps for each x value in your data set:
- First, use the linear regression equation, with values for A and B, to calculate predictions for each value of x;
- Second, calculate the error for each value of x by subtracting the prediction for that x from the actual, known data;
- Third, sum the error of all of the points to identify the total error from a linear regression equation using those values for A and B.
Those
are the basic steps. But keep in mind that some errors would be
positive, while others would be negative. These errors would cancel each
other out and bring the resulting error closer to 0, despite there
being error in both readings. Take for instance two points, one with
error of 5 and the other with error of -10. Intuitively, these two points should be considered as causing 15 total points of
error, but the method described above treats them as causing -5 points of
error. To overcome this problem, algorithms developing linear regression
models use the squared error instead of simply the error. In other words, the formula for calculating the error takes the form:
Error = (Actual - Prediction)²
Since
negative values squared will always return positive values, this
prevents the errors from canceling each other out and making bad models
appear accurate.
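Putting those pieces together, here’s a rough sketch of how the squared error could be totaled for a candidate A and B; the data set in the example is invented purely for illustration:

# Sum the squared error of a candidate line y = B * x + A over a data set,
# assuming x and y are plain python lists of matching length.
def sum_squared_error(x, y, A, B):
    total = 0.0
    for x_i, y_i in zip(x, y):
        prediction = B * x_i + A      # first, predict each point
        error = y_i - prediction      # second, compare to the actual value
        total += error ** 2           # third, square and accumulate
    return total

# Invented example data: a line close to the data scores a small error,
# a line far from the data scores a large one.
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 8.1]
print(sum_squared_error(x, y, A=0, B=2))    # small total squared error
print(sum_squared_error(x, y, A=5, B=-1))   # large total squared error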
Since the linear regression model minimizes the squared error, the solution is referred to as the least squares solution. This is the name for the combination of A and B that returns the minimum squared error
over the data set. Guessing and checking A and B would be extremely
tedious. Using an optimization algorithm is another possibility, but
would probably be time consuming and annoying. Fortunately,
mathematicians have found an algebraic solution to this problem. The least squares solution can be found using the following two equations:
B = correlation(x, y) * σ(y) / σ(x)
A = mean(y) - B * mean(x)
where
σ represents the standard deviation of the values in the data set, mean represents their average, and correlation is a value
representing the strength of the correlation between the two variables. If you’re
doing this work in the python package pandas you’ll be able to use the DataFrame.mean() and DataFrame.std() functions to find the means and standard deviations, and numpy’s corrcoef function to find the correlation. For those who aren’t familiar with any of those terms, I recommend reading Python for Data Analysis as a way to get started.
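As a rough illustration, here’s how those two equations might be computed with pandas and numpy; the data values and the column names east_west and north_south are invented purely for this example:

import numpy as np
import pandas as pd

# Hypothetical gnome locations; both the values and the column names
# are made up purely for illustration.
df = pd.DataFrame({
    "east_west":   [1.0, 2.0, 3.0, 4.0, 5.0],   # independent variable
    "north_south": [2.1, 3.9, 6.2, 8.1, 9.8],   # dependent variable
})

x = df["east_west"]
y = df["north_south"]

correlation = np.corrcoef(x, y)[0, 1]    # strength of the linear relationship
B = correlation * y.std() / x.std()      # slope
A = y.mean() - B * x.mean()              # intercept

print(f"North-South = {B:.2f} * East-West + {A:.2f}")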
The fact that these two equations return the least squares solution
isn’t incredibly intuitive, but we can make sense of it pretty quickly.
Look at the equation for A. It essentially states that we need a value
which returns the average value of y (The dependent variable) when given the average value of x (The independent variable);
plugging mean(x) into the equation yields B * mean(x) + mean(y) - B * mean(x), which is simply mean(y).
It’s trying to create a line that runs through the center of the data
set. Now look at the equation for B. It states that the value of the dependent variable y should change by the standard deviation of y times the correlation between the two variables when the value of the independent variable
changes by the standard deviation of x. In other words, when x moves by one of its standard deviations, y moves by one of its standard deviations scaled by the correlation between the two.
Why does this work?
The least squares solution is typically used because it agrees with maximum likelihood estimation. The details are a bit beyond the scope of this article, but a good explanation can be found in Data Science from Scratch. Maximum likelihood estimation
is based around identifying the value that is most likely to create a
data set. Imagine a data set based around a parameter Z. If you don’t
actually know what Z is, then you could search for a value of Z that
most likely yields the data set. It’s not saying that you’ve certainly
found the right value for Z, but you have found the value of Z that
makes the observed data set the most probable.
This
sort of calculation can be applied to each data point in the data set,
calculating the values of A and B that make the data set most probable.
If you run through the math, which is shown directly in Data Science from Scratch, you discover that the least squares solution for A and B also maximizes the likelihood
for the data set. Which, as stated above, doesn’t prove that these are
the values driving the data set but does say that they’re the most
likely values.
How do I Know The Model Works Well?
As with all models, it is imperative that you test your model to ensure that it’s performing well.
This means comparing the model predictions to the actual data in the
training, validation, and testing data sets. The preferred method of
checking varies depending on the type of model, but for the linear
regression model this is typically done by calculating the coefficient of determination, or r² value. The coefficient of determination
captures how much of the trend in the data set can be correctly
predicted by the linear regression model. It’s a value ranging from 0 to
1, with lower values indicating worse fit and higher values indicating
better fit.
The coefficient of determination
is calculated based on the sum of squared errors divided by the total
squared variation of y values from their average value. That calculation
yields the fraction of variation in the dependent variable not captured by the model. Thus the coefficient of determination is 1 minus that value. Or, in math terms:
r² = 1 - (Sum of squared errors) / (Total sum of squares)
(Total sum of squares) = sum((y_i - mean(y))²)
(Sum of squared errors) = sum((Actual - Prediction)²)
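In python, that calculation might look something like the sketch below; the actual and predicted values are made up purely for illustration:

# Coefficient of determination (r²), assuming y_actual holds the observed
# dependent values and y_predicted holds the model's predictions for them.
def r_squared(y_actual, y_predicted):
    mean_y = sum(y_actual) / len(y_actual)
    total_sum_of_squares = sum((y_i - mean_y) ** 2 for y_i in y_actual)
    sum_of_squared_errors = sum(
        (y_i - p_i) ** 2 for y_i, p_i in zip(y_actual, y_predicted)
    )
    return 1 - sum_of_squared_errors / total_sum_of_squares

# Invented example: predictions that track the data closely give an r² near 1.
print(r_squared([2.1, 3.9, 6.2, 8.1], [2.0, 4.0, 6.0, 8.0]))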
What are the limits of Linear Regression?
Just like all algorithms, there are limits to the performance of Linear Regression.
As
we’ve previously noted, the Linear Regression model is only capable of
returning straight lines. This makes it wholly unsuited to data sets
with any sort of curve in them, such as exponential or logarithmic
trends.
Secondly, linear regression only works when there’s a single dependent variable and a single independent variable.
If you want to include multiple of either of those in your data set,
you’ll need to use multiple regression (Which, fortunately, I will be
writing about soon!).
Finally,
be very careful not to use a linear regression model to predict values
outside of the range of your training data set. There’s no way to know
that the same trends hold outside of the training data set, and a very
different model may be needed to predict the behavior of the data set
outside of those ranges. Because of this uncertainty, extrapolation can
lead to some very inaccurate predictions.
Wrapping it up
Linear
regression is a simple tool to study the mathematical relationships
between two different variables. It can be used on simple data sets,
with linear relationships between two variables. One variable is
treated as the independent variable because the model assumes that changes in the other variable don’t impact it. The other variable is treated as the dependent variable because the model assumes that its values are dependent on the independent variable.
To create a linear regression model, you need to find the terms A and B that provide the least squares solution, or that minimize the sum of the squared error over all dependent variable points in the data set. This can be done using a few equations, and the method is based on the maximum likelihood estimation.
As
with all algorithms, it’s critical that you check the performance of a
linear regression model against the training, validation, and testing
data sets. In the case of linear regression, the fit of the model is
typically tested and calculated using the coefficient of determination. The coefficient of determination, often presented as r², indicates how much of the trend in the dependent variable can be predicted by the model and how much of it cannot.
There
are several limitations to be aware of when using linear regression
models. First, never use linear regression if the trend in the data set
appears to be curved; no matter how hard you try, a linear model will
not fit a curved data set. Second, linear regression is only capable of
handling a single dependent variable and a single independent variable.
If you have multiple variables you want to consider, you need to use
multiple regression instead. And, finally, never use a linear regression
model to extrapolate beyond the bounds of the training data set. You
don’t know the trend of the data outside the training data set, and
extrapolation opens you up to wild prediction errors.