Tuesday, April 3, 2018

Boosting and Bagging: How To Develop A Robust Machine Learning Algorithm

https://towardsdatascience.com/how-to-develop-a-robust-algorithm-c38e08f32201


Bootstrapping/Bagging/Boosting

Machine learning and data science require more than just throwing data into a python library and utilizing whatever comes out.
Data scientists need to actually understand the data and the processes behind the data to be able to implement a successful system.
One key methodology to implementation is knowing when a model might benefit from utilizing bootstrapping methods. These are what are called ensemble models. Some examples of ensemble models are AdaBoost and Stochastic Gradient Boosting.
Why use ensemble models?
They can help improve algorithm accuracy or improve the robustness of a model. Two examples of this is boosting and bagging. Boosting and Bagging are must know topics for data scientists and machine learning engineers. Especially if you are planning to go in for a data science/machine learning interview.
Essentially, ensemble learning follows true to the word ensemble. Except, instead of having several people who are singing at different octaves to create one beautiful harmony (each voice filling in the void of the other). Ensemble learning uses hundreds to thousands of models of the same algorithm that work together to find the correct classification.
Another way to think about ensemble learning is the fable of the blind men and the elephant. Where each blind man found a feature of the elephant and they all thought it was something different. However, had they come together and discussed it, they might have been able to figure out what they were looking at.
Using techniques like boosting and bagging has lead to increased robustness of statistical models and decreased variance.
Now the question becomes, with all these different “B” words. What is the difference?

Bootstrapping

Let’s first talk about the very important concept of bootstrapping. This can occasionally be missed as many data scientists will go straight to explaining boosting and bagging. Both of which require bootstrapping.
Figure 1 Bootstrapping
In machine learning, the bootstrap method refers to random sampling with replacement. This sample is referred to as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics then it might have contained as a whole. Demonstrated in figure 1 where each sample population has different pieces, and none are identical. This would then affect the overall mean, standard deviation and other descriptive metrics of a data set. In turn, it can develop more robust models.
Bootstrapping is also great for small size data sets that can have a tendency to overfit. In fact, we recommended this to one company who was concerned because their data sets were far from “Big Data”. Bootstrapping can be a solution in this case because algorithms that utilize bootstrapping can be more robust and handle new data sets depending on the methodology chosen(boosting or bagging)
The reason to use the bootstrap method is because it can test the stability of a solution. By using multiple sample data sets and then testing multiple models, it can increase robustness. Perhaps one sample data set has a larger mean than another, or a different standard deviation. This might break a model that was overfit, and not tested using data sets with different variations.
One of the many reasons bootstrapping has become very common is because of the increase in computing power. This allows for many times more permutations to be done with different resamples than previously. Bootstrapping is used in both Bagging and Boosting, as will be discussed below.

Bagging

Bagging actually refers to (Bootstrap Aggregators). Most any paper or post that references using bagging algorithms will also reference Leo Breiman who wrote a paper in 1996 called “Bagging Predictors”.
Where Leo describes bagging as:
“Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.”
What Bagging does is help reduce variance from models that are might be very accurate, but only on the data they were trained on. This is also known as overfitting.
Overfitting is when a function fits the data too well. Typically this is because the actual equation is much too complicated to take into account each data point and outlier.
Figure 2 Overfitting
Another example of an algorithm that can overfit easily is a decision tree. The models that are developed using decision trees require very simple heuristics. Decision trees are composed of a set of if-else statements done in a specific order. Thus, if the data set is changed to a new data set that might have some bias or difference in spread of underlying features compared to the previous set. The model will fail to be as accurate. This is because the data will not fit the model as well(which is a backwards statement anyways).
Bagging gets around this by creating it’s own variance amongst the data by sampling and replacing data while it tests multiple hypothesis(models). In turn, this reduces the noise by utilizing multiple samples that would most likely be made up of data with various attributes(median, average, etc).
Once each model has developed a hypothesis. The models use voting for classification or averaging for regression. This is where the “Aggregating” in “Bootstrap Aggregating” comes into play. Each hypothesis has the same weight as all the others. When we later discuss boosting, this is one of the places the two methodologies differ.
Figure 3 Bagging
Essentially, all these models run at the same time, and vote on which hypothesis is the most accurate.
This helps to decrease variance i.e. reduce the overfit.

Boosting

Boosting refers to a group of algorithms that utilize weighted averages to make weak learners into stronger learners. Unlike bagging that had each model run independently and then aggregate the outputs at the end without preference to any model. Boosting is all about “teamwork”. Each model that runs, dictates what features the next model will focus on.
Boosting also requires bootstrapping. However, there is another difference here. Unlike in bagging, boosting weights each sample of data. This means some samples will be run more often than others.
Why put weights on the samples of data?
Figure 4 Boosting
When boosting runs each model, it tracks which data samples are the most successful and which are not. The data sets with the most misclassified outputs are given heavier weights. These are considered to be data that have more complexity and requires more iterations to properly train the model.
During the actual classification stage, there is also a difference in how boosting treats the models. In boosting, the model’s error rates are kept track of because better models are given better weights.
That way, when the “voting” occurs, like in bagging, the models with better outcomes have a stronger pull on the final output.

Summary

Boosting and bagging are both great techniques to decrease variance. Ensemble methods generally out perform a single model. This is why many of the Kaggle winners have utilized ensemble methodologies. One that was not discussed here was stacking. However, that requires it’s own post.
However, they won’t fix every problem, and they themselves have their own problems. There are different reasons you would use one over the other. Bagging is great for decreasing variance when a model is overfit. However, boosting is much more likely to be a better pick of the two methods. Boosting also is much more likely to cause performance issues. It is also great for decreasing bias in an underfit model.
This is where experience and subject matter expertise comes in! It can be easy to jump on the first model that works. However, it is important to analyze the algorithm and all the features it selects. For instance, if a decision tree sets specific leafs, the question becomes, why! If you can’t support it with other data points and visuals, it probably shouldn’t be implemented.
It is not just about trying AdaBoost, or Random forests on various data sets. Depending on the results an algorithm is getting, and what support is there, drives the final algorithm

Call To Action

Do you need a team of experienced data specialist to come in and help develop a data driven team or design and integrate a new system? Reach out to our team today!
Are you interested in learning more about data science and machine learning?
This article is slightly modified from my original which can be found here!



No comments:

Post a Comment