Wednesday, June 27, 2018

Why do stacked ensemble models win data science competitions?

https://blogs.sas.com/content/subconsciousmusings/2017/05/18/stacked-ensemble-models-win-data-science-competitions/

Ensemble methods are commonly used to boost predictive accuracy by combining the predictions of multiple machine learning models. The traditional wisdom has been to combine so-called “weak” learners. However, a more modern approach is to create an ensemble of a well-chosen collection of strong yet diverse models.
Building powerful ensemble models has many parallels with building successful human teams in business, science, politics, and sports. Each team member makes a significant contribution and individual weaknesses and biases are offset by the strengths of other members.
The simplest kind of ensemble is the unweighted average of the predictions of the models that form a model library. For example, if a model library includes three models for an interval target (as shown in the following figure), the unweighted average would entail dividing the sum of the predicted values of the three candidate models by three. In an unweighted average, each model takes the same weight when an ensemble model is built.
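As a minimal illustration (the model names and predicted values below are made up), the unweighted average is just the element-wise mean of the candidate models' predictions:

```python
import numpy as np

# Hypothetical predictions from three candidate models for five observations
pred_forest = np.array([0.2, 0.5, 0.1, 0.9, 0.4])
pred_gbm    = np.array([0.3, 0.4, 0.2, 0.8, 0.5])
pred_nn     = np.array([0.1, 0.6, 0.3, 0.7, 0.6])

# Unweighted average: each model receives the same weight (1/3)
ensemble_pred = (pred_forest + pred_gbm + pred_nn) / 3
print(ensemble_pred)  # → [0.2 0.5 0.2 0.8 0.5]
```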
Averaging predictions to form ensemble models.
More generally, you can think about using weighted averages. For example, you might believe that some of the models are better or more accurate and you want to manually assign higher weights to them. But an even better approach might be to estimate these weights more intelligently by using another layer of learning algorithm. This approach is called model stacking.
Model stacking is an efficient ensemble method in which the predictions, generated by using various machine learning algorithms, are used as inputs in a second-layer learning algorithm. This second-layer algorithm is trained to optimally combine the model predictions to form a new set of predictions. For example, when linear regression is used as second-layer modeling, it estimates these weights by minimizing the least square errors. However, the second-layer modeling is not restricted to only linear models; the relationship between the predictors can be more complex, opening the door to employing other machine learning algorithms.
Model stacking uses a second-level algorithm to estimate prediction weights in the ensemble model.

Winning data science competitions with ensemble modeling

Ensemble modeling and model stacking are especially popular in data science competitions, in which a sponsor posts a training set (which includes labels) and a test set (which does not include labels) and issues a global challenge to produce the best predictions of the test set for a specified performance criterion. The winning teams almost always use ensemble models instead of a single fine-tuned model. Often individual teams develop their own ensemble models in the early stages of the competition, and then join forces in the later stages.
On the popular data science competition site Kaggle you can explore numerous winning solutions through its discussion forums to get a flavor of the state of the art. Another popular data science competition is the KDD Cup. The following figure shows the winning solution for the 2015 competition, which used a three-stage stacked modeling approach.

The figure shows that a diverse set of 64 single models were used to build the model library. These models are trained by using various machine learning algorithms. For example, the green boxes represent gradient boosting models (GBM), pink boxes represent neural network models (NN), and orange boxes represent factorization machines models (FM). You can see that there are multiple gradient boosting models in the model library; they probably vary in their use of different hyperparameter settings and/or feature sets.
At stage 1, the predictions from these 64 models are used as inputs to train 15 new models, again by using various machine learning algorithms. At stage 2 (ensemble stacking), the predictions from the 15 stage 1 models are used as inputs to train two models by using gradient boosting and linear regression. At stage 3 ensemble stacking (the final stage), the predictions of the two models from stage 2 are used as inputs in a logistic regression (LR) model to form the final ensemble.
In order to build a powerful predictive model like the one that was used to win the 2015 KDD Cup, building a diverse set of initial models plays an important role! There are various ways to enhance diversity such as using:
  • Different training algorithms.
  • Different hyperparameter settings.
  • Different feature subsets.
  • Different training sets.
A simple way to enhance diversity is to train models by using different machine learning algorithms. For example, adding a factorization model to a set of tree-based models (such as random forest and gradient boosting) provides a nice diversity because a factorization model is trained very differently than decision tree models are. For the same machine learning algorithm, you can enhance diversity by using different hyperparameter settings and subsets of variables. If you have many features, one efficient method is to choose subsets of the variables by simple random sampling. Choosing subsets of variables could also be done in a more principled fashion based on some computed measure of importance, although that introduces the large and difficult problem of feature selection.
In addition to using various machine learning training algorithms and hyperparameter settings, the KDD Cup solution shown above uses seven different feature sets (F1-F7) to further enhance the diversity.  Another simple way to create diversity is to generate various versions of the training data. This can be done by bagging and cross validation.
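Two of these diversity sources (random feature subsets and bootstrapped training sets) can be sketched around a single algorithm; this uses scikit-learn as a stand-in for the tools in the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

models, feature_sets = [], []
for i in range(5):
    # Diversity source 1: a random subset of the features
    cols = rng.choice(X.shape[1], size=10, replace=False)
    # Diversity source 2: a bootstrap sample of the training rows (bagging)
    rows = rng.choice(X.shape[0], size=X.shape[0], replace=True)
    m = GradientBoostingClassifier(random_state=i).fit(X[np.ix_(rows, cols)], y[rows])
    models.append(m)
    feature_sets.append(cols)

# Average the member probabilities to form the ensemble prediction
probs = np.mean([m.predict_proba(X[:, c])[:, 1]
                 for m, c in zip(models, feature_sets)], axis=0)
```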

How to avoid overfitting stacked ensemble models

Overfitting is an omnipresent concern in building predictive models, and every data scientist needs to be equipped with tools to deal with it. An overfitting model is complex enough to perfectly fit the training data, but it generalizes very poorly for a new data set. Overfitting is an especially big problem in model stacking, because so many predictors that all predict the same target are combined. Overfitting is partially caused by this collinearity between the predictors.
The most efficient techniques for training models (especially during the stacking stages) include using cross validation and some form of regularization. To learn how we used these techniques to build stacked ensemble models, see our recent SAS Global Forum paper, "Stacked Ensemble Models for Improved Prediction Accuracy." That paper also shows how you can generate a diverse set of models by various methods (such as forests, gradient boosted decision trees, factorization machines, and logistic regression) and then combine them with stacked ensemble techniques such as regularized regression methods, gradient boosting, and hill climbing methods.
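The paper's SAS implementation isn't reproduced here, but both techniques can be sketched in scikit-learn: out-of-fold predictions from cross validation feed the stacker, and an L2 penalty regularizes the stacking weights (the dataset and model choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=15, random_state=0)

base_models = [RandomForestClassifier(n_estimators=50, random_state=0),
               GradientBoostingClassifier(random_state=0)]

# Cross validation: each base model's "training" predictions are produced
# out-of-fold, so the stacker never sees predictions made on data the base
# model was fit on -- the main defense against overfitting the stack.
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Regularization: L2-penalized logistic regression shrinks the stacking
# weights, which helps with the collinearity between the base predictions.
stacker = LogisticRegression(penalty="l2", C=1.0).fit(meta_X, y)
```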
The following image provides a simple summary of our ensemble approach. The complete model building approach is explained in detail in the paper. A computationally intense process such as this benefits greatly from running in the distributed execution environment offered by the SAS® Viya platform, by using SAS® Visual Data Mining and Machine Learning.
A diverse set of models combined with stacked ensemble techniques.
Applying stacked models to real-world big data problems can produce greater prediction accuracy and robustness than do individual models. The model stacking approach is powerful and compelling enough to alter your initial data mining mindset from finding the single best model to finding a collection of really good complementary models.
Of course, this method does involve additional cost both because you need to train a large number of models and because you need to use cross validation to avoid overfitting. However, SAS Viya provides a modern environment that enables you to efficiently handle this computational expense and manage an ensemble workflow by using parallel computation in a distributed framework.
To learn more, check out our paper, "Stacked Ensemble Models for Improved Prediction Accuracy," and  read the SAS Visual Data Mining and Machine Learning documentation.

Data Science at Toyota Connected

https://towardsdatascience.com/data-science-at-toyota-connected-69bf50982b09

Presented at Data Science Salon in Dallas by Brian Kursar, Vice President and Chief Data Scientist at Toyota Connected.
Data is everywhere: in our digital footprint, in our financial system, and in our cities, down to the very cars we drive. Vehicles have become increasingly smart, and are now rich sources of data from which to derive valuable insights about customer behavior. We were thrilled to be joined by Brian Kursar, Vice President and Chief Data Scientist at Toyota Connected, at the Data Science Salon in Dallas. He imagines a future where cars continue to add value to their drivers far after they leave the dealership. Here’s the transcript of his fascinating talk.
I’m very excited to talk to everyone about Toyota Connected! First off let me just ask who here has heard of Toyota Connected. No, not Toyota — Toyota Connected. Wow, very very cool! For those of you who don’t know, Toyota Connected is a brand new company. We are about three minutes away from the corporate office. And we are a for-profit company. What we do is we are the arm for data science and data engineering for Toyota. We are a start-up in the sense that we truly are a startup — we have about 200 engineers, we work in a different building, our culture is completely different than Toyota Motors in North America, but we are powered by Toyota. We have a lot of backing by the parent company and that really allows us to do a lot of things [that are] very innovative [in] a very different type of culture, where we’re empowering our team with the ability to make decisions and to follow through on those decisions. As I mentioned, it’s a completely different office, and if you walk into our office you will notice we have a dog-friendly policy and we have free lunches for everyone. As a matter of fact, one of my favorite things about Toyota Connected is they actually label the vegan soups, which is something that makes me very happy.
Let me talk a little bit about where we see things and where Toyota Connected fits in. We see that the car is really an essential piece of the internet of everything. You really start out with the Toyota Connected car. And what is that? Behind the scenes, every Toyota Connected car coming out since July 2017 is able to transmit sensor data representing various things such as whether or not the windows are down, GPS, speedometer, odometer, steering angle. But all of this we do only with the consent of the customer. These vehicles actually are dead when you go to the dealership. But we will actually walk you through some of the various use cases for our safety connect program, and this is what I’ll be talking about today. However we only enable it once the customer understands what data we’re collecting and how we’re using that data to create new and exciting services for them. The average customer drives about 48 minutes per day — about 500+ unique data points are generated every 200 milliseconds, and that really comes out to about 7.2 million data points per connected vehicle per day. [That’s] A LOT OF DATA! Petabytes of data! What do we do with it? First and foremost, as I mentioned earlier, [we] write data services that drive customer satisfaction; we’re looking to create new and exciting services that make driving safer, more convenient, and fun. Next we’ll use that data to really derive new insights to make our products better.
In the very short time that Toyota Connected has been around (two years now), we’ve had a number of milestones. In April 2017, we worked with the folks at Toyota Connected Japan on a project called Japan Taxi, which I’ll be talking about in a moment. The connected car went live in July of 2017, and that was for the model year 2018 Camry. We then looked at actually using that data in what we call a car share pilot on the Island of Hawaii with a company that does distribution for our vehicles there. And then finally we went live with the car share pilot we now call Hui, as well as going forward with Avis to be able to connect the vehicles in their rental program transactions. Japan Taxi was our first really deep dive into the connected car because this was done before vehicles in the U.S. were connected. Actually this was done before vehicles in the US or Japan were connected. For this pilot we teamed up with TRI. For those who don’t know, TRI is the research arm of Toyota that focuses on the autonomous vehicle. This was an opportunity to really work with them to collect data and provide them data from actual people driving taxis in Japan. With this project, we used special aftermarket devices for eight hours a day every day — and actually it’s still going now — we are collecting the data from these trips. If you look here, this is actually a really quick video of our application that we created. In this application, each dot represents a vehicle and all this is collected in real time — you can actually drill in to one of the vehicles and then see the vehicles driving. This one here is driving late at night (our morning, night in Japan) on the streets.
What do we use this for? This is actually what we’re doing: leveraging machine learning and object recognition and then providing that and those videos to TRI, the research institute, to be able to take what they’re finding and create their own algorithms to improve the autonomous vehicle. Another thing we do, outside of research, is look at new ways to provide new services for our customers. One of the things is we’re developing a driver score that’s gonna be live probably in the next four to six months. Actually Demuth’s working on it — he’s sitting in the back there and he can talk to you more about that if you’re interested. Here’s what a driving score is: we have a set of metrics or rules that NHTSA provides that enable us to take what we call CANbus data, or data that’s coming out of the vehicle, and derive insights and scores on different types of events. Primarily we’re looking at longitudinal and lateral g-force, the location, speed, and then how much brake pressure you’re applying. To give you an example, let’s really drill down into four trips. The green will represent what we call smooth driving: not going past the speed limit, not doing any harsh braking, not making any harsh right or left turns or over speeding. The red there is what we would call harsh braking. Then you’ve got that maroon, which I don’t think you can see well in here, which represents over speeding. Above the line there you’ve got harsh acceleration, and here I don’t think we have any hard left turns. What does this look like drilling down even further? Here’s one of those trips, and as you can see here the green just drills down. The green shows that the person driving here is going 37.96 miles per hour, the speed limit at that point is 40 miles per hour, therefore it’s green.
There’s no harsh braking, there’s no longitudinal or lateral g-force popping out, and as you can see for the most part this person is driving smoothly. Just about 15 seconds later you can actually see that this person is now speeding. The speed limit there is 30 miles per hour and this person is going 49.56 miles per hour. Keep in mind every one of these dots is a single second. Four seconds later this customer actually hits the brakes really hard, and now you’re able to see that red line or the red circle which shows harsh braking.
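The per-second labeling he describes could be sketched as a simple rule set; the thresholds and column names below are hypothetical, chosen only to illustrate the kind of rules involved, not Toyota's actual NHTSA-derived metrics:

```python
import pandas as pd

# One row per second of a trip; values and thresholds are hypothetical,
# chosen only to illustrate the kind of rules described in the talk
trip = pd.DataFrame({
    "speed_mph": [37.96, 42.0, 49.56, 50.0, 20.0],
    "limit_mph": [40, 40, 30, 30, 30],
    "long_g":    [0.05, 0.1, 0.1, 0.1, -0.55],  # braking shows up as negative g
})

def label_second(row, over_speed_mph=5, harsh_brake_g=-0.5):
    # Hypothetical per-second rules: harsh braking wins, then over speeding
    if row["long_g"] <= harsh_brake_g:
        return "harsh_braking"   # red
    if row["speed_mph"] > row["limit_mph"] + over_speed_mph:
        return "over_speeding"   # maroon
    return "smooth"              # green

trip["event"] = trip.apply(label_second, axis=1)
```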
That’s nice information — what are we doing with that? We’re able to provide the customer with driving tips. These are tips that will help them understand their overall trip score, what their acceleration is, what their speed is. We are able to now take these tips and show them how they can get better miles per gallon. We’re also able to allow them to say, this is my data and I’m going to actually do something with it. They can take this data and they can send it out to another service called Toyota Information Insurance Management Services. If they are a good driver, they can send this data to different insurance companies and have them bid for discounts for that customer. US models starting with the 2018 Camry, which came out in July, have these sensors. Now these sensors are actually dead at the dealership — you have to actually go through a walkthrough where the dealer talks to you about what data is collected and how it’s being collected, and then you can actually opt in. We don’t collect the data unless you opt in, but it’s for the most recent Camry, the upcoming RAV4, the upcoming Avalon, and I think there’s a handful of Lexus vehicles as well. Fleet is one of our number one customers, because fleet customers want to know things that are a lot more detailed in terms of vehicle location. They want to understand such things as, are the windows open or closed, has the car been in a collision, what is the fuel level.
A lot of things that I mentioned earlier with the Avis project that’s actually gone live for a fleet to be able to understand the health of the vehicle and to cut down on the time that they’re spending on the checkout process. It’s not in blockchain but we do have a data store that the data is being saved in, yes. Question: hey why do you call it a G Force because what you’re really measuring is stress, strain, and shear forces? G forces are typically used as a nomenclature when you have a force large enough to make it feel like a percentage of gravity — at least half a G. Right, so the transfer of the G Force is from kinetic energy to potential energy. That’s how we are able to understand our harsh cornering, harsh braking as well. It’s not just the brake pressure at all, true, but I think that’s the way that we look at it. There is a guideline and they actually consider it as g-force. You can actually get it after the trip is completed, we do this at a trip level. We do have the potential to provide it in real time but we actually only provide that at the end of the trip for at least the next version that will be coming out.
The next type of service that we provide is what we call collision detection. Very similarly we’re looking at the longitudinal lateral g-force, the acceleration, brake pressure. Here I’d like to say that you know the data really tells a story and this is just an example — where we have the longitudinal, lateral, and vertical g-force; so that’s the X, Y, & Z axis here. And here you can see the acceleration — a red line here and so as the customer is driving and they accelerate you can actually see that in that line coming down here and it goes to about there as they stop. Now the moment you see that transfer of energy — you can see that it’s right about here — and that’s actually what we’re able to use to understand collision notifications. The problem is that — this person was lucky because the airbag was triggered — the way the sensors are set up on the vehicle as well as how we are measuring that, we were able to understand front collisions and we’ve been doing that for many, many years. But the problem is that the airbag does not go off when you have a rear-end collision or side collision or if you were to flip over and find yourself in the bottom of a ditch. And so because of that we’ve realized that we have to now really start looking at the data a little bit differently and understand that some of the airbag sensors that yes, they do trigger a notification and they will call the paramedics — these are things that are not enough to ensure the safety of our customers.
What we’ve been doing now, is looking at classifying crashes into three different buckets, and then also looking at how do we eliminate some of the false positives. For instance, this is the area right here where you’re going between five and eight point five in the magnitude of the g-force and that’s where we traditionally will have airbags deployed. If you then go and look at areas that are not being looked at — these are things that we mentioned, you know the high to medium/low speed crashes where it is a side impact, where it is a rear impact, where the cars flipped over. Now those are the areas where we need to be able to understand but also eliminate the false positive. Harsh braking and harsh cornering are absolutely false positives. What about other things like hitting a shopping cart? Well that’s not so bad. What about going over a speed bump? Well that’s also not a collision. We teamed up with a number of companies to pull in data so that we can actually compare some of the things like video data, data where they’ve actually done crash tests — and then pass those through our models to be able to derive kind of what we call the area of opportunity or the areas where we can now provide newer and better services than were available before.
Different drivers who drive in the same car have different styles — like my wife and I, we share one of those Camrys. In the future though I think we will have head units [that] are very much like Netflix, where you can actually use a profile and based on that profile it’ll be able to keep your settings from a temperature perspective, what radio stations you listen to, what are the places that you like to eat, or do recommendations. However, we don’t have that today in our vehicles.
Today we do use cloud, and we only use cloud for what we’re doing today in terms of metadata management. When this application was created, we had specifications coming from our product engineers that really defined all the data. It’s telemetry data, it’s all structured, and we have data dictionaries on everything to help the data scientists understand what it is. One of the things that we are doing as a new company is really starting on our journey to data science. I was actually hired to build out a data science practice for the company. We pulled all of our data scientists together to really talk about the things that are possible with this data — we do this on a regular basis.
There’s a lot of things that we see from a services perspective that we can provide to the customer. In fact, we only do this to provide services for the customer. There’s no reason to use the data unless we can make our products better and we can provide these new types of services for our customers. For instance, in this case we envision that when our collision notification is ready to be deployed — this is something that we’ll be calling you over the telephone and having someone saying “hey we noticed that there’s been a collision, are you okay?” — if someone doesn’t answer we dispatch a unit to that location. Or being able to essentially have a notification pop-up on their phone, saying do you need to send assistance? There’s going to be low impact use cases where someone got rear-ended, we still want to be there for our customer. I think that there was a statistic that one of our data scientists provided me, which was, one out of every ten collisions that ends in a fatality could have been prevented if we had the ability to get someone dispatched quicker. For me, that’s something that I’m definitely looking towards seeing what we can do on our side to make a difference there and to make our vehicles safer.
We’ve talked a lot about what we do as a company. From a company culture perspective, we are absolutely focused on bringing in the best talent. We are really connected and are committed to helping folks understand what we are all about, what we do and entice anyone that might be interested to work for us. We have a number of open positions and please come see me if you’re interested :).
I bought a Toyota Camry mostly because I wanted to understand the full experience, what it is that our dealerships are truly saying to the customer, and just so I could understand and be able to talk about it as a Toyota customer. I was actually very impressed with the way they walked me through everything and how they showed us these are things that you can do to enable these services and this is the data we collect, and the reality is that when my wife is driving the car of course I’m gonna opt in. Why? Because I want safety connect. Why? Because if she’s ever in an accident they’re there for her. These are services that I think people are going to opt in for because they provide a genuine value add to the vehicle. I bought the vehicle also because it has what they call smart sense, to be able to understand if someone’s to the right or left of me before I’m making a lane change. All of this really comes down to safety features. From a scale perspective we definitely recognize that the cost to be able to provide these services is something that we’re grappling with. We’re absolutely looking at ways to optimize algorithms — how we’re storing the data, when and how much we’re storing — to really only hold the value-add attributes to be able to run these algorithms and provide these services. Thank you very much!
Join us at the next DSS near you:
We’ve got a lineup of equally impressive speakers from companies like Viacom, Netflix, Buzzfeed, Forbes, Verizon, Nielsen, Comcast, Bloomberg, Uber, Google and many, many more.


Monday, June 25, 2018

Using Keras+TensorFlow to solve NCFM - Leaderboard Top 5%

https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/discussion/26202

Github: https://github.com/pengpaiSH/Kaggle_NCFM
Step1. Download dataset from https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data
Step2. Use split_train_val.py to split the labeled data into training and validation sets. Usually, 80% for training and 20% for validation is a good start.
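The actual script is in the linked repo; as a rough sketch of what such a split does (the function name and folder layout are assumptions, matching the one-folder-per-class structure Keras expects):

```python
import os
import random
import shutil

def split_train_val(src_dir, train_dir, val_dir, val_ratio=0.2, seed=0):
    """Randomly copy ~val_ratio of each class's images into a validation folder.

    Assumes src_dir contains one sub-folder per fish class, as in the NCFM data.
    """
    random.seed(seed)
    for cls in os.listdir(src_dir):
        files = os.listdir(os.path.join(src_dir, cls))
        random.shuffle(files)
        n_val = int(len(files) * val_ratio)
        for split, names in (("val", files[:n_val]), ("train", files[n_val:])):
            dst = os.path.join(val_dir if split == "val" else train_dir, cls)
            os.makedirs(dst, exist_ok=True)
            for name in names:
                shutil.copy(os.path.join(src_dir, cls, name),
                            os.path.join(dst, name))
```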
Step3. Use train.py to train an Inception_V3 network. The best model and its weights will be saved as "weights.h5".
Step4. Use predict.py to predict labels for the test images and generate the submission file "submit.csv". Note that this submission results in a top-50% ranking on the leaderboard.
Step5. In order to improve our ranking, we use data augmentation for the test images. The intuition behind this is similar to multi-crop evaluation, which makes use of the idea of voting. predict_average_augmentation.py implements this idea and results in a top-10% ranking (Public Score: 1.09) on the leaderboard.
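A generic sketch of this prediction-averaging idea (the repo's predict_average_augmentation.py is the real implementation; the predictor and augmentation functions here are dummies):

```python
import numpy as np

def predict_with_tta(predict_fn, augment_fns, images):
    """Average predictions over several augmented views of each test image.

    predict_fn: maps a batch of images to class probabilities
    augment_fns: list of functions, each producing one augmented view
    Mimics multi-crop voting: each view gets one 'vote', and the votes
    are averaged into the final prediction.
    """
    views = [predict_fn(aug(images)) for aug in augment_fns]
    return np.mean(views, axis=0)

# Example with a dummy predictor and a horizontal-flip augmentation
images = np.random.rand(4, 8, 8, 3)
dummy_predict = lambda x: np.tile(x.mean(axis=(1, 2, 3))[:, None], (1, 8))
preds = predict_with_tta(dummy_predict,
                         [lambda x: x, lambda x: x[:, :, ::-1]],
                         images)
```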
Step 6. Note that there is still plenty of room for improvement. For example, we could split the data into different training and validation sets by cross-validation, e.g. k-fold. Then we train k models based on these splits. We average the predictions output by the k models as the final submission. This strategy will result in a top-5% ranking (Public Score: 1.02) on the leaderboard. We will leave the implementation as a practice for readers :)
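A sketch of this k-fold averaging strategy, using a toy dataset and model in place of the image data and Inception_V3 networks:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Stand-ins for the image data and the test set in the post
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_test = X[:50]

test_preds = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Train one model per fold on that fold's training split
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    test_preds.append(model.predict_proba(X_test))

# The final submission is the average of the k models' predictions
submission = np.mean(test_preds, axis=0)
```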
Step 7. If you want to improve your ranking further, object detection is your next direction!
Update and Note: In order to use flow_from_directory(), you should create a folder named test_stg1 and put the original test_stg1 inside it.

Deep Learning With Python Notebooks

http://nbviewer.jupyter.org/github/fchollet/deep-learning-with-python-notebooks/blob/master/2.1-a-first-look-at-a-neural-network.ipynb

https://github.com/fchollet/deep-learning-with-python-notebooks

Wednesday, June 20, 2018

Machine Learning Kaggle Competition Part Two: Improving

https://towardsdatascience.com/machine-learning-kaggle-competition-part-two-improving-e5b4d61ab4b8

Feature engineering, feature selection, and model evaluation

Like most problems in life, there are several potential approaches to a Kaggle competition:
  1. Lock yourself away from the outside world and work in isolation
I recommend against the “lone genius” path, not only because it’s exceedingly lonely, but also because you will miss out on the most important part of a Kaggle competition: learning from other data scientists. If you work by yourself, you end up relying on the same old methods while the rest of the world adopts more efficient and accurate techniques.
As a concrete example, I recently have been dependent on the random forest model, automatically applying it to any supervised machine learning task. This competition finally made me realize that although the random forest is a decent starting model, everyone else has moved on to the superior gradient boosting machine.
The other extreme approach is also limiting:
  2. Copy one of the leaders’ scripts (called “kernels” on Kaggle), run it, and shoot up the leaderboard without writing a single line of code
I also don’t recommend the “copy and paste” approach, not because I’m against using others’ code (with proper attribution), but because you are still limiting your chances to learn. Instead, what I do recommend is a hybrid approach: read what others have done, understand and even use their code, and build on others’ work with your own ideas. Then, release your code to the public so others can do the same, expanding the collective knowledge of the community.

In the second part of this series about competing in a Kaggle machine learning competition, we will walk through improving on our initial submission that we developed in the first part.
The major results documented in this article are:
  • An increase in ROC AUC from a baseline of 0.678 to 0.779
  • Over 1000 places gained on the leaderboard
  • Feature engineering to go from 122 features to 1465
  • Feature selection to reduce the final number of features to 342
  • Decision to use a gradient boosting machine learning model
We will walk through how we achieve these results — covering a number of major ideas in machine learning and building on others’ code where applicable. We’ll focus on three crucial steps of any machine learning project:
  1. Feature engineering
  2. Feature selection
  3. Model evaluation
To get the most out of this post, you’ll want to follow the Python notebooks on Kaggle (which will be linked to as they come up). These notebooks can be run on Kaggle without downloading anything on your computer so there’s little barrier to entry! I’ll hit the major points at a high-level in this article, with the full details in the notebooks.

Brief Recap

If you’re new to the competition, I highly recommend starting with this article and this notebook to get up to speed.
The Home Credit Default Risk competition on Kaggle is a standard machine learning classification problem. Given a dataset of historical loans, along with clients’ socioeconomic and financial information, our task is to build a model that can predict the probability of a client defaulting on a loan.
In the first part of this series, we went through the basics of the problem, explored the data, tried some feature engineering, and established a baseline model. Using a random forest and only one of the seven data tables, we scored a 0.678 ROC AUC (Receiver Operating Characteristic Area Under the Curve) on the public leaderboard. (The public leaderboard is calculated with only 20% of the test data and the final standings usually change significantly.)
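ROC AUC can be computed directly with scikit-learn; the labels and predicted probabilities below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# ROC AUC compares predicted default probabilities against true labels;
# these values are invented for illustration, not from the competition
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9]
print(round(roc_auc_score(y_true, y_prob), 3))  # → 0.938
```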
To improve our score, in this article and a series of accompanying notebooks on Kaggle, we will concentrate primarily on feature engineering and then on feature selection. Generally, the largest benefit relative to time invested in a machine learning problem will come in the feature engineering stage. Before we even start trying to build a better model, we need to focus on using all of the data in the most effective manner!

Notes on the Current State of Machine Learning

Much of this article will seem exploratory (or maybe even arbitrary) and I don’t claim to have made the best decisions! There are a lot of knobs to tune in machine learning, and often the only approach is to try out different combinations until we find the one that works best. Machine learning is more empirical than theoretical and relies on testing rather than working from first principles or a set of hard rules.
In a great blog post, Pete Warden explained that machine learning is a little like banging on the side of the TV until it works. This is perfectly acceptable as long as we write down the exact “bangs” we made on the TV and the result each time. Then, we can analyze the choices we made, look for any patterns to influence future decisions, and find which method works the best.
My goal with this series is to get others involved with machine learning, put my methods out there for feedback, and document my work so I can remember what I did for the next time! Any comments or questions, here or on Kaggle, are much appreciated.

Feature Engineering

Feature engineering is the process of creating new features from existing data. The objective is to build useful features that can help our model learn the relationship between the information in a dataset and a given target. In many cases — including this problem — the data is spread across multiple tables. Because a machine learning model must be trained with a single table, feature engineering requires us to summarize all of the data in one table.
This competition has a total of 7 data files. In the first part, we used only a single source of data, the main file with socioeconomic information about each client and characteristics of the loan application. We will call this table app. (For those used to Pandas, a table is just a dataframe.)
Main training dataframe
We can tell this is the training data because it includes the label, TARGET. A TARGET value of 1 indicates a loan which was not repaid.
The app dataframe is tidy structured data: there is one row for every observation — a client’s application for a loan — with the columns containing the features (also known as the explanatory or predictor variables). Each client’s application — which we will just call a “client” — has a single row in this dataframe identified by the SK_ID_CURR. Because each client has a unique row in this dataframe, it is the parent of all the other tables in the dataset as indicated by this diagram showing how the tables are related:
Relationship of data files (Source)
When we make our features, we want to add them to this main dataframe. At the end of feature engineering, each client will still have only a single row, but with many more columns capturing information from the other data tables.

The six other tables contain information about clients’ previous loans, both with Home Credit (the institution running the competition), and other credit agencies. For example, here is the bureau dataframe, containing client’s previous loans at other financial institutions:
bureau dataframe, a child of app
This dataframe is a child table of the parent app: for each client (identified by SK_ID_CURR) in the parent, there may be many observations in the child. These rows correspond to multiple previous loans for a single client. The bureau dataframe in turn is the parent of the bureau_balance dataframe, where we have monthly information for each previous loan.
Let’s look at an example of creating a new feature from a child dataframe: the count of the number of previous loans for each client at other institutions. Even though I wrote a post about automated feature engineering, for this article we will stick to doing it by hand. The first Kaggle notebook to look at is here: a comprehensive guide to manual feature engineering.
Calculating this one feature requires grouping (using groupby) the bureau dataframe by the client id, calculating an aggregation statistic (using agg with count), and then merging (using merge) the resulting table with the main dataframe. This means that for each client, we are gathering up all of their previous loans and counting the total number. Here it is in Python:
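A minimal sketch of these steps on toy dataframes (the column names SK_ID_CURR and SK_ID_BUREAU come from the competition data; the values are made up, and the notebook runs this on the full tables):

```python
import pandas as pd

# Toy stand-ins for the competition tables (the real data has many more columns).
app = pd.DataFrame({'SK_ID_CURR': [100001, 100002, 100003]})
bureau = pd.DataFrame({'SK_ID_CURR': [100001, 100001, 100002],
                       'SK_ID_BUREAU': [5001, 5002, 5003]})

# Group the bureau data by client id and count the previous loans.
previous_loan_counts = (bureau.groupby('SK_ID_CURR', as_index=False)['SK_ID_BUREAU']
                              .count()
                              .rename(columns={'SK_ID_BUREAU': 'previous_loan_counts'}))

# Merge onto the main dataframe; clients with no previous loans get NaN, so fill with 0.
app = app.merge(previous_loan_counts, on='SK_ID_CURR', how='left')
app['previous_loan_counts'] = app['previous_loan_counts'].fillna(0)
print(app)
```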
app dataframe with new feature in second column
Now our model can use the information of the number of previous loans as a predictor for whether a client will repay a loan. To inspect the new variable, we can make a kernel density estimate (kde) plot. This shows the distribution of a single variable and can be thought of as a smoothed histogram. To see if the distribution of this feature varies based on whether the client repaid her/his loan, we can color the kde by the value of the TARGET:
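The notebook draws this with seaborn’s kdeplot; a rough sketch of the same idea using scipy’s gaussian_kde is below (the feature values here are simulated and the output filename is my own choice, purely for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Synthetic stand-in for the app dataframe with the new feature and the label.
counts = rng.poisson(5, size=1000).astype(float)
target = rng.integers(0, 2, size=1000)

# One smoothed density curve per value of TARGET.
xs = np.linspace(counts.min(), counts.max(), 200)
fig, ax = plt.subplots()
for label in (0, 1):
    density = gaussian_kde(counts[target == label])(xs)
    ax.plot(xs, density, label=f'TARGET == {label}')
ax.set_xlabel('previous_loan_counts')
ax.set_ylabel('Density')
ax.legend()
fig.savefig('kde_previous_loan_counts.png')
```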
There does not appear to be much of a difference in the distribution, although the peak of the TARGET==1 distribution is slightly to the left of the TARGET==0 distribution. This could indicate clients who did not repay the loan tend to have had fewer previous loans at other institutions. Based on my extremely limited domain knowledge, this relationship would make sense!
Generally, we do not know whether a feature will be useful in a model until we build the model and test it. Therefore, our approach is to build as many features as possible, and then keep only those that are the most relevant. “Most relevant” does not have a strict definition, but we will see some ways we can try to measure this in the feature selection section.

Now let’s look at capturing information not from a direct child of the app dataframe, but from a child of a child of app! The bureau_balance dataframe contains monthly information about each previous loan. This is a child of the bureau dataframe so to get this information into the main dataframe, we will have to do two groupbys and aggregates: first by the loan id (SK_ID_BUREAU) and then by the client id.
As an example, if we want to calculate for each client the average of the max number of MONTHS_BALANCE for each previous loan in the bureau_balance dataframe, we can do this:
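A sketch of the two-step aggregation on toy data (real column names from the competition, made-up values):

```python
import pandas as pd

# Toy versions of the tables.
bureau = pd.DataFrame({'SK_ID_CURR': [100001, 100001, 100002],
                       'SK_ID_BUREAU': [5001, 5002, 5003]})
bureau_balance = pd.DataFrame({'SK_ID_BUREAU': [5001, 5001, 5002, 5003],
                               'MONTHS_BALANCE': [-2, -1, -4, -3]})

# First aggregation: max MONTHS_BALANCE for each previous loan.
loan_max = (bureau_balance.groupby('SK_ID_BUREAU', as_index=False)['MONTHS_BALANCE']
                          .max()
                          .rename(columns={'MONTHS_BALANCE': 'months_balance_max'}))

# Attach the client id, then aggregate again: average of the per-loan max per client.
loan_max = loan_max.merge(bureau[['SK_ID_BUREAU', 'SK_ID_CURR']], on='SK_ID_BUREAU')
client_avg = (loan_max.groupby('SK_ID_CURR', as_index=False)['months_balance_max']
                      .mean()
                      .rename(columns={'months_balance_max': 'client_months_balance_max_mean'}))
print(client_avg)
```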
app dataframe with new feature in second column
Distribution of new feature
This was a lot of code for a single feature, and you can easily imagine that the manual feature engineering process gets tedious after a few features! That’s why we want to write functions that take these individual steps and repeat them for us on each dataframe.
Instead of repeating code over and over, we put it into a function — called refactoring — and then call the function every time we want to perform the same operation. Writing functions saves us time and allows for more reproducible workflows because it will execute the same actions in exactly the same way every time.
Below is a function based on the above steps that can be used on any child dataframe to compute aggregation statistics on the numeric columns. It first groups the columns by a grouping variable (such as the client id), calculates the mean, max, min, sum of each of these columns, renames the columns, and returns the resulting dataframe. We can then merge this dataframe with the main app data.
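A simplified version of such a function is sketched below. The real one in the notebook also deals with id columns and a few edge cases; the name agg_numeric and the column-naming scheme here are just one reasonable choice:

```python
import pandas as pd

def agg_numeric(df, group_var, df_name):
    """Aggregate the numeric columns of `df` by `group_var`.

    Computes count, mean, max, min, and sum for every numeric column,
    then renames the results `df_name_column_stat` so the source table
    stays obvious after merging with the main dataframe.
    """
    # Keep only the numeric columns (plus the grouping variable).
    numeric_df = df.select_dtypes('number').copy()
    numeric_df[group_var] = df[group_var]

    # Group by the id and compute the aggregation statistics.
    agg = numeric_df.groupby(group_var).agg(['count', 'mean', 'max', 'min', 'sum'])

    # Flatten the MultiIndex columns into names like 'bureau_AMT_CREDIT_SUM_mean'.
    agg.columns = [f'{df_name}_{col}_{stat}' for col, stat in agg.columns]
    return agg.reset_index()

# Tiny demonstration on a toy bureau table (ids dropped first so they
# are not aggregated as if they were real numbers).
bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                       'SK_ID_BUREAU': [10, 11, 12],
                       'AMT_CREDIT_SUM': [1000.0, 3000.0, 500.0]})
stats = agg_numeric(bureau.drop(columns=['SK_ID_BUREAU']), 'SK_ID_CURR', 'bureau')
print(stats.columns.tolist())
```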
(This function draws heavily on this kernel from olivier on Kaggle).
(Half of the lines of code for this function are documentation. Writing proper docstrings is crucial not only for others to understand our code, but so we can understand our own code when we come back to it!)
To see this in action, refer to the notebook, but clearly we can see this will save us a lot of work, especially with six child dataframes to process.

This function handles the numeric variables, but that still leaves the categorical variables. Categorical variables, often represented as strings, can only take on a limited number of values (in contrast to continuous variables which can be any numeric value). Machine learning models cannot handle string data types, so we have to find a way to capture the information in these variables in a numeric form.
As an example of a categorical variable, the bureau table has a column called CREDIT_ACTIVE that has the status of each previous loan:
Two columns of the bureau dataframe showing a categorical variable (CREDIT_ACTIVE)
We can represent this data in a numeric form by counting the number of each type of loan that each client has had. Moreover, we can calculate the normalized count of each loan type by dividing the count for one particular type of loan by the total count. We end up with this:
Categorical CREDIT_ACTIVE features after processing
Now these categorical features can be passed into a machine learning model. The idea is that we capture not only the number of each type of previous loan, but also the relative frequency of that type of loan. As before, we don’t actually know whether these new features will be useful and the only way to say for sure is to make the features and then test them in a model!
Rather than doing this by hand for every child dataframe, we again can write a function to calculate the counts of categorical variables for us. Initially, I developed a really complicated method for doing this involving pivot tables and all sorts of aggregations, but then I saw other code where someone had done the same thing in about two lines using one-hot encoding. I promptly discarded my hours of work and used this version of the function instead!
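A sketch of that one-hot-encoding approach (the function name count_categorical is mine, and the real version in the notebook differs in the details):

```python
import pandas as pd

def count_categorical(df, group_var, df_name):
    """One-hot encode the categorical columns of `df`, then compute the
    count (sum) and normalized count (mean) of each category per group."""
    # One-hot encode only the string/categorical columns.
    categorical = pd.get_dummies(df.select_dtypes('object'))
    categorical[group_var] = df[group_var]

    # Summing the dummies gives the count of each category; the mean
    # gives the normalized count (fraction of that client's loans).
    agg = categorical.groupby(group_var).agg(['sum', 'mean'])
    agg.columns = [f'{df_name}_{col}_{stat}' for col, stat in agg.columns]
    return agg.reset_index()

# Toy example with the real CREDIT_ACTIVE column name.
bureau = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                       'CREDIT_ACTIVE': ['Closed', 'Active', 'Closed']})
counts = count_categorical(bureau, 'SK_ID_CURR', 'bureau')
print(counts)
```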
This function once again saves us massive amounts of time and allows us to apply the same exact steps to every dataframe.
Once we write these two functions, we use them to pull all the data from the seven separate files into a single training dataframe (and a single testing dataframe). If you want to see this implemented, you can look at the first and second manual engineering notebooks. Here’s a sample of the final data:
Using information from all seven tables, we end up with a grand total of 1465 features! (From an original 122).
How do we know if any of these features are helpful? One method is to calculate the Pearson correlation coefficient between the variables and the TARGET. This is a relatively crude measure of importance, but it can serve as an approximation of which variables are related to a client’s ability to repay a loan. Below are the most correlated variables with the TARGET:
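On synthetic data, this calculation is just a call to corr (EXT_SOURCE_1 is a real column name from the competition, but the values below are simulated purely to make the calculation runnable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic stand-in for the merged training data.
train = pd.DataFrame({'TARGET': rng.integers(0, 2, 500)})
train['EXT_SOURCE_1'] = -0.6 * train['TARGET'] + rng.normal(size=500)  # related to the label
train['noise'] = rng.normal(size=500)                                  # pure noise

# Pearson correlation of every numeric column with the label, sorted.
correlations = train.corr()['TARGET'].drop('TARGET').sort_values()
print(correlations.head())   # most negatively correlated
print(correlations.tail())   # most positively correlated
```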
Most Positive (left) and Negative (right) correlated variables with the TARGET
The EXT_SOURCE_ variables were from the original features, but some of the variables we created are among the top correlations. However, we want to avoid reading too much into these numbers. Anytime we make a ton of features, we can run into the multiple comparisons problem: the more comparisons we make — in this case correlations with the target — the more likely some of them are to be large due to random noise. With correlations this small, we need to be especially careful when interpreting the numbers.
The most negatively correlated variable we made, client_bureau_balance_counts_mean, represents the average for each client of the count of the number of times a loan appeared in the bureau_balance data. In other words, it is the average number of monthly records per previous loan for each client. The kde plot is below:
Now that we have 1465 features, we run into the problem of too many features! More menacingly, this is known as the curse of dimensionality, and it is addressed through the crucial step of feature selection.

Feature Selection

Too many features can slow down training, make a model less interpretable, and, most critically, reduce the model’s generalization performance on the test set. When we have irrelevant features, these drown out the important variables and as the number of features increases, the number of data points needed for the model to learn the relationship between the data and the target grows exponentially (curse of dimensionality explained).
After going to all the work of making these features, we now have to select only those that are “most important” or equivalently, discard those that are irrelevant.
The next notebook to go through is here: a guide to feature selection which is fairly comprehensive although it still does not cover every possible method!
There are many ways to reduce the number of features, and here we will go through three methods:
  1. Removing collinear variables
  2. Removing variables with many missing values
  3. Using feature importances to keep only “important” variables

Remove Collinear Variables

Collinear variables are variables that are highly correlated with one another. These variables are redundant in the sense that we only need to keep one out of each pair of collinear features in order to retain most of the information in both. The definition of highly correlated can vary and this is another of those numbers where there are no set rules! As an example of collinear variables, here is a plot of the median apartment size vs average apartment size:
To identify highly correlated variables, we can calculate the correlation of every variable in the data with every other variable (this is quite a computationally expensive process)! Then we select the upper triangle of the correlation matrix and remove one variable from every pair of highly correlated variables based on a threshold. This is implemented in code below:
(This code is adapted from this blog post.)
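A sketch of the upper-triangle approach on toy data (the column names are made up; in the notebook this runs over the full 1465-column correlation matrix):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
train = pd.DataFrame({'a': x,
                      'b': x + rng.normal(scale=0.01, size=200),  # nearly identical to 'a'
                      'c': rng.normal(size=200)})                 # independent

threshold = 0.9
corr_matrix = train.corr().abs()

# Upper triangle of the correlation matrix, so each pair is considered once.
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# Drop one variable from every pair correlated above the threshold.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
train = train.drop(columns=to_drop)
print('Dropped:', to_drop)
```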
In this implementation, I use a correlation coefficient threshold of 0.9 to remove collinear variables. So, for each pair of features with a correlation greater than 0.9, we remove one of the pair of features. Out of 1465 total features, this removes 583, indicating many of the variables we created were redundant.

Remove Missing Columns

Of all the feature selection methods, this seems the most simple: just eliminate any columns above a certain percentage of missing values. However, even this operation brings in another choice to make, the threshold percentage of missing values for removing a column.
Moreover, some models, such as the Gradient Boosting Machine in LightGBM, can handle missing values with no imputation and then we might not want to remove any columns at all! However, because we’ll eventually test several models requiring missing values to be imputed, we’ll remove any columns with more than 75% missing values in either the training or testing set.
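A minimal sketch of the missing-value filter on toy data (the notebook applies it to both the training and testing sets and drops the union of the flagged columns):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'mostly_there': [1.0, 2.0, np.nan, 4.0, 5.0],
                      'mostly_missing': [np.nan, np.nan, np.nan, np.nan, 5.0]})

# Fraction of missing values in each column.
missing_frac = train.isnull().mean()

# Drop any column with more than 75% missing values.
to_drop = missing_frac[missing_frac > 0.75].index.tolist()
train = train.drop(columns=to_drop)
print('Dropped:', to_drop)
```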
This threshold is not based on any theory or rule of thumb, rather it’s based on trying several options and seeing which worked best in practice. The most important point to remember when making these choices is that they don’t have to be made once and then forgotten. They can be revisited again later if the model is not performing as well as expected. Just make sure to record the steps you took and the performance metrics so you can see which works best!
Dropping columns with more than 75% missing values removes 19 columns from the data, leaving us with 863 features.

Feature Selection Using Feature Importances

The last method we will use to select features is based on the results from a machine learning model. With decision tree based classifiers, such as ensembles of decision trees (random forests, extra trees, gradient boosting machines), we can extract and use a metric called the feature importances.
The technical details are complicated (they have to do with the reduction in impurity from including the feature in the model), but we can use the relative importances to determine which features are the most helpful to a model. We can also use the feature importances to identify and remove the least helpful features to the model, including any with 0 importance.
To find the feature importances, we will use a gradient boosting machine (GBM) from the LightGBM library. The model is trained twice using early stopping, and the feature importances are averaged across the two training runs to reduce the variance.
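The notebook uses LightGBM; as a stand-in that runs without extra dependencies, here is the same averaging idea sketched with scikit-learn’s GradientBoostingClassifier and its built-in early stopping (synthetic data, and the hyperparameters are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the engineered training set.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)

# Train twice with early stopping and average the importances across runs.
importances = np.zeros(X.shape[1])
for seed in (0, 1):
    model = GradientBoostingClassifier(n_estimators=200,
                                       n_iter_no_change=10,       # early stopping
                                       validation_fraction=0.15,
                                       random_state=seed)
    model.fit(X, y)
    importances += model.feature_importances_ / 2

# Features never used for a split get exactly 0 importance.
zero_importance = np.where(importances == 0)[0]
print(f'{len(zero_importance)} features with 0.0 importance')
```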
Running this on the features identifies 308 features with 0.0 importance.
Removing features with 0 importance is a pretty safe choice because these are features that are literally never used for splitting a node in any of the decision trees. Therefore, removing these features will have no impact on the model results (at least for this particular model).
This isn’t necessary for feature selection, but because we have the feature importances, we can see which are the most relevant. To try and get an idea of what the model considers to make a prediction, we can visualize the top 15 most important features:
Top 15 most important features
We see that a number of the features we built made it into the top 15, which should give us some confidence that all our hard work was worthwhile! One of our features even made it into the top 5. This feature, client_installments_AMT_PAYMENT_min_sum, represents the sum of the minimum installment payment for each client of their previous loans at Home Credit. That is, for each client, it is the sum of all the minimum payments they made on each of their previous loans.
The feature importances don’t tell us whether a lower value of this variable corresponds to lower rates of default; they only let us know that this feature is useful for making splits in decision tree nodes. Feature importances are useful, but they do not offer a completely clear interpretation of the model!

After removing the 0 importance features, we have 536 features and another choice to make. If we think we still have too many features, we can start removing features that have a minimal amount of importance. In this case, I continued with feature selection because I wanted to test models besides the GBM that do not do as well with a large number of features.
The final feature selection step we do is to retain only the features needed to account for 95% of the importance. According to the gradient boosting machine, 342 features are enough to cover 95% of the importance. The following plot shows the cumulative importance vs the number of features.
Cumulative feature importance from the gradient boosting machine
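The selection itself is just a cumulative sum over the sorted importances; a sketch with made-up importance values (normalized to sum to 1, as the GBM reports them):

```python
import numpy as np
import pandas as pd

# Hypothetical importances from the GBM, for illustration only.
fi = pd.DataFrame({'feature': ['f1', 'f2', 'f3', 'f4', 'f5'],
                   'importance': [0.6, 0.3, 0.06, 0.03, 0.01]})

# Sort by importance and accumulate.
fi = fi.sort_values('importance', ascending=False).reset_index(drop=True)
fi['cumulative'] = fi['importance'].cumsum()

# Keep features up to and including the one that crosses 95% cumulative importance.
n_keep = int(np.argmax(fi['cumulative'].values >= 0.95)) + 1
keep = fi['feature'].iloc[:n_keep].tolist()
print(keep)
```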
There are a number of other dimensionality reduction techniques we can use, such as principal components analysis (PCA). This method is effective at reducing the number of dimensions, but it also transforms the features to a lower-dimension feature space where they have no physical representation, meaning that PCA features cannot be interpreted. Moreover, PCA assumes the data is normally distributed, which might not be a valid assumption for human-generated data. In the notebook I show how to use PCA, but don’t actually apply it to the data.
We can, however, use PCA for visualizations. If we graph the first two principal components colored by the value of the TARGET, we get the following image:
First two principal components of the data
The two classes are not cleanly separated with only two principal components and clearly we need more than two features to identify customers who will repay a loan versus those who will not.
Before moving on, we should record the feature selection steps we took so we remember them for future use:
  1. Remove collinear variables with a correlation coefficient greater than 0.9: 583 features removed
  2. Remove columns with more than 75% missing values: 19 features removed
  3. Remove 0.0 importance features according to a GBM: 308 features removed
  4. Keep only features needed for 95% of feature importance: 193 features removed
The final dataset has 342 features.
If it seems like there are a few arbitrary choices made during feature selection, that’s because there were! At a later point, we might want to revisit some of these choices if we are not happy with our performance. Fortunately, because we wrote functions and documented our decisions, we can easily change some of the parameters and then reassess performance.

Model Selection

Generally, at this point in a machine learning problem, we would move on to evaluating a number of models. No model is better than all others at every task (the “no free lunch theorem”), and so we need to try out a range of models to determine which one to use. However, in recent years, one model has become increasingly successful for problems with medium-sized structured data: the gradient boosting machine. (There are a number of reasons why this model works so well, and for a comprehensive guide, this Master’s Thesis is a great read.)
Model selection is one area where I relied heavily on the work of others. As mentioned at the beginning of the post, prior to this competition, my go-to model was the random forest. Very early on in this competition though, it was clear from reading the notebooks of others that I would need to implement some version of the gradient boosting machine in order to compete. Nearly every submission at the top of the leaderboard on Kaggle uses some variation (or multiple versions) of the Gradient Boosting Machine. (Some of the libraries you might see used are LightGBM, CatBoost, and XGBoost.)
Over the past few weeks, I have read through a number of kernels (see here and here) and now feel pretty confident deploying the Gradient Boosting Machine using the LightGBM library (Scikit-Learn does have a GBM, but it’s not as efficient or as accurate as other libraries). Nonetheless, mostly for curiosity’s sake, I wanted to try several other methods to see just how much is gained from the GBM. The code for this testing can be found on Kaggle here.
This isn’t entirely a fair comparison because I was using mostly the default hyperparameters in Scikit-Learn, but it should give us a first approximation of the capabilities of several different models. Using the dataset after applying all of the feature engineering and the feature selection, below are the modeling results with the public leaderboard scores. All of the models except for the LightGBM are built in Scikit-Learn:
  • Logistic Regression = 0.768
  • Random Forest with 1000 trees = 0.708
  • Extra Trees with 1000 trees = 0.725
  • Gradient Boosting Machine in Scikit-Learn with 1000 trees = 0.761
  • Gradient Boosting Machine in LightGBM with 1000 trees = 0.779
  • Average of all Models = 0.771
It turns out everyone else was right: the gradient boosting machine is the way to go. It returns the best performance out of the box and has a number of hyperparameters that we can adjust for even better scores. That does not mean we should forget about other models, because sometimes adding the predictions of multiple models together (called ensembling) can perform better than a single model by itself. In fact, many winners of Kaggle competitions used some form of ensembling in their final models.

We didn’t spend too much time here on the models, but that is where our focus will shift in the next notebooks and articles. Next we can work on optimizing the best model, the gradient boosting machine, using hyperparameter optimization. We may also look at averaging models together or even stacking multiple models to make predictions. We might even go back and redo feature engineering! The most important points are that we need to keep experimenting to find what works best, and we can read what others have done to try and build on their work.

Conclusions

Important character traits of being a data scientist are curiosity and admitting you don’t know everything! From my place on the leaderboard, I clearly don’t know the best approach to this problem, but I’m willing to keep trying different things and learn from others. Kaggle competitions are just toy problems, but that doesn’t prevent us from using them to learn and practice concepts to apply to real projects.
In this article we covered a number of important machine learning topics:
  1. Using feature engineering to construct new features from multiple related tables of information
  2. Applying feature selection to remove irrelevant features
  3. Evaluating several machine learning models for applicability to the task
After going through all this work, we were able to improve our leaderboard score from 0.678 to 0.779, in the process moving up over 1,000 spots on the leaderboard. Next, our focus will shift to optimizing our selected algorithm, but we also won’t hesitate to revisit feature engineering/selection.
If you want to stay up-to-date on my machine learning progress, you can check out my work on Kaggle: the notebooks are coming a little faster than the articles at this point! Feel free to get started on Kaggle using these notebooks and start contributing to the community. I’ll be using this Kaggle competition to explore a few interesting machine learning ideas such as Automated Feature Engineering and Bayesian Hyperparameter Optimization. I plan on learning as much from this competition as possible, and I’m looking forward to exploring and sharing these new techniques!