Friday, January 16, 2015

How to get started with the Data Science Bowl

http://blog.dominodatalab.com/how-to-get-started-with-the-data-science-bowl/

I am thrilled to share a Domino project we’ve created with starter code in R and Python for participating in the Data Science Bowl. Our starter project can give you a jump start in the competition by letting you train your models on powerful hardware and run multiple experiments in parallel, all while keeping track of your results.

Introduction

The Data Science Bowl is a Kaggle competition to classify images of plankton, with $175,000 in prize money and an opportunity to help improve the health of our oceans.
Domino is a platform that lets you build and deploy your models faster, using R, Python, and other languages. To help Data Science Bowl competitors, we have packaged some sample code into a Domino project that you can easily fork and use for your own work.
This post describes how our sample project can help you compete in the Bowl or tackle other open-ended machine learning projects. First, we give an overview of the code we’ve packaged up. Then we describe three capabilities Domino offers: easily scalable infrastructure, a powerful experimentation workflow, and a way to turn your models into self-service web forms.

Contents

  1. Three starter scripts you can use: an IPython Notebook for interactive work, a Python script for long-running training, and an R script for long-running training.
  2. Scalable infrastructure and parallelism to train models faster.
  3. Experimenting in parallel while tracking your work so you can iterate on your models faster.
  4. Building a self-service web diagnostic tool to test the trained model(s).
  5. How to fork our project and use it to jump-start your own work.


R & Python starter scripts

IPython Notebook

We took Aaron Sander’s fantastic tutorial and turned it into an actual IPython Notebook. You can download the full notebook from our Domino project or view a calculated, rendered version.

Python batch script

Next, we extracted the key training parts of Aaron’s tutorial and turned them into a batch script. Most of the code is the same as what’s in the IPython Notebook, but we excluded the diagnostic code for visualizing sample images.
The result is train.py. You can see the output of running this code here.

R batch script

For an R example, we used Jeff Hebert’s PlanktonClassification project. In our Domino project, you can find this code in train.R or see the results from running it here.
As I’ll describe more below, there’s a separate parallel version of this R code in train_parallel.R.


Train faster


Domino lets you train your models much faster by scaling up your hardware with a single click. For example, you can use 8-, 16-, or even 32-core machines. To take advantage of this, we needed to generalize some of the code to better utilize multiple cores.
As you can see from the different experiments we ran, this gave us significant speedups. For example:
  • The Python code took 50 minutes on a single-core machine; our parallelized version took 6.5 minutes on a 32-core machine.
  • The R code took 14 minutes on a single-core machine; our parallelized version took 4 minutes on a 32-core machine.

Python

Both in the IPython Notebook and in the train.py batch script, we modified the calls that actually train the RF classifier. The original code used n_jobs=3, which would use three cores. We changed this to n_jobs=-1, which uses all the cores on the machine.
The original, non-parallel code:

# Imports this snippet assumes (scikit-learn's pre-0.18 module layout)
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.metrics import classification_report

kf = KFold(y, n_folds=5)
y_pred = y * 0
for train, test in kf:
    X_train, X_test, y_train, y_test = X[train,:], X[test,:], y[train], y[test]
    clf = RF(n_estimators=100, n_jobs=3)  # n_jobs=3: train on three cores
    clf.fit(X_train, y_train)
    y_pred[test] = clf.predict(X_test)
print classification_report(y, y_pred, target_names=namesClasses)
Our parallel version:

kf = KFold(y, n_folds=5)
y_pred = y * 0
for train, test in kf:
    X_train, X_test, y_train, y_test = X[train,:], X[test,:], y[train], y[test]
    clf = RF(n_estimators=100, n_jobs=-1)  # n_jobs=-1: use every available core
    clf.fit(X_train, y_train)
    y_pred[test] = clf.predict(X_test)
print classification_report(y, y_pred, target_names=namesClasses)

R

There are two places in the R code that benefited from parallelism.
First, training the random forest classifier. We use the foreach package with the doParallel backend to train parts of the forest in parallel on each core, then merge them into a single forest with randomForest's combine function. It looks like a lot more code, but most of it is boilerplate for loading and initializing the parallel libraries.
The original, non-parallel code:
plankton_model <- randomForest(y = y_dat, x = x_dat)  
Our parallel version:

library(foreach)
library(doParallel)
library(parallel)

# Register a parallel backend with one worker per core
numCores <- detectCores()
registerDoParallel(cores = numCores)

# Split the total tree count (num_trees, defined earlier in the script) across
# the cores, grow a sub-forest on each, and merge them with randomForest's combine
trees_per_core <- floor(num_trees / numCores)
plankton_model <- foreach(num_trees = rep(trees_per_core, numCores),
                          .combine = combine, .multicombine = TRUE,
                          .packages = 'randomForest') %dopar% {
    randomForest(y = y_dat, x = x_dat, ntree = num_trees)
}
A second part of the R code is also time-consuming and easily parallelized: processing the test images to extract their features before generating test statistics. We use a parallel for loop to process the images across all our cores.
The original, non-parallel code:

test_data <- data.frame(image = rep("a", test_cnt), length = 0, width = 0,
                        density = 0, ratio = 0, stringsAsFactors = FALSE)
idx <- 1
# Read and process each image
for (fileID in test_file_list) {
    working_file <- paste(test_data_dir, "/", fileID, sep = "")
    working_image <- readJPEG(working_file)

    # Calculate model statistics
    working_stats <- extract_stats(working_image)
    working_summary <- array(c(fileID, working_stats))
    test_data[idx, ] <- working_summary
    idx <- idx + 1
    if (idx %% 10000 == 0) cat('Finished processing', idx, 'of', test_cnt, 'test images', '\n')
}
Our parallel version:

# Assumes the parallel backend registered above is still available
names_placeholder <- data.frame(image = rep("a", test_cnt), length = 0, width = 0,
                                density = 0, ratio = 0, stringsAsFactors = FALSE)

# Read and process each image; foreach collects the value of the last
# expression in the block, giving one summary per file
working_summaries <- foreach(fileID = test_file_list, .packages = 'jpeg') %dopar% {
    working_file <- paste(test_data_dir, "/", fileID, sep = "")
    working_image <- readJPEG(working_file)

    # Calculate model statistics
    working_stats <- extract_stats(working_image)
    array(c(fileID, working_stats))
}

library(plyr)
# Flatten the list of per-image summaries into a data frame
test_data <- ldply(working_summaries, .fun = function(x) x, .parallel = TRUE)
# A bit of a hack: reuse the column names from the dummy frame defined above
colnames(test_data) <- colnames(names_placeholder)


Experiment & track results

Domino helps you develop your models faster by letting you experiment in parallel while your results are automatically tracked. Whenever you run your code, Domino keeps a record of the run and of the results it produced, so you can track your progress and reproduce past work whenever you want.
For example, since our R code saves a submission.csv file when it runs, we get an automatic record of every submission we generate. If we need to get back to an old one, we can just find the corresponding run and view its results, which include a copy of the submission file.
Each run that you start on Domino also gets its own machine (of whatever hardware type you selected), so you can try different techniques or parameters in parallel.



Build self-service tools

Have you ever been interrupted by non-technical folks who ask you to run things for them because they can’t use your scripts on their own? We used Domino’s Launchers feature to build a self-service web form to classify different plankton images. Here’s how it works:
  1. Anyone can visit our project and go to the Launchers section, where they’ll find a “Classify plankton image” launcher. This will pop up a form that lets you upload a file from your computer.

  2. When you select a file and click “Run”, Domino passes your image to a classification script (which uses the RF model trained by the Python code) to predict the class of plankton in the image. Classification takes just a second, and you’ll see the results when it finishes, including a diagnostic image and a printout of the predicted class.

Try it yourself

  1. Visit the launcher
  2. Upload an image. If you need an example, download this one to your computer

Implementation

To implement this, we made some additional modifications to the Python training script. Specifically, when the training task finishes, we pickle the model (and the class names) so we can load them back later:
joblib.dump(clf, 'dump/classifier.pkl')  
joblib.dump(namesClasses, 'dump/namesClasses.pkl')  
Then we created a separate classify.py script that loads the pickled files and makes a prediction with them. The script also generates a diagnostic image, but the essence of it is this:
import sys
from sklearn.externals import joblib  # assumed import: the joblib bundled with scikit-learn

file_name = sys.argv[1]  # image path passed on the command line
clf = joblib.load('dump/classifier.pkl')
namesClasses = joblib.load('dump/namesClasses.pkl')

predictedClassIndex = clf.predict(image_to_features(file_name)).astype(int)
predictedClassName = namesClasses[predictedClassIndex[0]]

print "most likely class is: " + predictedClassName
Note that our classify script expects an image file name to be passed at the command line. This lets us easily build a Launcher that exposes a web form around the script.
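One thing the snippet glosses over is image_to_features, which lives in classify.py and must reproduce exactly the same features that train.py used. As a rough, hypothetical sketch only, assuming the features are simply the pixel values of the image resized to a fixed size, it might look something like this:

from skimage.io import imread
from skimage.transform import resize

def image_to_features(file_name, max_pixel=25):
    # Hypothetical sketch: read the image as grayscale, resize it to the
    # same fixed dimensions used during training, and flatten the pixels
    # into a single feature row for clf.predict()
    image = imread(file_name, as_grey=True)
    image = resize(image, (max_pixel, max_pixel))
    return image.ravel().reshape(1, -1)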



Getting started

  1. Sign up for a Domino account and install our command-line tool
  2. Fork the project by clicking the button in the left area of the project.
  3. Clone your new project by running domino get {your_username}/plankton
  4. Run code, e.g., domino run train.py. Or use the web interface to start an IPython Notebook session and open the starter notebook. See our notebook documentation if you need more help.

Implementation notes

  • Our project contains the zipped data sets, but it explicitly ignores the unzipped contents (you can see this inside the .dominoignore file). Because Domino tracks changes whenever you run your code, having a huge number of files (160,000 images, in this case) can slow it down. To speed things up, we store the zip files and let the code unzip them before running; a minimal sketch of this step follows the list. Unzipping takes very little time, so this doesn’t hurt overall performance.
  • In the Python code, scikit-learn uses joblib under the hood to parallelize its random forest training task. joblib, in turn, defaults to using /dev/shm to store pickled data. On Domino’s machines, /dev/shm may not have enough space for these training sets, so we set an environment variable in our project’s settings that tells joblib to use /tmp, which has plenty of space (see the second sketch below).
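For reference, the unzip step mentioned in the first note can be as simple as this sketch (the archive names are placeholders, not the project’s actual file names):

import zipfile

# Unzip the bundled data archives at the start of a run; the names
# 'train.zip' and 'test.zip' are hypothetical placeholders
for archive in ('train.zip', 'test.zip'):
    with zipfile.ZipFile(archive) as zf:
        zf.extractall('.')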
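And here is a minimal sketch of the joblib workaround from the second note, assuming joblib’s standard JOBLIB_TEMP_FOLDER environment variable; in our project this is set through the project settings rather than in code:

import os

# Tell joblib to write its temporary pickled data under /tmp instead of /dev/shm
os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'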
