Alright, that title is probably a tiny bit misleading. There are two minor corrections I should make.
It was a DrivenData competition, not Kaggle; and
I didn’t technically win.
Actually, this is a story about how I lost, won, lost again, thought I
finally won, lost one more time, and then redeemed myself. I imagine
this is what most data science competitions are like. This was my first.
TL;DR Version
4th place out of 535 teams.
Introduction to the Problem
I suppose I should start from the beginning. Once I discovered the
competition, I immediately sat down at my computer with Montell Jordan’s
‘This Is How We Do It’ playing in my head. How wrong I was.
The goal of the competition was to predict the number of Boston
restaurant health code violations based on Yelp review data. There were
three kinds of violations that you had to predict for. The lowest, level
one violations, were far more numerous than the other two types.
Essentially, level two was based on whether the restaurant had already
been cited for the same violation before. Level three was based on
whether the CDC was going to have to call the CDC because of a zombie
outbreak.
You also end up having to predict separately for each restaurant for
any point in time in the future. Often the competition required
predicting multiple inspection dates for a single restaurant. The
prediction format ended up looking like this. It looks pretty simple at
first glance, but the results needed to be much more complex and
layered.
[Image: Prediction format]
Some SciKit-Learn estimators
can handle an array as the output (linear regression), but most cannot.
So you have to predict for each violation type separately. I took it a
step further and based the model on a different set of features for each
kind of violation.
The contest scoring was based on a weighted root mean squared error
(RMSE). I decided to use this as well as the accuracy for each violation
type for testing at home. I made two super rudimentary baselines to
judge my models against.
Predicting every restaurant would have zero violations gave
accuracy of 22/69/57% for violation levels 1/2/3, respectively. RMSE was
2.14. Not too shabby.
Predicting every restaurant would have the mean of each violation type. Accuracy was 10/71/22% and RMSE 1.21.
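As a rough illustration of how two baselines like these can be scored, here is a minimal sketch on made-up violation counts. The toy data, the helper names, and especially the per-level weights are assumptions for illustration only; the contest’s actual weighting is not given in this post.

```python
import math

# Toy ground truth: one row per inspection, columns are violation levels 1/2/3.
# These numbers are invented for illustration only.
actual = [
    [4, 0, 1],
    [0, 0, 0],
    [7, 1, 0],
    [2, 0, 0],
]

# Hypothetical per-level weights; the real contest weights are not stated here.
WEIGHTS = [1.0, 2.0, 5.0]


def weighted_rmse(actual, predicted, weights=WEIGHTS):
    """Square root of the weighted mean of squared errors over all cells."""
    num = 0.0
    den = 0.0
    for a_row, p_row in zip(actual, predicted):
        for a, p, w in zip(a_row, p_row, weights):
            num += w * (a - p) ** 2
            den += w
    return math.sqrt(num / den)


def accuracy_per_level(actual, predicted):
    """Share of exactly correct (rounded) predictions for each level."""
    n = len(actual)
    return [
        sum(round(p[j]) == a[j] for a, p in zip(actual, predicted)) / n
        for j in range(len(actual[0]))
    ]


# Baseline 1: every restaurant gets zero violations.
zeros = [[0, 0, 0] for _ in actual]

# Baseline 2: every restaurant gets the mean of each violation type.
means = [sum(row[j] for row in actual) / len(actual) for j in range(3)]
mean_pred = [means[:] for _ in actual]

print(accuracy_per_level(actual, zeros), weighted_rmse(actual, zeros))
print(accuracy_per_level(actual, mean_pred), weighted_rmse(actual, mean_pred))
```

Even on toy data the same pattern tends to show up: the mean baseline can lose on accuracy while winning on RMSE, because a fractional guess is almost never exactly right but is rarely far off.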
I made a few simple models in the beginning to get a feel for how the
competition and its data worked. It was amazing how often my initial
models scored worse than these baselines. It really drove home how
inspections have very little to do with how much customers hate the
restaurant.
Losing Before You Even Begin
I had about two weeks left in the competition before I lost.
That probably needs further explanation. I race bicycles when I’m not
staring at a computer screen. A lot of people imagine that means I’m
doing something like this.
[Image: Ugh, triathletes]
No, this is the type of racing I do.
[Image: Real racing]
Ugh, triathletes. Alright, so with two weeks left in the competition I
became tangled up in a pretty bad race crash that required surgery. I
could barely move, let alone think, while on the pain medication they
gave me, so I ended up lying in bed watching the end of the competition
tick closer and closer.
[Image: Inside hospital]
That’s me trying to remember what R^2 means.
The drugs were so strong that I accidentally escaped from the hospital.
[Image: Outside hospital]
How to Win When You Lose
About a week after my surgery I started feeling well enough to take
myself off the pain medicine so I could start coding again. The
competition was over, but I decided to try and finish my model anyway. I
gave myself two weeks to finish from that point. The same amount of
time I had left before I dropped out of the competition.
Yelp had given us access to more than just the review text for each
restaurant. They had given us a large amount of metadata related to all
of the relevant restaurants, reviews, and users. Some of my most
important features (according to SelectKBest, at least) turned out to be in
the metadata.
I started off by exploring the metadata. I find that looking at
graphical representations of data is much more helpful than looking at
the raw numbers. It’s just so much easier to visualize what’s going on
and to spot outliers.
Histograms are always useful for telling if you need to transform your data because the range of values is too large or skewed.
[Image: Post-transformation histogram]
However, if you care more about seeing what’s going on rather than
solving the actual problem then my favorite visualizations are
coefficient and correlation plots.
This is a coefficient plot showing how many more violations the
average restaurant will get just on account of the neighborhood it is
based in (with a 95% confidence interval).
My favorite plot from this series was this one showing violations based on the type of cuisine the restaurant served.
One glance at this and you know that you should definitely avoid eating at health markets in Boston.
Plots!
Plots!!
More plots!!
Plots are fun. Who doesn’t love a good plotting?
Feature Engineering
I began cleaning up my data by removing reviews that occurred after
each inspection date. I then set to work turning date information into
useful features. A few of the date-related features I created were:
A delta representing the amount of time that had passed between
each review and each inspection date. After all, those reviews from five
years ago probably aren’t so relevant anymore.
A delta representing the amount of time that had passed between
each inspection date and the previous inspection date for that
restaurant. Often a restaurant would get inspected again a week after a
particularly egregious inspection and magically all of its violations
would be corrected.
Decomposed inspection dates based on the theory that certain
violations are seasonal. For instance, I like to believe the rats only
come out in the Summer. This led to features like inspection_quarter and
inspection_dayofweek.
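A minimal sketch of these date features, using only the standard library. The names inspection_quarter and inspection_dayofweek come from the list above; review_delta, previous_inspection_delta, and the function itself are hypothetical names chosen for illustration.

```python
from datetime import date


def date_features(inspection_date, review_date, previous_inspection_date=None):
    """Build date-based features for one review/inspection pair."""
    features = {
        # days between the review and the inspection it is used to predict
        "review_delta": (inspection_date - review_date).days,
        # calendar decomposition of the inspection date
        "inspection_quarter": (inspection_date.month - 1) // 3 + 1,
        "inspection_dayofweek": inspection_date.weekday(),  # Monday == 0
    }
    if previous_inspection_date is not None:
        # days since the same restaurant was last inspected
        features["previous_inspection_delta"] = (
            inspection_date - previous_inspection_date
        ).days
    return features


example = date_features(
    inspection_date=date(2015, 7, 14),
    review_date=date(2015, 5, 1),
    previous_inspection_date=date(2015, 7, 7),
)
print(example)
```

A short previous_inspection_delta (a week, in this example) is the kind of value that would flag a follow-up inspection.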
I also ended up separating the address of each restaurant into two
features consisting of the street name and the zip code. In NYC there
are some streets where all the restaurants are just inherently
disgusting; I hoped the same would apply to Boston.
In the end, I needed a separate prediction for each restaurant for
different future inspection dates. On top of that I had different sets
of multiple reviews for each restaurant. I decided to create multiple
observations for each inspection consisting of each review for that
inspection. With this, I ended up multiplying everything across my
dataframe. I went from 30,000 observations to almost 2 million.
Text Processing
I can’t emphasize enough just how long it takes to process two
million reviews on your home computer. Between preprocessing, term-frequency
inverse-document-frequency (TFIDF), sentiment, and similarity vectors,
this was becoming a real drain on my system. It was taking almost 4.5
hours just for the preprocessing alone. I cut this down to just 18
minutes by taking advantage of the multiple cores in my computer with
Pool().
from multiprocessing import Pool

from textblob import TextBlob, Word

# Python 2 code; `stopwords` and `penn_to_wn` are defined elsewhere


def preprocess_pool(df, filename):
    # convert text to categories
    cats = df.review_text.astype('category').cat

    # use multiprocessing to further cut down time
    pool = Pool()
    temp = pool.map(combine_preprocess, cats.categories)
    pool.close()
    pool.join()

    # convert the numerical categorical representation back to the newly processed
    # string representation
    docs = []
    for i in cats.codes:
        docs.append(temp[i])
    df['preprocessed_review_text'] = docs

    # mmm, pickles
    df.to_pickle('pickle_jar/' + filename)
    return df


def combine_preprocess(text):
    b = TextBlob(unicode(text, 'utf8').strip())
    tags = b.tags
    tokens = map(preprocess, tags)
    tokens = filter(None, tokens)
    return ' '.join(tokens)


def preprocess(tagged):
    word = Word(tagged[0])
    if word.isalpha() and word not in stopwords:
        # convert the part of speech tags to the correct format
        tag = penn_to_wn(tagged[1])
        l = word.lemmatize(tag)
    else:
        l = ''
    return l.lower()
There is nothing more glorious than watching all of the cores on your computer spin up.
I performed the following when preprocessing each review:
Split each review into individual word tokens and made each lowercase
Removed stop words and anything that was numeric
Lemmatized each word
Lemmatizing normally assumes that you are giving it the noun
representation of each word. I went the extra step of getting the part
of speech for each word and passing that along as well so that I would
have more words lemmatized.
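The penn_to_wn helper called in the preprocessing code is not shown in this post. One plausible version is sketched below; it is an assumption about what such a helper does, not the original. It maps Penn Treebank part-of-speech tags (the kind a tagger emits) to the single-letter tags that WordNet-style lemmatizers expect.

```python
def penn_to_wn(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter.

    WordNet lemmatizers distinguish nouns ('n'), verbs ('v'),
    adjectives ('a') and adverbs ('r'); everything else falls back to noun.
    """
    if tag.startswith("JJ"):   # JJ, JJR, JJS
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "v"
    if tag.startswith("RB"):   # RB, RBR, RBS
        return "r"
    return "n"                 # NN, NNS, NNP, ... and any other tag


print([penn_to_wn(t) for t in ["NN", "VBD", "JJR", "RB", "CD"]])
```

Falling back to noun for unknown tags matches the default behavior described above, so a tag that cannot be mapped is simply lemmatized the way it would have been without the extra step.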
You may have noticed that I also converted my text from strings into a
categorical datatype. This way, when there were duplicate reviews, they
would be represented by a number and only need to be processed a single
time. I used this handy trick for most text-related feature conversion.
When I was making similarity vectors, this cut the processing time from
nine hours to one.
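The same idea can be shown without Pandas. This is a plain-Python sketch, with hypothetical function names, of what the categories and codes are doing: the expensive function runs once per distinct text, and integer codes map the results back onto every row.

```python
def process_unique(texts, func):
    """Apply func once per distinct text, then map results back to every row."""
    categories = sorted(set(texts))                   # distinct values
    index = {text: i for i, text in enumerate(categories)}
    codes = [index[text] for text in texts]           # one integer code per row
    processed = [func(text) for text in categories]   # expensive step, once each
    return [processed[code] for code in codes]


calls = []


def slow_clean(text):
    calls.append(text)          # record how often the expensive step runs
    return text.lower().strip()


reviews = ["Great tacos ", "Rats!", "Great tacos ", "Rats!", "Rats!"]
cleaned = process_unique(reviews, slow_clean)
print(cleaned, len(calls))
```

Five rows go in, but slow_clean only runs twice. The savings scale with how many duplicates the data contains, which is a lot once every review has been repeated for every inspection.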
I should probably explain what similarity vectors are. I created a
vector representation of how many times a word in a review was similar
to a specified keyword. Each numerical representation in the vector was a
measure of how similar each word was to the keyword. This measure was
created with the magical aid of Gensim and the word2vec algorithm.
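To make the idea concrete, here is a toy sketch that uses cosine similarity on invented three-dimensional vectors. A real setup would look the vectors up in a trained word2vec model; the tiny dictionary and the function names here are assumptions for illustration.

```python
import math

# Made-up 3-dimensional "embeddings"; a real model would come from word2vec.
TOY_VECTORS = {
    "rotten": [0.9, 0.1, 0.0],
    "spoiled": [0.8, 0.2, 0.1],
    "moldy": [0.7, 0.3, 0.0],
    "friendly": [0.0, 0.1, 0.9],
    "waiter": [0.1, 0.2, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_vector(tokens, keyword, vectors=TOY_VECTORS):
    """One similarity score per known token, measured against the keyword."""
    key_vec = vectors[keyword]
    return [cosine(vectors[t], key_vec) for t in tokens if t in vectors]


review = ["friendly", "waiter", "spoiled", "moldy"]
scores = similarity_vector(review, "rotten")
print([round(s, 2) for s in scores])
```

Words like "spoiled" and "moldy" score close to 1 against the keyword "rotten", while "friendly" and "waiter" score near 0.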
Boston bases its health code violations on the 1999 Federal Food
Code. I read through the entire code and created a list of keywords that
I felt represented concepts that a reviewer would be more likely to
write about than the original legalese. I ended up with such lovely
keywords as:
raw
rotten
sneeze
gross
But also some more surprising ones:
lights
yellow
nails
jewelry
The Federal Food Code is really concerned with making sure a restaurant is bright enough to see yellow nails and jewelry.
The problem with this whole review–>violation concept and probably
one that also exists with my similarity vectors is that some of what is
in the food code is not readily observable by Yelp reviewers. There is
no way a customer will know whether the shellfish box in the back of the
walk-in freezer is labeled correctly. We hope that if a restaurant
violates observable codes then they will also violate non-observable
ones. But beyond that, I also combined these reviews with other
metadata-based features to try and cover those other violations.
The metadata was split between boolean and categorical data. The
categorical data had to first be converted to a numerical representation
in order for it to be useful. I went the next step of turning each
numerical representation into a vector of ones and zeros so that the
model wouldn’t start attaching order to the numerical representation. I
even ended up converting some numerical features to categorical under
the theory that they should also lose that ordering.
For instance, when looking at the review rating for a restaurant,
there is a value of either 1, 2, 3, 4, or 5. I wanted to remove the
ordering information because I was working under the assumption that
some five star reviews would be shills and some one star reviews would
be vindictive. This way, each review category is taken at face value
rather than as an increasing variable. It worked out like this:
1-star rating becomes [1, 0, 0, 0, 0]
2-star rating becomes [0, 1, 0, 0, 0]
3-star rating becomes [0, 0, 1, 0, 0]
4-star rating becomes [0, 0, 0, 1, 0]
5-star rating becomes [0, 0, 0, 0, 1]
And so on, and so on.
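In plain Python, that encoding is only a few lines. This is a sketch with a hypothetical helper name; libraries such as Pandas and scikit-learn provide their own encoders for the same job.

```python
def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == c else 0 for c in categories]


STAR_RATINGS = [1, 2, 3, 4, 5]
encoded = [one_hot(stars, STAR_RATINGS) for stars in [5, 1, 3]]
print(encoded)
```

Each position becomes its own column, so the model can learn a separate effect for every rating instead of assuming that five stars is simply "more" than four.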
Using my newly created features, I started seeing some pretty good accuracy scores of around 90%.
Losing All Over Again
Everybody knows you can’t test your model on the same data that you
fit it to. I had been using Scikit-Learn’s train_test_split function to
split my data into a train set and a test set. I declared what random
integer it should seed with so that I could compare results across
different models. When I finally started seeing some good scores, I
thought that I should cross-validate what I was seeing across different
train-test-split folds of the data. (The reason I hadn’t been doing this
from the beginning is because it takes so much time.) In essence, I
wanted to make sure that my scores would be good for different cuts of
the data. Not just at cut number 42.
I have to say when the first score of 22% accuracy popped up it was
pretty disheartening. Soon it was followed by 20%, and then 19%.
At this point, I had been working on this already-over competition
for nearly a month. My deadline was fast approaching and I needed some
product out of all this use of my time. So I threw my hands in the air
and began emergency work on a D3 visualization using a choropleth map of
Boston with the following features.
Each neighborhood in Boston would be shaded according to what the average number of violations was for that neighborhood.
You click the violation level to have the average and shading change accordingly.
There is a slider at the bottom allowing a user to select the year
and see how the neighborhoods’ scores and shading changed over time.
Mouse-over a neighborhood and the name pops up.
It was going to be pretty dope.
Don’t bother clicking. That’s just a mockup of what it was going to
look like when it was finished. The mockup was made in D3 though, so
that has to count for something, right? If you’d like to play around
with the functioning, yet not-functional, slider then you can browse the
code on bl.ocks.
Just Kidding, I Won
Fortunately(?), about mid-way through making this visualization, my
mind started drifting back to my original problem. You see, I have a
rough, late-nights-with-hands-thrown-in-the-air, history with any
implementation of cross-validation by Scikit-Learn. So I was already
suspicious/angry.
I originally discovered a bug where if you enable multiprocessing
support with any of SciKit-Learn’s cross-validation functions then it
would cause the kernel to hang and never finish. As I started looking
over my code, I noticed that cross_val_score doesn’t shuffle your data
by default while train_test_split does.
Once I was able to get shuffling working, the
accuracy of my model shot up to around 95%. My RMSE was enough to put me
at the top of the leaderboard and win the competition (if I hadn’t
already lost).
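For readers who want to see what shuffling changes, here is a library-free sketch of k-fold splitting with and without it. The helper name is hypothetical. In scikit-learn, the usual way to get the same effect is to pass a KFold splitter created with shuffle=True as the cv argument of cross_val_score.

```python
import random


def kfold_indices(n_samples, n_folds, shuffle=False, seed=None):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Without shuffling, each test fold is one contiguous block of rows;
# with shuffling, rows from every part of the data are mixed together.
plain = [test for _, test in kfold_indices(10, 5)]
mixed = [test for _, test in kfold_indices(10, 5, shuffle=True, seed=42)]
print(plain)
print(mixed)
```

If the rows are ordered (by restaurant or by date, say), the unshuffled folds test on blocks the model has never seen anything similar to, while the shuffled folds do not.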
Just Kidding, I Didn’t Win
I had a few days left before my self-imposed deadline, so I spent it
trying to further increase my accuracy. It was when I reached 99.9%
accuracy that I knew something was wrong.
My multiplied model included each restaurant’s ID in it. Normally,
this isn’t a problem, and I believe it is a valuable feature in this
particular model. But when you multiply everything over and go from 30,000
observations to 2 million, the chances of not having identical
restaurant IDs included in both your train and test set are pretty slim.
Combined with inspection date information, my random forest was noticing
the overlap, overfitting, and ending up with nearly 100% accuracy. Oh,
to dream.
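A small simulation shows how unavoidable that overlap is. The sizes below are scaled-down stand-ins (the real data went from roughly 30,000 rows to 2 million), and the 75/25 split is an assumption for illustration.

```python
import random

rng = random.Random(0)

N_RESTAURANT_ROWS = 300      # scaled-down stand-in for the original rows
REVIEWS_PER_ROW = 60         # each original row repeated once per review

# One row per (restaurant_id, review) pair, as in the multiplied dataframe.
rows = [(restaurant_id, review_no)
        for restaurant_id in range(N_RESTAURANT_ROWS)
        for review_no in range(REVIEWS_PER_ROW)]

# Row-wise random split, 75% train / 25% test.
rng.shuffle(rows)
cut = int(0.75 * len(rows))
train_ids = {restaurant_id for restaurant_id, _ in rows[:cut]}
test_ids = {restaurant_id for restaurant_id, _ in rows[cut:]}

leaked = test_ids & train_ids
print("test IDs also seen in training: %d of %d" % (len(leaked), len(test_ids)))
```

With dozens of near-identical rows per ID, a random row-wise split puts some copy of practically every ID on both sides, so the test set is no longer unseen data.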
Back to the Drawing Board
I needed a new method of dealing with the hierarchical aspect of the
problem. In the end I decided on somewhat of a makeshift solution.
Rather than have each review be a separate observation, I was going to
make each review a feature. Then each review-feature would be ordered
according to how close in time it was made to the inspection date.
Rather than do everything over I used the pivot feature in Pandas.
With that I was able to go from:
To:
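A toy version of that long-to-wide reshaping, written in plain Python rather than Pandas, might look like the following. The column names inspection_id and enumerated_review_delta are taken from the pivot code later in this post; the sentiment column and all of the values are invented.

```python
# Long format: one row per (inspection, review); values are invented.
long_rows = [
    {"inspection_id": 101, "enumerated_review_delta": 0, "sentiment": 0.8},
    {"inspection_id": 101, "enumerated_review_delta": 1, "sentiment": -0.2},
    {"inspection_id": 102, "enumerated_review_delta": 0, "sentiment": 0.1},
]


def pivot(rows, index, columns, values):
    """Turn long rows into {index_value: [value per column position]}."""
    n_columns = max(row[columns] for row in rows) + 1
    wide = {}
    for row in rows:
        # Missing reviews stay None, like the empty cells a real pivot leaves.
        slots = wide.setdefault(row[index], [None] * n_columns)
        slots[row[columns]] = row[values]
    return wide


wide = pivot(long_rows, "inspection_id", "enumerated_review_delta", "sentiment")
print(wide)
```

Each inspection ends up as a single observation, with review 0 (the closest in time) in the first position. Inspections with fewer reviews than the maximum are left with gaps that have to be filled somehow.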
I then took each review-feature-matrix and decomposed it into two
components using Factor Analysis. For my TFIDF matrix I did something a
little different. I decomposed it using Latent Semantic Analysis into
100 components. Technically, I would have probably gotten a better score
if I had left my TFIDF matrix as a raw sparse matrix, but combined with
all my other features it was just too slow an operation and I was out
of time.
The following is the code I wrote to test different review-based features and different decompositions.
from sklearn.decomposition import (TruncatedSVD, PCA, KernelPCA,
                                   DictionaryLearning, FactorAnalysis, FastICA)


def pivot_feature(df, feature, limit=None, decomp='lsi', decomp_features=2, fill='median'):
    # make the large dataframe faster to handle on pivot
    temp = df[['inspection_id', 'enumerated_review_delta'] + [feature]]

    # pivot so that each inspection id only has one observation with each review a feature
    # for that observation
    pivoted_feature = temp.pivot(index='inspection_id',
                                 columns='enumerated_review_delta')[feature]

    # pivoting creates a number of empty variables when they have less than the max
    # number of reviews
    if fill == 'median':
        fill_empties = lambda x: x.fillna(x.median())
    elif fill == 'mean':
        fill_empties = lambda x: x.fillna(x.mean())
    elif fill == 0:
        fill_empties = lambda x: x.fillna(0)
    elif fill == 'inter':
        fill_empties = lambda x: x.interpolate()
    elif fill is None:
        fill_empties = lambda x: x
    else:
        raise Exception
    pivoted_feature = pivoted_feature.apply(fill_empties, axis=1)

    if decomp == 'lsi':
        decomposition = TruncatedSVD(decomp_features)
    elif decomp == 'pca':
        decomposition = PCA(decomp_features, whiten=True)
    elif decomp == 'kpca':
        decomposition = KernelPCA(decomp_features)
    elif decomp == 'dict':
        decomposition = DictionaryLearning(decomp_features)
    elif decomp == 'factor':
        decomposition = FactorAnalysis(decomp_features)
    elif decomp == 'ica':
        decomposition = FastICA(decomp_features)
    elif decomp is None:
        pass
    else:
        raise Exception

    # the bare excepts fall back to the undecomposed pivot when no
    # decomposition was selected (or when fitting fails)
    if not limit:
        try:
            return decomposition.fit_transform(pivoted_feature)
        except:
            return pivoted_feature
    else:
        try:
            return decomposition.fit_transform(pivoted_feature[[i for i in range(limit)]])
        except:
            return pivoted_feature[[i for i in range(limit)]]
I also limited the reviews to those that had been created less than a
year before the inspection date and created a few new features.
A Trustworthiness Index for the writer of each review. It was
based in part on how objective the writing was, as well as how long they
had been a Yelp member and how many reviews they had written.
An Anger Index based on how often a user scored a restaurant negatively compared to how frequently they made a review.
Similarity vectors v2. I recycled my original similarity vectors
into a single number representing how many words in each review achieved
a similarity score greater than a specified amount.
I spent a lot of time trying to hand pick which features to use. In
the end, I was out of time and just dumped everything into
SciKit-Learn’s Recursive Feature Elimination CV function and let it
prune the features for me.
I also realized that I had been wasting my time focusing on accuracy
and reality when the competition solely cared about RMSE. Besides, even
if this had been a real client, the City of Boston doesn’t need to know
exactly how many violations a restaurant will receive. They just need to
know that this restaurant will receive a lot, this restaurant will
receive a little, and this restaurant will receive none so they can
focus their limited resources.
With a single model based on a RandomForestRegressor, I achieved an
RMSE of 1.017. Enough to put me in 24th place in the competition.
But Wait, There’s More!
Let’s talk about ensembling. Ensembling is seemingly out of place in
this world of neural networks and deep learning. It is an amazingly
simple concept that can dramatically improve your model. At its
foundation, ensembling is just taking the average of several
predictions. The easiest way of doing this is making several iterations
of a model (that has some inherent randomness) and taking the average of
each of those models. If all the models agree on a single prediction
then it is a good prediction. If the models disagree on a single
prediction then some of the models are getting it wrong. Averaging
doesn’t do anything when the models agree. When they disagree, the
prediction is moved closer to the models that have come to an identical
conclusion over those that were confused. I found Pearson’s Correlation a
great test for figuring out whether two iterations had enough
variability.
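Here is a toy demonstration of why averaging helps, under a generous assumption: each iteration is the truth plus its own independent random error. Real model iterations are correlated, so the gain is smaller in practice. The data and function names are invented for illustration.

```python
import math
import random


def mse(actual, predicted):
    """Mean squared error between two equal-length lists."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
truth = [rng.randint(0, 10) for _ in range(200)]

# Ten "iterations" of a noisy model: the truth plus independent random error.
iterations = [[t + rng.gauss(0, 2) for t in truth] for _ in range(10)]

# Simple ensemble: average the ten predictions for each observation.
ensemble = [sum(preds) / len(preds) for preds in zip(*iterations)]

individual = [mse(truth, it) for it in iterations]
print("best single MSE: %.3f" % min(individual))
print("ensembled MSE:   %.3f" % mse(truth, ensemble))
print("correlation of iterations 0 and 1: %.3f" % pearson(iterations[0], iterations[1]))
```

The lower the correlation between two iterations, the more their errors differ, and the more there is for averaging to cancel out. Two iterations with a correlation near 1 add almost nothing to each other.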
I was able to get an even better ensemble by using weighted
averaging. I ranked each model iteration by performance and weighted it
accordingly in the averaging. An ExtraTreesClassifier was a great
performer in this regard. It scores an RMSE of 1.145 as an individual
model. But in a weighted iterative ensemble the RMSE drops to 0.965
(lower is better), moving me up to 16th place in the competition.
Averaging magic!
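One simple way to turn a performance ranking into weights is sketched below. The exact weighting scheme is an assumption for illustration (the best model gets weight n, the worst gets weight 1), and the predictions and scores are made up.

```python
def rank_weights(errors):
    """Give the lowest-error model the largest weight (n), the worst the smallest (1)."""
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    weights = [0] * len(errors)
    for rank, model_index in enumerate(order, start=1):
        weights[model_index] = rank
    return weights


def weighted_average(predictions, weights):
    """Combine per-model prediction lists into one weighted prediction list."""
    total = float(sum(weights))
    return [sum(w * p for w, p in zip(weights, preds)) / total
            for preds in zip(*predictions)]


# Three hypothetical models predicting for two restaurants.
model_predictions = [[4.0, 0.0], [6.0, 1.0], [5.0, 2.0]]
validation_mse = [1.2, 3.0, 2.1]          # invented scores, lower is better

weights = rank_weights(validation_mse)
blended = weighted_average(model_predictions, weights)
print(weights, blended)
```

The blended prediction is pulled toward the model with the lowest error without throwing away what the weaker ones contribute.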
The following is the output of my iterative ensembling function.
score_lvl_1
iteration 0 MSE of 16.1733791749
iteration 1 MSE of 17.4862475442
iteration 2 MSE of 16.231827112
iteration 3 MSE of 16.2151277014
iteration 4 MSE of 16.7282252783
iteration 5 MSE of 15.9885396202
iteration 6 MSE of 16.6046168959
iteration 7 MSE of 16.7378847413
iteration 8 MSE of 18.0361820563
iteration 9 MSE of 16.8425016372
ensembled MSE of 12.5898952194
weighted ensembled MSE of 12.5797553676
score_lvl_2
iteration 0 MSE of 0.418631303209
iteration 1 MSE of 0.38785199738
iteration 2 MSE of 0.37098886706
iteration 3 MSE of 0.391617550753
iteration 4 MSE of 0.3701702685
iteration 5 MSE of 0.40635232482
iteration 6 MSE of 0.362311722331
iteration 7 MSE of 0.378683693517
iteration 8 MSE of 0.397838899804
iteration 9 MSE of 0.362966601179
ensembled MSE of 0.309328749181
weighted ensembled MSE of 0.30347545828
score_lvl_3
iteration 0 MSE of 2.37180746562
iteration 1 MSE of 2.1491486575
iteration 2 MSE of 2.4377865095
iteration 3 MSE of 2.13555992141
iteration 4 MSE of 2.04436804191
iteration 5 MSE of 2.06368696791
iteration 6 MSE of 2.46201702685
iteration 7 MSE of 2.46037982973
iteration 8 MSE of 2.16093647675
iteration 9 MSE of 2.20383104126
ensembled MSE of 1.8542632613
weighted ensembled MSE of 1.80534841178
ensembled contest metric of 0.964938365219
So Many Trees; I Die
Taking this concept one step further, I applied it to multiple
estimators rather than iterations of a single estimator. I included
several variations of the ExtraTreesClassifier since it was such a bully
in the iterative ensemble.
Yes, those are a lot of classifiers for what should be a numerical
model. I made the decision early on to treat this as a multi-class
classification problem. Yes, the number of violations was ordered, so
numerical/regression was the first thing that came to my mind. But as I
explored the data, I realized that the number of violations was
finite. They couldn’t be less than zero and couldn’t be more than a
maximum. Even if you break every single rule, the number of violations
you could get was capped. Level two violations, being based on repeat
offenses, had very few classes.
This played out in the results. Individual regressors beat out the
classifiers, but when it came to ensembling, classifiers always came out
on top. Regressors are just there to bring up some of the variability.
With this multi-estimator ensemble, my RMSE moved from 0.965 to 0.827.
Now in 7th place, I created a new dataset consisting of the
predictions from the multi-estimator model as features. I fit this to a
held-out response using a GradientBoostingRegressor estimator
(LinearRegression also works well). Now my RMSE is as low as 0.725 and
I’m in 4th place.
Linear models based on eight numerical predictors and averaging. Ha! So simple.
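The stacking step can be illustrated with a deliberately tiny version: two base models instead of eight, and a hand-rolled least-squares fit instead of GradientBoostingRegressor or LinearRegression. All numbers and names below are invented for illustration.

```python
def fit_two_model_stack(p1, p2, y):
    """Least-squares weights (w1, w2) so that w1*p1 + w2*p2 approximates y.

    Solves the 2x2 normal equations directly; no intercept, for brevity.
    """
    s11 = sum(a * a for a in p1)
    s22 = sum(b * b for b in p2)
    s12 = sum(a * b for a, b in zip(p1, p2))
    s1y = sum(a * t for a, t in zip(p1, y))
    s2y = sum(b * t for b, t in zip(p2, y))
    det = s11 * s22 - s12 * s12
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


def mse(actual, predicted):
    return sum((t - p) ** 2 for t, p in zip(actual, predicted)) / len(actual)


# Held-out responses and two base models' predictions (all invented).
held_out_y = [0.0, 2.0, 5.0, 1.0, 8.0]
model_a = [1.0, 2.0, 4.0, 2.0, 6.0]     # tends to under-predict large counts
model_b = [0.0, 3.0, 7.0, 0.0, 9.0]     # tends to over-react

w1, w2 = fit_two_model_stack(model_a, model_b, held_out_y)
stacked = [w1 * a + w2 * b for a, b in zip(model_a, model_b)]

print("weights:", round(w1, 3), round(w2, 3))
print("MSE a / b / stacked:",
      mse(held_out_y, model_a), mse(held_out_y, model_b), mse(held_out_y, stacked))
```

Fitting the second-level model on held-out responses matters: if the base models have already seen those rows, the stacker learns to trust whichever model overfit the most.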
Fin
With more time I would hope to improve this by running a gridsearchCV
to optimize hyper-parameters, as well as using pymc3 to build a truly
hierarchical model.
If you’d like to view the enormous amount of code that didn’t end up working, my github repository is here.
4th place out of 535 teams. Just a step shy of the podium, but I’ll take it. Not bad for my first time out of the gate.
This is a hands-on tutorial on deep learning. Step by step, we'll go
about building a solution for the Facial Keypoint Detection Kaggle
challenge.
The tutorial introduces Lasagne, a new library for building
neural networks with Python and Theano. We'll use Lasagne to
implement a couple of network architectures, talk about data
augmentation, dropout, the importance of momentum, and pre-training.
Some of these methods will help us improve our results quite a bit.
I'll assume that you already know a fair bit about neural nets.
That's because we won't talk about much of the background of how
neural nets work; there's a few good books and videos for that,
like the Neural Networks and Deep Learning online book. Alec Radford's talk
Deep Learning with Python's Theano library is a great quick
introduction. Make sure you also check out Andrej Karpathy's
mind-blowing ConvNetJS Browser Demos.
You don't need to type the code and execute it yourself if you just
want to follow along. But here's the installation instructions for
those who have access to a CUDA-capable GPU and want to run the
experiments themselves.
I assume you have the CUDA toolkit, Python 2.7.x, numpy, pandas,
matplotlib, and scikit-learn installed. Lasagne is still waiting for
its first proper release, so for now we'll install it straight from
Github. To install Lasagne
and all the remaining dependencies, run this command:
(Note that for the sake of brevity, I'm not including commands to create a
virtualenv and activate it. But you should.)
If everything worked well, you should be able to find the
src/lasagne/examples/ directory in your virtualenv and run the
MNIST example. This
is sort of the "Hello, world" of neural nets. There's ten classes,
one for each digit between 0 and 9, and the input is grayscale images
of handwritten digits of size 28x28.
cd src/lasagne/examples/
python mnist.py
This command will start printing out stuff after thirty seconds or so.
The reason it takes a while is that Lasagne uses Theano to do the
heavy lifting; Theano in turn is a "optimizing GPU-meta-programming
code generating array oriented optimizing math compiler in Python,"
and it will generate C code that needs to be compiled before training
can happen. Luckily, we have to pay the price for this overhead only
on the first run.
Once training starts, you'll see output like this:
Epoch 1 of 500
training loss: 1.352731
validation loss: 0.466565
validation accuracy: 87.70 %
Epoch 2 of 500
training loss: 0.591704
validation loss: 0.326680
validation accuracy: 90.64 %
Epoch 3 of 500
training loss: 0.464022
validation loss: 0.275699
validation accuracy: 91.98 %
...
If you let training run long enough, you'll notice that after about 75
epochs, it'll have reached a test accuracy of around 98%.
If you have a GPU, you'll want to configure Theano to use it. For
this, create a ~/.theanorc file in your home directory and write
into it the following:
The training dataset for the Facial Keypoint Detection challenge
consists of 7,049 96x96 gray-scale images. For each image, we're
supposed to learn to find the correct position (the x and y coordinates)
of 15 keypoints, such as left_eye_center,
right_eye_outer_corner, mouth_center_bottom_lip, and so on.
An example of one of the faces with three keypoints marked.
An interesting twist with the dataset is that for some of the
keypoints we only have about 2,000 labels, while other keypoints have
more than 7,000 labels available for training.
Let's write some Python code that loads the data from the CSV files
provided.
We'll write a function that can load both the training and the test
data. These two datasets differ in that the test data doesn't contain
the target values; it's the goal of the challenge to predict these.
Here's our load() function:
# file kfkd.py
import os

import numpy as np
from pandas.io.parsers import read_csv
from sklearn.utils import shuffle

FTRAIN = '~/data/kaggle-facial-keypoint-detection/training.csv'
FTEST = '~/data/kaggle-facial-keypoint-detection/test.csv'


def load(test=False, cols=None):
    """Loads data from FTEST if *test* is True, otherwise from FTRAIN.
    Pass a list of *cols* if you're only interested in a subset of the
    target columns.
    """
    fname = FTEST if test else FTRAIN
    df = read_csv(os.path.expanduser(fname))  # load pandas dataframe

    # The Image column has pixel values separated by space; convert
    # the values to numpy arrays:
    df['Image'] = df['Image'].apply(lambda im: np.fromstring(im, sep=' '))

    if cols:  # get a subset of columns
        df = df[list(cols) + ['Image']]

    print(df.count())  # prints the number of values for each column
    df = df.dropna()  # drop all rows that have missing values in them

    X = np.vstack(df['Image'].values) / 255.  # scale pixel values to [0, 1]
    X = X.astype(np.float32)

    if not test:  # only FTRAIN has any target columns
        y = df[df.columns[:-1]].values
        y = (y - 48) / 48  # scale target coordinates to [-1, 1]
        X, y = shuffle(X, y, random_state=42)  # shuffle train data
        y = y.astype(np.float32)
    else:
        y = None

    return X, y


X, y = load()
print("X.shape == {}; X.min == {:.3f}; X.max == {:.3f}".format(
    X.shape, X.min(), X.max()))
print("y.shape == {}; y.min == {:.3f}; y.max == {:.3f}".format(
    y.shape, y.min(), y.max()))
It's not necessary that you go through every single detail of this
function. But let's take a look at what the script above outputs:
First it's printing a list of all columns in the CSV file along with
the number of available values for each. So while we have an
Image for all rows in the training data, we only have 2,267 values
for mouth_right_corner_x and so on.
load() returns a tuple (X, y) where y is the target matrix.
y has shape n x m with n being the number of samples in the
dataset that have all m keypoints. Dropping all rows with missing
values is what this line does:
df = df.dropna()  # drop all rows that have missing values in them
The script's output y.shape == (2140, 30) tells us that there's
only 2,140 images in the dataset that have all 30 target values
present. Initially, we'll train with these 2,140 samples only. Which
leaves us with many more input dimensions (9,216) than samples; an
indicator that overfitting might become a problem. Let's see. Of
course it's a bad idea to throw away 70% of the training data just
like that, and we'll talk about this later on.
Another feature of the load() function is that it scales the
intensity values of the image pixels to be in the interval [0, 1],
instead of 0 to 255. The target values (x and y coordinates) are
scaled to [-1, 1]; before they were between 0 and 95.
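To make the two scalings concrete, here's a minimal sketch of them as standalone functions. The function names are made up for illustration, and unscale_coordinate is an addition that's not part of load(); it's the inverse we'd need to turn a prediction back into pixel coordinates.

```python
def scale_pixel(value):
    """Map a grayscale intensity from [0, 255] to [0, 1]."""
    return value / 255.0


def scale_coordinate(coord):
    """Map a keypoint coordinate on the 96x96 image to roughly [-1, 1]."""
    return (coord - 48) / 48.0


def unscale_coordinate(scaled):
    """Invert scale_coordinate to get back to pixel coordinates."""
    return scaled * 48 + 48


print(scale_pixel(255), scale_coordinate(0), scale_coordinate(95))
print(unscale_coordinate(scale_coordinate(66.0)))
```

Note that a coordinate of 95 maps to just under 1, so the scaled targets land in [-1, 1] without ever quite reaching the upper end.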
Now that we're done with the legwork of loading the data, let's use
Lasagne and create a neural net with a single hidden layer. We'll
start with the code:
# add to kfkd.py
from lasagne import layers
from lasagne.updates import nesterov_momentum
from nolearn.lasagne import NeuralNet

net1 = NeuralNet(
    layers=[  # three layers: one hidden layer
        ('input', layers.InputLayer),
        ('hidden', layers.DenseLayer),
        ('output', layers.DenseLayer),
        ],
    # layer parameters:
    input_shape=(None, 9216),  # 96x96 input pixels per batch
    hidden_num_units=100,  # number of units in hidden layer
    output_nonlinearity=None,  # output layer uses identity function
    output_num_units=30,  # 30 target values

    # optimization method:
    update=nesterov_momentum,
    update_learning_rate=0.01,
    update_momentum=0.9,

    regression=True,  # flag to indicate we're dealing with regression problem
    max_epochs=400,  # we want to train this many epochs
    verbose=1,
    )

X, y = load()
net1.fit(X, y)
We use quite a few parameters to initialize the NeuralNet. Let's
walk through them. First there's the three layers and their
parameters:
layers=[  # three layers: one hidden layer
    ('input', layers.InputLayer),
    ('hidden', layers.DenseLayer),
    ('output', layers.DenseLayer),
    ],
# layer parameters:
input_shape=(None, 9216),  # 96x96 input pixels per batch
hidden_num_units=100,  # number of units in hidden layer
output_nonlinearity=None,  # output layer uses identity function
output_num_units=30,  # 30 target values
Here we define the input layer, the hidden layer and the
output layer. In parameter layers, we name and specify the
type of each layer, and their order. Parameters input_shape,
hidden_num_units, output_nonlinearity, and
output_num_units are each parameters for specific layers; they
refer to the layer by their prefix, such that input_shape defines
the shape parameter of the input layer, hidden_num_units
defines the hidden layer's num_units and so on. (It may seem a
little odd that we have to specify the parameters like this, but the
upshot is it buys us better compatibility with scikit-learn's pipeline and parameter search
features.)
We set the first dimension of input_shape to None. This
translates to a variable batch size.
We set the output_nonlinearity to None explicitly. Thus, the
output units' activations become just a linear combination of the
activations in the hidden layer.
The default nonlinearity used by DenseLayer is the rectifier,
which is simply max(0, x). It's the most popular choice of
activation function these days. By not explicitly setting
hidden_nonlinearity, we're choosing the rectifier as the
activation function of our hidden layer.
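To make the roles of the two nonlinearities concrete, here is a minimal NumPy sketch of the forward pass that a net like net1 computes: a rectified hidden layer followed by a purely linear output layer. The weights are random placeholders, and the names relu and forward are illustrative only; they are not part of Lasagne or nolearn.

```python
import numpy as np


def relu(x):
    """Rectifier: element-wise max(0, x)."""
    return np.maximum(0, x)


def forward(X, W_hidden, b_hidden, W_out, b_out):
    """Forward pass: one rectified hidden layer, then a linear output layer."""
    hidden = relu(X.dot(W_hidden) + b_hidden)
    # With output_nonlinearity=None the output is just a linear
    # combination of the hidden activations.
    return hidden.dot(W_out) + b_out


rng = np.random.RandomState(0)
X_demo = rng.rand(4, 9216)              # a batch of 4 flattened 96x96 images
W_hidden = rng.randn(9216, 100) * 0.01  # placeholder weights, not trained
b_hidden = np.zeros(100)
W_out = rng.randn(100, 30) * 0.01
b_out = np.zeros(30)

y_demo = forward(X_demo, W_hidden, b_hidden, W_out, b_out)
print(y_demo.shape)
```

The batch dimension (4 here) is free to vary, which is what the None in input_shape expresses.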
The neural net's weights are initialized from a uniform distribution
with a cleverly chosen interval. That is, Lasagne figures out this
interval for us, using "Glorot-style" initialization.
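As a rough illustration of what "Glorot-style" means, the sketch below computes the interval using the standard Glorot uniform formula, limit = sqrt(6 / (fan_in + fan_out)). This assumes a gain of 1 and is written with the standard library only; the function names are made up for this example and are not Lasagne's API.

```python
import math
import random


def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the uniform interval in Glorot-style initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def glorot_uniform(fan_in, fan_out, seed=0):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    rng = random.Random(seed)
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


# Hidden layer of net1: 9216 inputs feeding 100 units.
limit = glorot_uniform_limit(9216, 100)
W_small = glorot_uniform(20, 10)  # a small matrix, just to show the sampling
print(round(limit, 4))
```

The more connections a layer has, the narrower the interval becomes, which keeps the scale of the activations roughly constant from layer to layer.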
There are a few more parameters. All parameters starting with
update parametrize the update function, or optimization method.
The update function will update the weights of our network after each
batch. We'll use the nesterov_momentum gradient descent
optimization method to do the job. There are a number of other methods
that Lasagne implements, such as adagrad and rmsprop. We
choose nesterov_momentum because it has proven to work very well
for a large number of problems.
The update_learning_rate defines how large we want the steps of
the gradient descent updates to be. We'll talk a bit more about the
learning_rate and momentum parameters later on. For now, it's
enough to just use these "sane defaults."
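For intuition about what the learning rate and momentum do, here is a small sketch of one common formulation of the Nesterov momentum update, applied to a single scalar parameter on the toy function f(x) = x**2. It is not the library's implementation; the function names and the toy problem are assumptions made for illustration.

```python
def nesterov_momentum_step(param, velocity, grad,
                           learning_rate=0.01, momentum=0.9):
    """One Nesterov momentum update for a single scalar parameter."""
    # The velocity accumulates a decaying sum of past gradient steps.
    velocity = momentum * velocity - learning_rate * grad
    # The parameter moves by the "looked-ahead" velocity plus the plain step.
    param = param + momentum * velocity - learning_rate * grad
    return param, velocity


def minimize_quadratic(x0=5.0, steps=400):
    """Minimize f(x) = x**2 (gradient 2*x), starting from x0."""
    x, v = x0, 0.0
    for _ in range(steps):
        x, v = nesterov_momentum_step(x, v, grad=2.0 * x)
    return x


x_final = minimize_quadratic()
print(abs(x_final) < 1e-6)
```

A larger learning_rate takes bigger steps per update, while momentum controls how much of the previous direction is carried over.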
Comparison of a few optimization methods (animation by Alec
Radford).
The star denotes the global minimum on the error surface. Notice
that stochastic gradient descent (SGD) without momentum is the
slowest method to converge in this example. We're using Nesterov's
Accelerated Gradient Descent (NAG) throughout this tutorial.
In our definition of NeuralNet we didn't specify an objective
function to minimize. There's again a default for that; for
regression problems it's the mean squared error (MSE).
The last set of parameters declare that we're dealing with a
regression problem (as opposed to classification), that 400 is the
number of epochs we're willing to train, and that we want to print out
information during training by setting verbose=1:
    regression=True,  # flag to indicate we're dealing with regression problem
    max_epochs=400,  # we want to train this many epochs
    verbose=1,
Finally, the last two lines in our script load the data, just as
before, and then train the neural net with it:
    X, y = load()
    net1.fit(X, y)
Running these two lines will output a table that grows one row per
training epoch. In each row, we'll see the current loss (MSE) on the
train set and on the validation set and the ratio between the two.
NeuralNet automatically splits the data provided in X into a
training and a validation set, using 20% of the samples for
validation. (You can adjust this ratio by overriding the
eval_size=0.2 parameter.)
On a reasonably fast GPU, we're able to train for 400 epochs in under
a minute. Notice that the validation loss keeps improving until the
end. (If you let it train longer, it will improve a little more.)
Now how good is a validation loss of 0.0032? How does it compare to
the challenge's benchmark
or the other entries in the leaderboard? Remember that we divided the
target coordinates by 48 when we scaled them to be in the interval
[-1, 1]. Thus, to calculate the root-mean-square error, as that's
what's used in the challenge's leaderboard, based on our MSE loss of
0.003255, we'll take the square root and multiply by 48 again, which
gives roughly 2.74.
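The arithmetic can be checked in a couple of lines. This is only the unit conversion described above, using the MSE value quoted in the text; the variable names are arbitrary.

```python
import math

mse = 0.003255  # validation MSE on targets scaled to [-1, 1]
scale = 48      # the factor the target coordinates were divided by

# Undo the scaling: RMSE in pixel units.
rmse_pixels = math.sqrt(mse) * scale
print(round(rmse_pixels, 2))
```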
This is a reasonable proxy for what our score would be on the Kaggle
leaderboard; at the same time it's assuming that the subset of the
data that we chose to train with follows the same distribution as the
test set, which isn't really the case. My guess is that the score is
good enough to earn us a top ten place in the leaderboard at the time
of writing. Certainly not a bad start! (And for those of you that
are crying out right now because of the lack of a proper test set:
don't.)
The net1 object actually keeps a record of the data that it prints
out in the table. We can access that record through the
train_history_ attribute. Let's draw those two curves:
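A sketch of how those curves could be pulled out and plotted follows. It assumes each entry of train_history_ is a dict with train_loss and valid_loss keys, and that matplotlib is installed for the plotting part; the helper names and the demo values are made up for this example.

```python
def loss_curves(train_history):
    """Return (train_losses, valid_losses) from a list of per-epoch dicts."""
    train_loss = [row['train_loss'] for row in train_history]
    valid_loss = [row['valid_loss'] for row in train_history]
    return train_loss, valid_loss


def plot_loss_curves(train_history):
    """Plot both curves on a log scale (requires matplotlib)."""
    from matplotlib import pyplot
    train_loss, valid_loss = loss_curves(train_history)
    pyplot.plot(train_loss, linewidth=3, label='train')
    pyplot.plot(valid_loss, linewidth=3, label='valid')
    pyplot.grid()
    pyplot.legend()
    pyplot.xlabel('epoch')
    pyplot.ylabel('loss')
    pyplot.yscale('log')
    pyplot.show()


# Made-up history rows, only to show the expected structure:
demo_history = [
    {'epoch': 1, 'train_loss': 0.10, 'valid_loss': 0.12},
    {'epoch': 2, 'train_loss': 0.05, 'valid_loss': 0.07},
]
train_curve, valid_curve = loss_curves(demo_history)
```

With a trained net, the call would be plot_loss_curves(net1.train_history_).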
We can see that our net overfits, but it's not that bad. In
particular, we don't see a point where the validation error gets worse
again, thus it doesn't appear that early stopping, a technique
that's commonly used to avoid overfitting, would be very useful at
this point. Notice that we didn't use any regularization whatsoever,
apart from choosing a small number of neurons in the hidden layer, a
setting that will keep overfitting somewhat in control.
What do the net's predictions look like, then? Let's pick a few
examples from the test set and check:
LeNet5-style convolutional
neural nets are at the heart of deep learning's recent breakthrough in
computer vision. Convolutional layers are different to fully
connected layers; they use a few tricks to reduce the number of
parameters that need to be learned, while retaining high
expressiveness. These are:
local connectivity: neurons are connected only to a subset of
neurons in the previous layer,
weight sharing: weights are shared between a subset of neurons in
the convolutional layer (these neurons form what's called a feature
map).
Units in a convolutional layer actually connect to a 2-d patch of
neurons in the previous layer, a prior that lets them exploit the 2-d
structure in the input.
When using convolutional layers in Lasagne, we have to prepare the
input data such that each sample is no longer a flat vector of 9,216
pixel intensities, but a three-dimensional matrix with shape (c, 0,
1), where c is the number of channels (colors), and 0 and 1
correspond to the x and y dimensions of the input image. In our case,
the concrete shape will be (1, 96, 96), because we're dealing with a
single (gray) color channel only.
A function load2d that wraps the previously written load and
does the necessary transformations is easily coded:
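One possible shape for such a wrapper is sketched below. Since load itself is defined elsewhere, the sketch takes the loader as an argument and uses a stand-in loader with random data; to_2d, load2d_sketch and fake_load are hypothetical names for this illustration.

```python
import numpy as np


def to_2d(X, height=96, width=96):
    """Reshape flat pixel rows of shape (n, h*w) into (n, 1, h, w)."""
    return X.reshape(-1, 1, height, width)


def load2d_sketch(load_fn, **kwargs):
    """Call a loader that returns (X, y) and reshape X for conv layers."""
    X, y = load_fn(**kwargs)
    return to_2d(X), y


def fake_load():
    """Stand-in loader with random data, used only for this demonstration."""
    rng = np.random.RandomState(0)
    return rng.rand(5, 9216), rng.rand(5, 30)


X2d, y2d = load2d_sketch(fake_load)
print(X2d.shape)
```

Only X changes shape; the targets y are passed through untouched.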
We'll build a convolutional neural net with three convolutional layers
and two fully connected layers. Each conv layer is followed by a 2x2
max-pooling layer. Starting with 32 filters, we double the number of
filters with every conv layer. The densely connected hidden layers
both have 500 units.
There's again no regularization in the form of weight decay or
dropout. It turns out that using very small convolutional filters,
such as our 3x3 and 2x2 filters, is again a pretty good regularizer by
itself.
Let's write down the code:
    net2 = NeuralNet(
        layers=[
            ('input', layers.InputLayer),
            ('conv1', layers.Conv2DLayer),
            ('pool1', layers.MaxPool2DLayer),
            ('conv2', layers.Conv2DLayer),
            ('pool2', layers.MaxPool2DLayer),
            ('conv3', layers.Conv2DLayer),
            ('pool3', layers.MaxPool2DLayer),
            ('hidden4', layers.DenseLayer),
            ('hidden5', layers.DenseLayer),
            ('output', layers.DenseLayer),
            ],
        input_shape=(None, 1, 96, 96),
        conv1_num_filters=32, conv1_filter_size=(3, 3), pool1_pool_size=(2, 2),
        conv2_num_filters=64, conv2_filter_size=(2, 2), pool2_pool_size=(2, 2),
        conv3_num_filters=128, conv3_filter_size=(2, 2), pool3_pool_size=(2, 2),
        hidden4_num_units=500, hidden5_num_units=500,
        output_num_units=30, output_nonlinearity=None,

        update_learning_rate=0.01,
        update_momentum=0.9,

        regression=True,
        max_epochs=1000,
        verbose=1,
        )

    X, y = load2d()  # load 2-d data
    net2.fit(X, y)

    # Training for 1000 epochs will take a while. We'll pickle the
    # trained model so that we can load it back later:
    import cPickle as pickle
    with open('net2.pickle', 'wb') as f:
        pickle.dump(net2, f, -1)
Training this neural net is much more computationally costly than the
first one we trained. It takes around 15x as long to train; those
1000 epochs take more than 20 minutes on even a powerful GPU.
However, our patience is rewarded with what's already a much better
model than the one we had before. Let's take a look at the output
when running the script. First comes the list of layers with their
output shapes. Note that the first conv layer produces 32 output
images of size (94, 94), that's one 94x94 output image per filter:
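These shapes can also be derived by hand. The sketch below traces the spatial size through the conv and pool layers of net2, assuming unpadded convolutions with stride 1 (which matches the 94x94 figure above) and non-overlapping 2x2 pooling; the helper names are made up for this example.

```python
def conv_output_size(size, filter_size):
    """Spatial size after an unpadded convolution with stride 1."""
    return size - filter_size + 1


def pool_output_size(size, pool_size):
    """Spatial size after non-overlapping max-pooling."""
    return size // pool_size


def net2_shapes(input_size=96):
    """Trace (channels, height, width) through net2's conv and pool layers."""
    conv_layers = [('conv1', 32, 3), ('conv2', 64, 2), ('conv3', 128, 2)]
    shapes = []
    size = input_size
    for index, (name, num_filters, filter_size) in enumerate(conv_layers, 1):
        size = conv_output_size(size, filter_size)
        shapes.append((name, (num_filters, size, size)))
        size = pool_output_size(size, 2)
        shapes.append(('pool%d' % index, (num_filters, size, size)))
    return shapes


for name, shape in net2_shapes():
    print(name, shape)
```

The last pooling layer therefore hands 128 feature maps of size 11x11 to the first dense layer.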
The predictions of net1 on the left compared to the predictions of
net2.
And then let's compare the learning curves of the first and the second
network:
This looks pretty good; I like the smoothness of the new error curves.
But we do notice that towards the end, the validation error of net2
flattens out much more quickly than the training error. I bet we
could improve that by using more training examples. What if we
flipped the input images horizontally; would we be able to improve
training by doubling the amount of training data this way?
An overfitting net can generally be made to perform better by using
more training data. (And if your unregularized net does not overfit,
you should probably make it larger.)
Data augmentation lets us artificially increase the number of training
examples by applying transformations, adding noise etc. That's
obviously more economic than having to go out and collect more
examples by hand. Augmentation is a very useful tool to have in your
deep learning toolbox.
We mentioned batch iterators already briefly. It is the batch
iterator's job to take a matrix of samples, and split it up in
batches, in our case of size 128. While it does the splitting, the
batch iterator can also apply transformations to the data on the fly.
So to produce those horizontal flips, we don't actually have to double
the amount of training data in the input matrix. Rather, we will just
perform the horizontal flips with 50% chance while we're iterating
over the data. This is convenient, and for some problems it allows us
to produce an infinite number of examples, without blowing up the
memory usage. Also, transformations to the input images can be done
while the GPU is busy processing a previous batch, so they come at
virtually no cost.
Flipping the images horizontally is just a matter of using slicing:
    X, y = load2d()
    X_flipped = X[:, :, :, ::-1]  # simple slice to flip all images

    # plot two images:
    fig = pyplot.figure(figsize=(6, 3))
    ax = fig.add_subplot(1, 2, 1, xticks=[], yticks=[])
    plot_sample(X[1], y[1], ax)
    ax = fig.add_subplot(1, 2, 2, xticks=[], yticks=[])
    plot_sample(X_flipped[1], y[1], ax)
    pyplot.show()
Left shows the original image, right is the flipped image.
In the picture on the right, notice that the target value keypoints
aren't aligned with the image anymore. Since we're flipping the
images, we'll have to make sure we also flip the target values. To do
this, not only do we have to flip the coordinates, we'll also have to
swap target value positions; that's because the flipped
left_eye_center_x no longer points to the left eye in our flipped
image; now it corresponds to right_eye_center_x. Some points like
nose_tip_y are not affected. We'll define a list of tuples,
flip_indices, that holds the information about which columns in the
target vector need to swap places when we flip the image
horizontally. Remember the list of columns was:
Since left_eye_center_x will need to swap places with
right_eye_center_x, we write down the tuple (0, 2). Also
left_eye_center_y needs to swap places with
right_eye_center_y. Thus we write down (1, 3), and so on. In
the end, we have:
    flip_indices = [
        (0, 2), (1, 3),
        (4, 8), (5, 9), (6, 10), (7, 11),
        (12, 16), (13, 17), (14, 18), (15, 19),
        (22, 24), (23, 25),
        ]

    # Let's see if we got it right:
    df = read_csv(os.path.expanduser(FTRAIN))
    for i, j in flip_indices:
        print("# {} -> {}".format(df.columns[i], df.columns[j]))

    # this prints out:
    # left_eye_center_x -> right_eye_center_x
    # left_eye_center_y -> right_eye_center_y
    # left_eye_inner_corner_x -> right_eye_inner_corner_x
    # left_eye_inner_corner_y -> right_eye_inner_corner_y
    # left_eye_outer_corner_x -> right_eye_outer_corner_x
    # left_eye_outer_corner_y -> right_eye_outer_corner_y
    # left_eyebrow_inner_end_x -> right_eyebrow_inner_end_x
    # left_eyebrow_inner_end_y -> right_eyebrow_inner_end_y
    # left_eyebrow_outer_end_x -> right_eyebrow_outer_end_x
    # left_eyebrow_outer_end_y -> right_eyebrow_outer_end_y
    # mouth_left_corner_x -> mouth_right_corner_x
    # mouth_left_corner_y -> mouth_right_corner_y
Our batch iterator implementation will derive from the default
BatchIterator class and override the transform() method only.
Let's see what it looks like when we put it all together:
    from nolearn.lasagne import BatchIterator

    class FlipBatchIterator(BatchIterator):
        flip_indices = [
            (0, 2), (1, 3),
            (4, 8), (5, 9), (6, 10), (7, 11),
            (12, 16), (13, 17), (14, 18), (15, 19),
            (22, 24), (23, 25),
            ]

        def transform(self, Xb, yb):
            Xb, yb = super(FlipBatchIterator, self).transform(Xb, yb)

            # Flip half of the images in this batch at random:
            bs = Xb.shape[0]
            indices = np.random.choice(bs, bs / 2, replace=False)
            Xb[indices] = Xb[indices, :, :, ::-1]

            if yb is not None:
                # Horizontal flip of all x coordinates:
                yb[indices, ::2] = yb[indices, ::2] * -1

                # Swap places, e.g. left_eye_center_x -> right_eye_center_x
                for a, b in self.flip_indices:
                    yb[indices, a], yb[indices, b] = (
                        yb[indices, b], yb[indices, a])

            return Xb, yb
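A quick way to gain confidence in the flip logic is to check that flipping twice gives back the original batch. The sketch below does this with plain NumPy on random data; flip_batch is a hypothetical helper that flips every sample (rather than a random half) and is not part of nolearn.

```python
import numpy as np

FLIP_INDICES = [
    (0, 2), (1, 3), (4, 8), (5, 9), (6, 10), (7, 11),
    (12, 16), (13, 17), (14, 18), (15, 19), (22, 24), (23, 25),
]


def flip_batch(Xb, yb, flip_indices=FLIP_INDICES):
    """Flip every image in the batch horizontally and adjust its targets."""
    Xb = Xb[:, :, :, ::-1].copy()
    yb = yb.copy()
    yb[:, ::2] *= -1           # mirror all x coordinates (scaled to [-1, 1])
    for a, b in flip_indices:  # swap left/right keypoint columns
        yb[:, [a, b]] = yb[:, [b, a]]
    return Xb, yb


rng = np.random.RandomState(42)
X_demo = rng.rand(3, 1, 96, 96)
y_demo = rng.uniform(-1, 1, size=(3, 30))

X_once, y_once = flip_batch(X_demo, y_demo)
X_twice, y_twice = flip_batch(X_once, y_once)
print(np.allclose(X_twice, X_demo), np.allclose(y_twice, y_demo))
```

After one flip, the value in the left_eye_center_x column is the negated right_eye_center_x of the original, which is exactly the swap described above.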
To use this batch iterator for training, we'll pass it as the
batch_iterator_train argument to NeuralNet. Let's define
net3, a network that looks exactly the same as net2 except for
these lines at the very end:
Now we're passing our FlipBatchIterator, but we've also tripled
the number of epochs to train. While each one of our training epochs
will still look at the same number of examples as before (after all,
we haven't changed the size of X), it turns out that training
nevertheless takes quite a bit longer when we use our transforming
FlipBatchIterator. This is because what the network learns
generalizes better this time, and it's arguably harder to learn things
that generalize than to overfit.
So this will maybe take an hour to train. Let's make sure we
pickle the model at the end of training, and then we're ready to go
fetch some tea and biscuits. Or maybe do the laundry:
Comparing the learning with that of net2, we notice that the error on
the validation set after 3000 epochs is indeed about 5% smaller for
the data augmented net. We can see how net2 stops learning anything
useful after 2000 or so epochs, and gets pretty noisy, while net3
continues to improve its validation error throughout, though slowly.
Still seems like a lot of work for only a small gain? We'll find out
if it was worth it in the next section.
What's annoying about our last model is that it already took an hour
to train it, and it's not exactly inspiring to have to wait for your
experiment's results for so long. In this section, we'll talk about a
combination of two tricks to fix that and make the net train much
faster again.
An intuition behind starting with a higher learning rate and
decreasing it during the course of training is this: As we start
training, we're far away from the optimum, and we want to take big
steps towards it and learn quickly. But the closer we get to the
optimum, the lighter we want to step. It's like taking the train
home, but when you enter your door you do it by foot, not by train.

On the importance of initialization and momentum in deep learning
is the title of a talk and a paper by Ilya Sutskever et al. It's
there that we learn about another useful trick to boost deep learning:
namely increasing the optimization method's momentum parameter during
training.
Remember that in our previous model, we initialized learning rate and
momentum with a static 0.01 and 0.9 respectively. Let's change that
such that the learning rate decreases linearly with the number of
epochs, while we let the momentum increase.

NeuralNet allows us to update parameters during training using the
on_epoch_finished hook. We can pass a function to
on_epoch_finished and it'll be called whenever an epoch is
finished. However, before we can assign new values to
update_learning_rate and update_momentum on the fly, we'll
have to change these two parameters to become Theano shared
variables. Thankfully, that's pretty easy:
The callback or list of callbacks that we pass will be called with two
arguments: nn, which is the NeuralNet instance itself, and
train_history, which is the same as nn.train_history_.
Instead of working with callback functions that use hard-coded values,
we'll use a parametrizable class with a __call__ method as our
callback. Let's call this class AdjustVariable. The
implementation is reasonably straight-forward:
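A minimal sketch of such a callback is shown below. It is not the tutorial's AdjustVariable itself: the class names, the start and stop values, the 1-based epoch numbering, and the tiny stand-ins for a shared variable and a net are all assumptions made so the example runs without Theano.

```python
class SharedValue(object):
    """Tiny stand-in for a Theano shared variable (get_value/set_value)."""

    def __init__(self, value):
        self._value = value

    def get_value(self):
        return self._value

    def set_value(self, value):
        self._value = value


class LinearSchedule(object):
    """Callback that moves a named attribute linearly from start to stop."""

    def __init__(self, name, start, stop):
        self.name = name
        self.start, self.stop = start, stop

    def value_at(self, epoch, max_epochs):
        """Scheduled value for a 1-based epoch number."""
        if max_epochs <= 1:
            return self.stop
        fraction = (epoch - 1) / float(max_epochs - 1)
        return self.start + fraction * (self.stop - self.start)

    def __call__(self, nn, train_history):
        epoch = train_history[-1]['epoch']
        new_value = self.value_at(epoch, nn.max_epochs)
        getattr(nn, self.name).set_value(new_value)


class FakeNet(object):
    """Minimal object with the attributes the callback touches."""
    max_epochs = 5
    update_learning_rate = SharedValue(0.03)


# Hypothetical schedule: learning rate from 0.03 down to 0.0001.
schedule = LinearSchedule('update_learning_rate', start=0.03, stop=0.0001)
net = FakeNet()
for epoch in range(1, net.max_epochs + 1):
    schedule(net, [{'epoch': epoch}])
final_lr = net.update_learning_rate.get_value()
```

The same class can drive the momentum upwards by passing a start value smaller than the stop value. With real Theano shared variables the new value would also need to be cast to float32.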
Cool, training is happening much faster now! The train error at
epochs 500 and 1000 is half of what it used to be in net2, before
our adjustments to learning rate and momentum. This time,
generalization seems to stop improving after 750 or so epochs already;
looks like there's no point in training much longer.
What about net5 with the data augmentation switched on?
And again we have much faster training than with net3, and
better results. After 1000 epochs, we're better off than net3 was
after 3000 epochs. What's more, the model trained with data
augmentation is now about 10% better with regard to validation error
than the one without.
Introduced in 2012 in the Improving neural networks by preventing
co-adaptation of feature detectors
paper, dropout is a popular regularization technique that works
amazingly well. I won't go into the details of why it works so well;
you can read about that elsewhere.
Like with any other regularization technique, dropout only makes sense
if we have a network that's overfitting, which is clearly the case for
the net5 network that we trained in the previous section. It's
important to remember to get your net to train nicely and overfit
first, then regularize.
To use dropout with Lasagne, we'll add DropoutLayer layers between
the existing layers and assign dropout probabilities to each one of
them. Here's the complete definition of our new net. I've added a
# ! comment at the end of those lines that were added between this
and net5.
Our network is sufficiently large now to crash Python's pickle with a
maximum recursion error. Therefore we have to increase Python's
recursion limit before we save it:
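Raising the limit is a single standard-library call. The value 10000 below is an assumption; any limit comfortably above what pickling the net needs would do.

```python
import sys

# 10000 is an arbitrary, generous value; CPython's default is usually 1000.
sys.setrecursionlimit(10000)
print(sys.getrecursionlimit())
```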
Also overfitting doesn't seem to be nearly as bad. Though we'll have
to be careful with those numbers: the ratio between training and
validation has a slightly different meaning now since the train error
is evaluated with dropout, whereas the validation error is evaluated
without dropout. A more comparable value for the train error is
this:
    from sklearn.metrics import mean_squared_error
    print(mean_squared_error(net6.predict(X), y))
    # prints something like 0.0010073791
In our previous model without dropout, the error on the train set was
0.000373. So not only does our dropout net perform slightly better,
it overfits much less than what we had before. That's great news,
because it means that we can expect even better performance when we
make the net larger (and more expressive). And that's what we'll try
next: we increase the number of units in the last two hidden layers
from 500 to 1000. Update these lines:
And we're still looking really good with the overfitting! My feeling
is that if we increase the number of epochs to train, this model might
become even better. Let's try it:
Remember those 70% of training data that we threw away in the
beginning? Turns out that's a very bad idea if we want to get a
competitive score in the Kaggle leaderboard. There's quite a bit of
variance in those 70% of data and in the challenge's test set that our
model hasn't seen yet.
So instead of training a single model, let's train a few specialists,
with each one predicting a different set of target values. We'll
train one model that only predicts left_eye_center and
right_eye_center, one only for nose_tip and so on; overall,
we'll have six models. This will allow us to use the full training
dataset, and hopefully get a more competitive score overall.
The six specialists are all going to use exactly the same network
architecture (a simple approach, not necessarily the best). Because
training is bound to take much longer now than before, let's think
about a strategy so that we don't have to wait for max_epochs to
finish, even if the validation error stopped improving much earlier.
This is called early stopping, and we'll write another
on_epoch_finished callback to take care of that. Here's the
implementation:
    class EarlyStopping(object):
        def __init__(self, patience=100):
            self.patience = patience
            self.best_valid = np.inf
            self.best_valid_epoch = 0
            self.best_weights = None

        def __call__(self, nn, train_history):
            current_valid = train_history[-1]['valid_loss']
            current_epoch = train_history[-1]['epoch']
            if current_valid < self.best_valid:
                self.best_valid = current_valid
                self.best_valid_epoch = current_epoch
                self.best_weights = nn.get_all_params_values()
            elif self.best_valid_epoch + self.patience < current_epoch:
                print("Early stopping.")
                print("Best valid loss was {:.6f} at epoch {}.".format(
                    self.best_valid, self.best_valid_epoch))
                nn.load_params_from(self.best_weights)
                raise StopIteration()
You can see that there are two branches inside the __call__: the
first where the current validation score is better than what we've
previously seen, and the second where the best validation epoch was
more than self.patience epochs in the past. In the first case we
store away the weights:
    self.best_weights = nn.get_all_params_values()
In the second case, we set the weights of the network back to those
best_weights before raising StopIteration, signalling to
NeuralNet that we want to stop training.
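To see when the patience condition actually fires, the same two-branch logic can be replayed over a plain list of validation losses. This is a stand-alone sketch with made-up loss values; early_stopping_epoch is a hypothetical helper that only mirrors the bookkeeping, without touching any network weights.

```python
def early_stopping_epoch(valid_losses, patience):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses.

    Epochs are numbered from 1. stop_epoch is None if the end of the
    list is reached without triggering early stopping.
    """
    best_valid = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_valid:
            # First branch: a new best score, remember where it happened.
            best_valid, best_epoch = loss, epoch
        elif best_epoch + patience < epoch:
            # Second branch: the best epoch is more than `patience` epochs ago.
            return epoch, best_epoch
    return None, best_epoch


# Made-up losses: the best value is reached at epoch 3.
losses = [0.9, 0.5, 0.4, 0.45, 0.41, 0.42, 0.43, 0.44]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

Note that with a patience of 3, training continues for three full epochs after the best one and stops on the fourth, because the comparison is strict.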
We already discussed the need for flip_indices in the Data
augmentation section. Remember from section The data that our
load() function takes an optional list of columns to extract.
We'll make use of this feature when we fit the specialist models in a
new function fit_specialists():
    from collections import OrderedDict
    from sklearn.base import clone

    def fit_specialists():
        specialists = OrderedDict()

        for setting in SPECIALIST_SETTINGS:
            cols = setting['columns']
            X, y = load2d(cols=cols)

            model = clone(net)
            model.output_num_units = y.shape[1]
            model.batch_iterator_train.flip_indices = setting['flip_indices']
            # set number of epochs relative to number of training examples:
            model.max_epochs = int(1e7 / y.shape[0])
            if 'kwargs' in setting:
                # an option 'kwargs' in the settings list may be used to
                # set any other parameter of the net:
                vars(model).update(setting['kwargs'])

            print("Training model for columns {} for {} epochs".format(
                cols, model.max_epochs))
            model.fit(X, y)
            specialists[cols] = model

        with open('net-specialists.pickle', 'wb') as f:
            # we persist a dictionary with all models:
            pickle.dump(specialists, f, -1)
There's nothing too spectacular happening here. Instead of training
and persisting a single model, we do it with a list of models that are
saved in a dictionary that maps columns to the trained NeuralNet
instances. Now despite our early stopping, this will still take
forever to train (though by forever I don't mean Google-forever, I mean
maybe half a day on a single GPU); I don't recommend that you actually
run this.
We could of course easily parallelize training these specialist nets
across GPUs, but maybe you don't have the luxury of access to a box
with multiple CUDA GPUs. In the next section we'll talk about another
way to cut down on training time. But let's take a look at the
results of fitting these expensive-to-train specialists first:
Learning curves for six specialist models. The solid lines
represent RMSE on the validation set, the dashed lines errors on
the train set. mean is the mean validation error of all nets
weighted by number of target values. All curves have been scaled
to have the same length on the x axis.
Lastly, this solution gives us a Kaggle leaderboard
score of 2.17 RMSE, which corresponds to the second place at the
time of writing (right behind yours truly).
In the last section of this tutorial, we'll discuss a way to make
training our specialists faster. The idea is this: instead of
initializing the weights of each specialist network at random, we'll
initialize them with the weights that were learned in net6 or
net7. Remember from our EarlyStopping implementation that
copying weights from one network to another is as simple as using the
load_params_from() method. Let's modify the fit_specialists
function to do just that. I'm again marking the lines that changed
compared to the previous implementation with a # ! comment:
    def fit_specialists(fname_pretrain=None):
        if fname_pretrain:  # !
            with open(fname_pretrain, 'rb') as f:  # !
                net_pretrain = pickle.load(f)  # !
        else:  # !
            net_pretrain = None  # !

        specialists = OrderedDict()

        for setting in SPECIALIST_SETTINGS:
            cols = setting['columns']
            X, y = load2d(cols=cols)

            model = clone(net)
            model.output_num_units = y.shape[1]
            model.batch_iterator_train.flip_indices = setting['flip_indices']
            model.max_epochs = int(4e6 / y.shape[0])
            if 'kwargs' in setting:
                # an option 'kwargs' in the settings list may be used to
                # set any other parameter of the net:
                vars(model).update(setting['kwargs'])

            if net_pretrain is not None:  # !
                # if a pretrain model was given, use it to initialize the
                # weights of our new specialist model:
                model.load_params_from(net_pretrain)  # !

            print("Training model for columns {} for {} epochs".format(
                cols, model.max_epochs))
            model.fit(X, y)
            specialists[cols] = model

        with open('net-specialists.pickle', 'wb') as f:
            # this time we're persisting a dictionary with all models:
            pickle.dump(specialists, f, -1)
It turns out that initializing those nets not at random, but by
re-using weights from one of the networks we learned earlier has in
fact two big advantages: One is that training converges much faster;
maybe four times faster in this case. The second advantage is that it
also helps get better generalization; pre-training acts as a
regularizer. Here are the same learning curves as before, but now for
the pre-trained nets:
Learning curves for six specialist models that were pre-trained.
Finally, the score for this solution on the challenge's leaderboard is
2.13 RMSE. Again the second place, but getting closer!
There are probably a dozen ideas that you have that you want to try out.
You can find the source code for the final solution here
to download and play around with. It also includes the bit that
generates a submission file for the Kaggle challenge. Run python
kfkd.py to find out how to use the script on the command-line.
Here are a couple of the more obvious things that you could try out at
this point: Try optimizing the parameters for the individual
specialist networks; this is something that we haven't done so far.
Observe that the six nets that we trained all have different levels of
overfitting. If they're not or hardly overfitting, like for the green
and the yellow net above, you could try to decrease the amount of
dropout. Likewise, if it's overfitting badly, like the black and
purple nets, you could try increasing the amount of dropout. In the
definition of SPECIALIST_SETTINGS we can already add some
net-specific settings; so say we wanted to add more regularization to
the second net, then we could change the second entry of the list to
look like so:
And there's a ton of other things that you could try to tweak. Maybe
you'll try adding another convolutional or fully connected layer? I'm
curious to hear about improvements that you're able to come up with in
the comments.

Edit: Kaggle features this tutorial on their site
where they've included instructions on how to use Amazon GPU instances
to run the tutorial, which is useful if you don't own a CUDA-capable
GPU.