Friday, April 17, 2015

Competing in a data science contest without reading the data

http://blog.mrtz.org/2015/03/09/competition.html

Machine learning competitions have become an extremely popular format for solving prediction and classification problems of all sorts. The most famous example is perhaps the Netflix prize. An even better example is Kaggle, an awesome startup that’s organized more than a hundred competitions over the past few years.
The central component of any competition is the public leaderboard. Competitors can repeatedly submit a list of predictions and see how their predictions perform on a set of holdout labels not available to them. The leaderboard ranks all teams according to their prediction accuracy on the holdout labels. Once the competition closes, all teams are scored on a final test set not used so far. The resulting ranking, called the private leaderboard, determines the winner.
Public leaderboard of the Heritage Health Prize (Source)
In this post, I will describe a method to climb the public leaderboard without even looking at the data. The algorithm is so simple and natural that an unwitting analyst might just run it. We will see that in Kaggle’s famous Heritage Health Prize competition this might have propelled a participant from rank around 150 into the top 10 on the public leaderboard without making progress on the actual problem. The Heritage Health Prize competition ran for two years and had a prize pool of 3 million dollars. Keep in mind though that the standings on the public leaderboard do not affect who gets the money.
The point of this post is to illustrate why maintaining a leaderboard that accurately reflects the true performance of each team is a difficult and deep problem. While there are decades of work on estimating the true performance of a model (or set of models) from a finite sample, the leaderboard application highlights some challenges that while fundamental have only recently seen increased attention. A follow-up post will describe a recent paper with Avrim Blum that gives an algorithm for maintaining a (provably) accurate public leaderboard.
Let me be very clear that my point is not to criticize Kaggle or anyone else organizing machine learning competitions. On the contrary, I’m amazed by how well Kaggle competitions work. In my opinion, they have contributed a tremendous amount of value to both industry and education. I also know that Kaggle has some very smart people thinking hard about how to anticipate problems with competitions.

The Kaggle leaderboard mechanism

At first sight, the Kaggle mechanism looks like the classic holdout method. Kaggle partitions the data into two sets: a training set and a holdout set. The training set is publicly available with both the individual instances and their corresponding class labels. The instances of the holdout set are publicly available as well, but the class labels are withheld. Predicting these missing class labels is the goal of the participant and a valid submission is a list of labels—one for each point in the holdout set.
Kaggle specifies a score function that maps a submission consisting of \(N\) labels to a numerical score, which we assume to be in [0,1]. Think of the score as prediction error (smaller is better). For concreteness, let's fix it to be the misclassification rate: a prediction incurs loss 0 if it matches the corresponding unknown label and loss 1 if it does not. We divide the total loss by the number of predictions to get a score in [0,1].
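As a minimal sketch (my own illustrative Julia, not Kaggle's code), the misclassification-rate score of a 0/1 submission against a 0/1 label vector is just:

# misclassification rate: fraction of predictions that disagree with the labels
score(y_pred, y_true) = sum(y_pred .!= y_true) / length(y_true)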
Kaggle further splits its \(N\) private labels randomly into \(n\) holdout labels and \(N-n\) test labels. Typically, \(n=0.3N\). The public leaderboard is a sorting of all teams according to their score computed only on the \(n\) holdout labels (without using the test labels), while the private leaderboard is the ranking induced by the test labels. I will let \(s_H(y)\) denote the public score of a submission \(y\), i.e., the score according to the public leaderboard. Typically, Kaggle rounds all scores to 5 or 6 digits of precision.
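To make the mechanism concrete, here is a minimal sketch of the random split and the rounded public score (again my own illustrative Julia; split_indices and public_score are hypothetical names, not Kaggle's API):

using Random

# randomly split the N private indices into holdout (a fraction frac) and test indices
function split_indices(N; frac=0.3)
    perm = randperm(N)
    n = round(Int, frac * N)
    return perm[1:n], perm[n+1:end]
end

# public score s_H: misclassification rate on the holdout labels only, rounded to 5 digits
public_score(y_pred, y_true, holdout) =
    round(sum(y_pred[holdout] .!= y_true[holdout]) / length(holdout), digits=5)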

The cautionary tale of wacky boosting

Imagine your humble blogger in a parallel universe: I’m new to this whole machine learning craze. So, I sign up for a Kaggle competition to get some skills. Kaggle tells me that there’s an unknown set of labels \(y\in\{0,1\}^N\) that I need to predict. Well, I know nothing about \(y\). So here’s what I’m going to do. I try out a bunch of random vectors and keep all those that give me a slightly better than expected score. If we’re talking about misclassification rate, the expected score of a random binary vector is 0.5. So, I’m keeping all the vectors with score less than 0.5. Then I recall something about boosting. It tells me that I can boost my accuracy by aggregating all predictors into a single predictor using the majority function. Slightly more formally, here’s what I do:
Algorithm (Wacky Boosting):
  1. Choose \(y_1,\dots,y_k\in\{0,1\}^N\) uniformly at random.
  2. Let \(I = \{ i\in[k] \colon s_H(y_i) < 0.5 \}\).
  3. Output \(\hat y=\mathrm{majority} \{ y_i \colon i \in I \} \), where the majority is component-wise.
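For the curious, here is a small self-contained Julia simulation sketch of these three steps against random labels (the sizes and parameters below are made up for illustration; this is not the code behind the plot that follows):

using Random

N, n, k = 10_000, 4_000, 300
y_true  = rand(0:1, N)                       # the hidden labels; we only ever see scores
holdout = randperm(N)[1:n]                   # indices scored on the public leaderboard
s_H(y)  = round(sum(y[holdout] .!= y_true[holdout]) / n, digits=5)

ys   = [rand(0:1, N) for _ in 1:k]           # step 1: k random submissions
kept = [y for y in ys if s_H(y) < 0.5]       # step 2: keep the lucky ones
# step 3: component-wise majority vote over the kept submissions
yhat = [sum(y[j] for y in kept) > length(kept) / 2 ? 1 : 0 for j in 1:N]

test = setdiff(1:N, holdout)                 # what the private leaderboard would score
println("public score:  ", s_H(yhat))                                        # noticeably below 0.5
println("private score: ", sum(yhat[test] .!= y_true[test]) / length(test))  # stays around 0.5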
Lo and behold, this is what happens:
In this plot, \(n=4000\) and all numbers are averaged over 5 independent repetitions.
As I’m only seeing the public score (bottom red line), I get super excited. I keep climbing the leaderboard! Who would’ve thought that this machine learning thing was so easy? So, I go write a blog post on Medium about Big Data and score a job at DeepCompeting.ly, the latest data science startup in the city. Life is pretty sweet. I pick up indoor rock climbing, sign up for woodworking classes; I read Proust and books about espresso. Two months later the competition closes and Kaggle releases the final score. What an embarrassment! Wacky boosting did nothing whatsoever on the final test set. I get fired from DeepCompeting.ly days before the buyout. My spouse dumps me. The lease expires. I get evicted from my apartment in the Mission. Inevitably, I hike the Pacific Crest Trail and write a novel about it.

What just happened

Let’s understand what went wrong and how you can avoid hiking the Pacific Crest Trail. To start out with, each \(y_i\) has loss around \(1/2\pm1/\sqrt{n}\). We’re selecting the ones whose score is biased below one half. This introduces a bias, and the conditional expected score of each selected vector \(y_i\) is roughly \(1/2-c/\sqrt{n}\) for some constant \(c>0\). Put differently, each selected \(y_i\) gives us a guess about each label in the unknown holdout set \(H\subseteq [N]\) that’s correct with probability \(1/2 + \Omega(1/\sqrt{n})\). Since the public score doesn’t depend on labels outside of \(H\), the conditioning does not affect the final test set; the labels outside of \(H\) are still guessed with no advantage. Finally, we need to argue that the majority vote “boosts” our slightly biased coin tosses into a stronger bias. More formally, we can show that \(\hat y\) gives us a guess for each label in \(H\) that’s correct with probability \[ \frac12 + \Omega\left(\sqrt{k/n}\right). \] Hence, the public score of \(\hat y\) satisfies \[ s_H(\hat y) < \frac12 - \Omega\left(\sqrt{k/n}\right). \] Outside of \(H\), however, we’re just guessing at random with no advantage. To summarize, wacky boosting gains us a bias of about \(\sqrt{k}\) standard deviations on the public score with \(k\) submissions.
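For intuition, here is a back-of-the-envelope version of the majority-vote step (a heuristic CLT approximation, not the formal argument): each selected vector matches a fixed holdout label independently with probability roughly \(1/2 + c/\sqrt{n}\), and \(|I| = \Theta(k)\) with high probability, so for each coordinate \(j\in H\), \[ \Pr\big[\hat y_j = y_j\big] = \Pr\Big[\sum_{i\in I} \mathbf{1}\{(y_i)_j = y_j\} > \tfrac{|I|}{2}\Big] \approx \Phi\Big(2\sqrt{|I|}\cdot\tfrac{c}{\sqrt{n}}\Big) = \frac12 + \Omega\big(\sqrt{k/n}\big), \] where \(\Phi\) denotes the standard normal CDF.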
What’s important is that the same algorithm still “works” even if we don’t get exact scores. All we need is that the rounding error is smaller than the bias of roughly \(1/\sqrt{n}\) that we create. This matters because Kaggle rounds scores to 5 digits of precision, i.e., to an additive error of \(10^{-5}\). In particular, the attack will work so long as \(1/\sqrt{n}\) exceeds \(10^{-5}\), i.e., as long as \(n< 10^{10}\).

Why the holdout method breaks down

The idea behind the holdout method is that the holdout data serve as a fresh sample providing an unbiased and well-concentrated estimate of the true loss of the classifier on the underlying distribution. Why then didn’t the holdout method detect that our wacky boosting algorithm was overfitting? The short answer is that the holdout method is simply not valid in the way it’s used in a competition.
One point of departure from the classic method is that the participants actually do see the data points corresponding to the holdout labels, which can lead to some problems. But that’s not the issue here: even if we don’t look at the holdout data points at all, there’s a fundamental reason why the validity of the classic holdout method breaks down.
The problem is that a submission in general incorporates information about the holdout labels that was previously released through the leaderboard mechanism. As a result, there is a statistical dependence between the holdout data and the submission. Due to this feedback loop, the public score is in general no longer an unbiased estimate of the true score, and we should expect submissions to eventually overfit to the holdout set.
The problem of overfitting to the holdout set is well known. Kaggle’s forums are full of anecdotal evidence reported by various competitors. The primary way Kaggle deals with this problem is by limiting the rate of re-submission and (to some extent) the bit precision of the answers. Of course, this is also the reason why the winners are determined on a separate test set.

Static vs interactive data analysis

Kaggle’s liberal use of the holdout method is just one example of a widespread disconnect between the theory of static data analysis and the practice of interactive data analysis. The holdout method is a static method in that it assumes the model to be independent of the holdout data on which it is evaluated. However, machine learning competitions are interactive, because submissions generally incorporate information from the holdout set.
Static vs Interactive
I contend that most real world data analysis is interactive. Unfortunately, most of the theory on model validation and statistical estimation falls into the static setting requiring independence between method and holdout data. This divide is not inherent though. Indeed, my next post deals with some useful theory for the interactive setting.

The Heritage Health Prize leaderboard

Let’s see how this could’ve been applied to an actual competition. Of course, the Heritage Health Prize competition is long over; we’re about two years too late to the party. Besides, we don’t have the solution file for that competition, so there’s no sure way of knowing how well this approach would’ve worked. Nevertheless, using the information that Kaggle did release, we can build a reasonable model of what the holdout labels might look like and see how well we’d do against that random model.

Generalized wacky boosting

Before we can apply wacky boosting to the Heritage Health Prize, we need to clear two obstacles. First, wacky boosting required the domain to be Boolean, whereas the labels could be arbitrary positive real numbers. Second, the algorithm only gave an advantage over random guessing, which might be too far from the top of the leaderboard to start out with. It turns out that both issues can be resolved nicely by a simple generalization of the previous algorithm. What was really happening in the algorithm is that we had two candidate solutions, the all-ones vector and the all-zeros vector, and we tried out random coordinate-wise combinations of these vectors. The algorithm ends up finding a coordinate-wise combination of the two vectors that improves upon their mean loss, i.e., one half. This way of looking at it generalizes nicely: start from any two candidate submissions and search for a coordinate-wise combination that beats their mean score. The resulting algorithm is just a few lines of Julia code.
using Statistics

# select coordinate-wise: take v1 where v is 1 and v2 where v is 0
combine(v, v1, v2) = v1 .* v .+ v2 .* (1 .- v)

function wackyboost(v1, v2, k, score)
    m = mean([score(v1), score(v2)])
    A = rand(0:1, (length(v1), k))
    # select the columns of A whose combination scores better than the mean score
    a = filter(i -> score(combine(A[:, i], v1, v2)) < m, 1:k)
    # take the component-wise majority vote over all selected columns
    v = float.(A[:, a] * ones(length(a)) .> length(a) / 2)
    return combine(v, v1, v2)
end
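A hypothetical usage sketch (the file names and the leaderboard_score oracle are stand-ins for two existing baseline submissions and the public-score feedback; they are not real files or a real API):

using DelimitedFiles

v1 = vec(readdlm("baseline1.txt"))   # predictions of a first baseline model, one per line (stand-in file)
v2 = vec(readdlm("baseline2.txt"))   # predictions of a second baseline model, one per line (stand-in file)
# leaderboard_score(y) should return the public score of submission y (stand-in oracle)
improved = wackyboost(v1, v2, 700, leaderboard_score)

Note that with v1 the all-ones vector, v2 the all-zeros vector, and the misclassification score, this reduces exactly to the original wacky boosting.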
I worked through the fun exercise of applying this algorithm in a separate Julia notebook. Here’s one picture that came out of it. Don’t treat the numbers as definitive as they all depend on the modeling assumptions I made.
We see an improvement from 0.462311 (rank 146) to 0.451868 (rank 6).
The bottom line is: it seems to work reasonably well (under the various semi-principled modeling assumptions I made). From the looks of it, this might have given you an improvement from around rank 150 to around rank 6 within 700 submissions. Note that there was a single team with 671 submissions. There’s a pretty good gap between number one on the public leaderboard and the rest; while closing it is possible in principle, it took me quite a few more submissions to get to the top in my experiments. I should say, though, that I used the completely generic code from above without any optimizations specific to the competition. I didn’t even look at the data points (as I don’t have them). It’s possible that using the data and some domain knowledge could improve things much further. I chose the Heritage Health Prize because it was the highest-prized Kaggle competition ever (3 million dollars) and it ran for two years with a substantial number of submissions.

How robust is your benchmark?

There’s a broad lesson to be learned from this example. As computer scientists we love numerical benchmarks and rankings of all sorts. They look so objective and scientific that we easily forget how any benchmark is just a proxy for a more complex question. Every once in a while we should step back and ask: How robust is the benchmark? Do improvements in our benchmark really correspond to progress on the original problem? What I’d love to see in all empirical branches of computer science are adversarial robustness evaluations of various benchmarks. How far can we get by gaming rather than by actually making progress towards solving the problem?
Let me end on a positive note. What excites me is that the serious problems we saw in this post actually do have a fix (both in theory and in practice)! So, stay tuned for my next post.

Sunday, April 12, 2015

Deep Learning vs Probabilistic Graphical Models vs Logic

http://quantombone.blogspot.com/2015/04/deep-learning-vs-probabilistic.html

Today, let's take a look at three paradigms that have shaped the field of Artificial Intelligence in the last 50 years: Logic, Probabilistic Methods, and Deep Learning. The empirical, "data-driven", or big-data / deep-learning ideology triumphs today, but that wasn't always the case. Some of the earliest approaches to AI were based on Logic, and the transition from logic to data-driven methods has been heavily influenced by probabilistic thinking, something we will be investigating in this blog post.

Let's take a look back at Logic and Probabilistic Graphical Models and make some predictions on where the field of AI and Machine Learning is likely to go in the near future. We will proceed in chronological order.


Image from Coursera's Probabilistic Graphical Models course

1. Logic and Algorithms (Common-sense "Thinking" Machines)


A lot of early work on Artificial Intelligence was concerned with Logic, Automated Theorem Proving, and manipulating symbols. It should not be a surprise that John McCarthy's seminal 1959 paper on AI had the title "Programs with common sense."

If we peek inside one of the most popular AI textbooks, namely "Artificial Intelligence: A Modern Approach," we immediately notice that the beginning of the book is devoted to search, constraint satisfaction problems, first-order logic, and planning. The third edition's cover (pictured below) looks like a big chess board (because being good at chess used to be a sign of human intelligence), features a picture of Alan Turing (the father of computing theory) as well as a picture of Aristotle (one of the greatest classical philosophers, who had quite a lot to say about intelligence).

The cover of AIMA, the canonical AI text for undergraduate CS students

Unfortunately, logic-based AI brushes the perception problem under the rug, and I've argued quite some time ago that understanding how perception works is really the key to unlocking the secrets of intelligence. Perception is one of those things which is easy for humans and immensely difficult for machines. (To read more see my 2011 blog post, Computer Vision is Artificial Intelligence). Logic is pure and traditional chess-playing bots are very algorithmic and search-y, but the real world is ugly, dirty, and ridden with uncertainty.

I think most contemporary AI researchers agree that Logic-based AI is dead. The kind of world where everything can be perfectly observed, a world with no measurement error, is not the world of robotics and big-data.  We live in the era of machine learning, and numerical techniques triumph over first-order logic.  As of 2015, I pity the fool who prefers Modus Ponens over Gradient Descent.

Logic is great for the classroom, and I suspect that once enough perception problems become "essentially solved" we will see a resurgence in Logic. And while there will be plenty of open perception problems in the future, there will be scenarios where the community can stop worrying about perception and start revisiting these classical ideas. Perhaps in 2020.

Further reading: Logic and Artificial Intelligence from the Stanford Encyclopedia of Philosophy

2. Probability, Statistics, and Graphical Models ("Measuring" Machines)


Probabilistic methods in Artificial Intelligence came out of the need to deal with uncertainty. The middle part of the Artificial Intelligence: A Modern Approach textbook is called "Uncertain Knowledge and Reasoning" and is a great introduction to these methods. If you're picking up AIMA for the first time, I recommend you start with this section. And if you're a student starting out with AI, do yourself a favor and don't skimp on the math.

Intro to PDFs from Penn State's Probability Theory and Mathematical Statistics course

When most people think about probabilistic methods, they think of counting. In layman's terms, it's fair to think of probabilistic methods as fancy counting methods. Let's briefly take a look at what used to be the two competing methods for thinking probabilistically.

Frequentist methods are very empirical -- these methods are data-driven and make inferences purely from data.  Bayesian methods are more sophisticated and combine data-driven likelihoods with magical priors.  These priors often come from first principles or "intuitions" and the Bayesian approach is great for combining heuristics with data to make cleverer algorithms -- a nice mix of the rationalist and empiricist world views.

What is perhaps more exciting than the Frequentist vs. Bayesian flamewar is something known as Probabilistic Graphical Models. This class of techniques comes from computer science, and even though Machine Learning is now a strong component of both a CS and a Statistics degree, the true power of statistics only comes when it is married with computation.

Probabilistic Graphical Models are a marriage of Graph Theory with Probabilistic Methods and they were all the rage among Machine Learning researchers in the mid 2000s. Variational methods, Gibbs Sampling, and Belief Propagation were being pounded into the brains of CMU graduate students when I was in graduate school (2005-2011) and provided us with a superb mental framework for thinking about machine learning problems. I learned most of what I know about Graphical Models from Carlos Guestrin and Jonathan Huang. Carlos Guestrin is now the CEO of GraphLab, Inc (now known as Dato) which is a company that builds large scale products for machine learning on graphs and Jonathan Huang is a senior research scientist at Google.

The video below is a high-level overview of GraphLab, but it serves as a very nice overview of "graphical thinking" and how it fits into the modern data scientist's tool-belt. Carlos is an excellent lecturer, and his presentation is less about the company's product and more about ways of thinking about next-generation machine learning systems.

A Computational Introduction to Probabilistic Graphical Models
by GraphLab, Inc CEO Prof. Carlos Guestrin

If you think that deep learning is going to solve all of your machine learning problems, you should really take a look at the above video. If you're building recommender systems, an analytics platform for healthcare data, designing a new trading algorithm, or building the next generation search engine, Graphical Models are a perfect place to start.

Further reading:
Belief Propagation Algorithm Wikipedia Page
An Introduction to Variational Methods for Graphical Models by Michael Jordan et al.
Michael Jordan's webpage (one of the titans of inference and graphical models)

3. Deep Learning and Machine Learning (Data-Driven Machines)

Machine Learning is about learning from examples and today's state-of-the-art recognition techniques require a lot of training data, a deep neural network, and patience. Deep Learning emphasizes the network architecture of today's most successful machine learning approaches.  These methods are based on "deep" multi-layer neural networks with many hidden layers. NOTE: I'd like to emphasize that using deep architectures (as of 2015) is not new.  Just check out the following "deep" architecture from 1998.

LeNet-5 Figure From Yann LeCun's seminal "Gradient-based learning
applied to document recognition" paper.

When you take a look at a modern guide to LeNet, it comes with the following disclaimer:

"To run this example on a GPU, you need a good GPU. It needs at least 1GB of GPU RAM. More may be required if your monitor is connected to the GPU.

When the GPU is connected to the monitor, there is a limit of a few seconds for each GPU function call. This is needed as current GPUs can’t be used for the monitor while doing computation. Without this limit, the screen would freeze for too long and make it look as if the computer froze. This example hits this limit with medium-quality GPUs. When the GPU isn’t connected to a monitor, there is no time limit. You can lower the batch size to fix the time out problem."

It really makes me wonder how Yann was able to get anything out of his deep model back in 1998. Perhaps it's not surprising that it took another decade for the rest of us to get the memo.

UPDATE: Yann pointed out (via a Facebook comment) that the ConvNet work dates back to 1989. "It had about 400K connections and took about 3 weeks to train on the USPS dataset (8000 training examples) on a SUN4 machine." -- LeCun



NOTE: At roughly the same time (~1998) two crazy guys in California were trying to cache the entire internet inside the computers in their garage (they started some funny-sounding company which starts with a G). I don't know how they did it, but I guess sometimes to win big you have to do things that don't scale. Eventually the world will catch up.

Further reading:
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, November 1998.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, Winter 1989

Deep Learning code: Modern LeNet implementation in Theano and docs.


Conclusion

I don't see traditional first-order logic making a comeback anytime soon. And while there is a lot of hype behind deep learning, distributed systems and "graphical thinking" are likely to make a much more profound impact on data science than heavily optimized CNNs. There is no reason why deep learning can't be combined with a GraphLab-style architecture, and some of the exciting new machine learning work of the next decade is likely to be a marriage of these two philosophies.

You can also check out a relevant post from last month:
Deep Learning vs Machine Learning vs Pattern Recognition