http://karpathy.github.io/2015/10/25/selfie/
Convolutional Neural Networks are great: they recognize things, places and people in your personal photos, signs, people and lights in self-driving cars, crops, forests and traffic in aerial imagery, various anomalies in medical images and all kinds of other useful things. But once in a while these powerful visual recognition models can also be warped for distraction, fun and amusement. In this fun experiment we're going to do just that: We'll take a powerful, 140-million-parameter state-of-the-art Convolutional Neural Network, feed it 2 million selfies from the internet, and train it to classify good selfies from bad ones. Just because it's easy and because we can. And in the process we might learn how to take better selfies :)
Yeah, I'll do real work. But first, let me tag a #selfie.
Convolutional Neural Networks
Before we dive in I thought I should briefly describe what Convolutional Neural Networks (or ConvNets for short) are, in case a reader from a slightly more general audience stumbles by. Basically, ConvNets are a very powerful hammer, and Computer Vision problems are very nails. If you're seeing or reading anything about a computer recognizing things in images or videos, in 2015 it almost certainly involves a ConvNet. Some examples:
Few of many examples of ConvNets being useful. From top left and clockwise: classifying house numbers in Street View images, recognizing bad things in medical images, recognizing Chinese characters, traffic signs, and faces.
I happened to witness this critical juncture in time first hand because the ImageNet challenge was organized over the last few years by Fei-Fei Li's lab (my lab), so I remember when my labmate gasped in disbelief as she noticed the (very strong) ConvNet submission come up in the submission logs. And I remember us pacing around the room trying to digest what had just happened. In the next few months ConvNets went from obscure models that were shrouded in skepticism to rockstars of Computer Vision, present as a core building block in almost every new Computer Vision paper. The ImageNet challenge reflects this trend: in the 2012 ImageNet challenge there was only one ConvNet entry, while in 2013 and 2014 almost all entries used ConvNets. Also, fun fact, the winning team each year promptly incorporated as a company.
Over the next few years we perfected, simplified, and scaled up the original 2012 "AlexNet" architecture (yes, we give them names). In 2013 there was the "ZFNet", and then in 2014 the "GoogLeNet" (get it? Because it's like LeNet but from Google? hah) and the "VGGNet". Anyway, what we know now is that ConvNets are:
- simple: one operation is repeated over and over, a few tens of times, starting with the raw image.
- fast: they process an image in a few tens of milliseconds
- they work very well (e.g. see this post where I struggle to classify images better than the GoogLeNet)
- and, by the way, in some ways they seem to work similarly to our own visual cortex (see e.g. this paper)
Under the hood
So how do they work? When you peek under the hood you'll find a very simple computational motif repeated over and over. The gif below illustrates the full computational process of a small ConvNet:
Illustration of the inference process.
Now, I explained the first column of activations right after the image, so what's with all the other columns that appear over time? They are the exact same operation repeated over and over, once to get each new column. The next columns will correspond to yet another set of filters being applied to the previous column's responses, gradually detecting more and more complex visual patterns until the last set of filters is computing the probability of entire visual classes (e.g. dog/toad) in the image. Clearly, I'm skimming over some parts but that's the basic gist: it's just convolutions from start to end.
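To make the "one operation repeated over and over" point concrete, here is a toy sketch of a single stage (not the real network, just plain NumPy/SciPy with random filters): every output map is a sum of small filters slid over the previous stage's maps, followed by a simple nonlinearity, and stages are simply stacked on top of each other.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_stage(maps, filter_banks):
    """One ConvNet stage: slide each bank of small filters over all input maps,
    sum the responses, and clip negatives to zero (the ReLU nonlinearity)."""
    out = []
    for bank in filter_banks:                      # one bank of filters -> one output map
        acc = np.zeros_like(maps[0])
        for m, f in zip(maps, bank):               # filter every input map and accumulate
            acc += correlate2d(m, f, mode="same")
        out.append(np.maximum(acc, 0.0))
    return out

image = np.random.rand(64, 64)                                              # stand-in for a photo
stage1 = conv_stage([image], [[np.random.randn(3, 3)] for _ in range(4)])   # 4 random filters
stage2 = conv_stage(stage1, [[np.random.randn(3, 3) for _ in range(4)] for _ in range(8)])
print(len(stage1), len(stage2))                    # 4 maps, then 8 maps, and so on down the network
```

A real ConvNet is the same idea with learned (not random) filters, occasional downsampling, and a final stage that maps the responses to class probabilities.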
Training. We've seen that a ConvNet is a large collection of filters that are applied on top of each other. But how do we know what the filters should be looking for? We don't - we initialize them all randomly and then train them over time. For example, we feed an image to a ConvNet with random filters and it might say that it's 54% sure that's a dog. Then we can tell it that it's in fact a toad, and there is a mathematical process for changing all filters in the ConvNet a tiny amount so as to make it slightly more likely to say toad the next time it sees that same image. Then we just repeat this process tens/hundreds of millions of times, for millions of images. Automagically, different filters along the computational pathway in the ConvNet will gradually tune themselves to respond to important things in the images, such as eyes, then heads, then entire bodies etc.
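As a hedged toy illustration of that "tiny change" (a single linear layer with made-up data standing in for the full ConvNet), each update is just a small gradient step that makes the correct class slightly more probable the next time around:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 64 * 64)) * 0.01   # 2 classes (dog, toad), randomly initialized "filters"
image = rng.random(64 * 64)                    # stand-in for one flattened training photo
true_label = 1                                 # 0 = dog, 1 = toad

def class_probs(W, image):
    scores = W @ image
    e = np.exp(scores - scores.max())
    return e / e.sum()                         # softmax, e.g. [0.54 dog, 0.46 toad]

for step in range(100):
    probs = class_probs(W, image)
    grad = probs.copy()
    grad[true_label] -= 1.0                    # gradient of -log p(true class) w.r.t. the scores
    W -= 0.1 * np.outer(grad, image)           # the "tiny amount" of change toward saying "toad"

print("p(toad) after 100 nudges:", round(float(class_probs(W, image)[true_label]), 3))
```

Repeat this for millions of images and millions of parameters and you have, in spirit, the training process described above.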
Examples of what 12 randomly chosen filters in a trained ConvNet get excited about, borrowed from Matthew Zeiler's Visualizing and Understanding Convolutional Networks. Filters shown here are in the 3rd stage of processing and seem to look for honeycomb-like patterns, or wheels/torsos/text, etc. Again, we don't specify this; it emerges by itself and we can inspect it.
Training a ConvNet
The nice thing about ConvNets is that you can feed them images of whatever you like (along with some labels) and they will learn to recognize those labels. In our case we will feed a ConvNet some good and bad selfies, and it will automagically find the best things to look for in the images to tell those two classes apart. So let's grab some selfies:
- I wrote a quick script to gather images tagged with #selfie. I ended up getting about 5 million images (with ConvNets, the more the better, always).
- I narrowed that down with another ConvNet to about 2 million images that contain at least one face.
- Now it is time to decide which of those selfies are good or bad. Intuitively, we want to compute a proxy for how many people have seen each selfie, and then look at the number of likes as a function of that audience size. I took all the users and sorted them by their number of followers, giving a small bonus for each additional tag on the image, on the assumption that extra tags bring more eyes. Then I marched down this sorted list in groups of 100, and sorted those 100 selfies based on their number of likes. I only used selfies that had been online for more than a month, to ensure a near-stable like count. I took the top 50 selfies and assigned them as positives, and I took the bottom 50 and assigned those as negatives. We therefore end up with a binary split of the data into two halves, where we have tried to normalize for the number of people who have probably seen each selfie. In this process I also filtered out people with too few or too many followers, as well as people who used too many tags on the image. (A rough sketch of this labeling procedure follows right after this list.)
- Take the resulting dataset of 1 million good and 1 million bad selfies and train a ConvNet.
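Here is the promised rough sketch of the labeling step. Everything below is illustrative: the record fields, the tag bonus, and the group size are assumptions that merely mirror the procedure described above, not the exact script that was used.

```python
def label_selfies(selfies, group_size=100, keep=50):
    """Split selfies into 'good' and 'bad' halves, roughly controlling for audience size.
    Each selfie is assumed to be a dict like {"followers": int, "n_tags": int, "likes": int, ...}."""
    def audience(s):
        # proxy for how many people saw the selfie: followers, plus a small bonus per extra tag
        return s["followers"] * (1.0 + 0.05 * s["n_tags"])

    positives, negatives = [], []
    ranked = sorted(selfies, key=audience, reverse=True)
    for i in range(0, len(ranked) - group_size + 1, group_size):
        group = sorted(ranked[i:i + group_size], key=lambda s: s["likes"], reverse=True)
        positives.extend(group[:keep])     # most-liked half of a similar-audience group -> good
        negatives.extend(group[-keep:])    # least-liked half -> bad
    return positives, negatives
```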
Example images showing good and bad selfies in our training data. These will be given to the ConvNet as teaching material.
What makes a good #selfie ?
Okay, so we collected 2 million selfies, decided which ones are probably good or bad based on the number of likes they received (controlling for the number of followers), fed all of it to Caffe and trained a ConvNet. The ConvNet "looked" at every one of the 2 million selfies several tens of times, and tuned its filters in a way that best allows it to separate good selfies from bad ones. We can't very easily inspect exactly what it found (it's all jumbled up in 140 million numbers that together define the filters). However, we can set it loose on selfies that it has never seen before and try to understand what it's doing by looking at which images it likes and which ones it does not.

I took 50,000 selfies from my test data (i.e. the ConvNet hasn't seen these before). As a first visualization, the image below shows a continuum, with the best selfies on the top row, the worst selfies on the bottom row, and every row in between interpolating between the two:
A continuum from best (top) to worst (bottom) selfies, as judged by the ConvNet.
Best 100 out of 50,000 selfies, as judged by the Convolutional Neural Network.
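For reference, scoring new selfies with a trained model only takes a few lines. Below is a hedged sketch using Caffe's (old) Python Classifier interface; the prototxt/caffemodel file names, the preprocessing constants, and the assumption that output index 1 is the "good selfie" class are all placeholders rather than the exact setup used here.

```python
import numpy as np
import caffe

# Placeholder network definition and trained weights.
net = caffe.Classifier("deploy.prototxt", "selfie_net.caffemodel",
                       mean=np.array([104.0, 117.0, 123.0]),  # assumed ImageNet-style mean
                       channel_swap=(2, 1, 0),                 # RGB -> BGR
                       raw_scale=255,
                       image_dims=(256, 256))

paths = ["selfie1.jpg", "selfie2.jpg"]                         # hypothetical test images
images = [caffe.io.load_image(p) for p in paths]
probs = net.predict(images, oversample=True)                   # one row of class probabilities per image
ranking = sorted(zip(paths, probs[:, 1]), key=lambda t: t[1], reverse=True)
print(ranking)                                                 # best-looking selfies first
```

Sorting the 50,000 test images by this score and pasting them into a grid produces the kinds of visualizations shown here.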
Judging by the top-ranked images, here is how to take a good selfie:
- Be female. Women are consistently ranked higher than men. In particular, notice that there is not a single guy in the top 100.
- Face should occupy about 1/3 of the image. Notice that the position and pose of the face is quite consistent among the top images. The face always occupies about 1/3 of the image, is slightly tilted, and is positioned in the center and at the top. Which also brings me to:
- Cut off your forehead. What's up with that? It looks like a popular strategy, at least for women.
- Show your long hair. Notice the frequent prominence of long strands of hair running down the shoulders.
- Oversaturate the face. Notice the frequent occurrence of over-saturated lighting, which often makes the face look much more uniform and faded out. Related to that,
- Put a filter on it. Black and White photos seem to do quite well, and most of the top images seem to contain some kind of a filter that fades out the image and decreases the contrast.
- Add a border. You will notice a frequent appearance of horizontal/vertical white borders.
Best few male selfies taken from the top 2,000 selfies.
Let's also look at some of the worst selfies, which the ConvNet is quite certain would not receive a lot of likes. I am showing the images in a much smaller and less identifiable format because my intention is for us to learn about the broad patterns that decrease a selfie's quality, not to shine a light on people who happened to take a bad selfie. Here they are:
Worst 300 out of 50,000 selfies, as judged by the Convolutional Neural Network.
Conversely, judging by the lowest-ranked images, here is what not to do:
- Take selfies in low lighting. Very consistently, darker photos (which usually also contain much more noise) are ranked very low by the ConvNet.
- Frame your head too large. Presumably no one wants to see such an up-close view.
- Take group shots. It's fun to take selfies with your friends but this seems to not work very well. Keep it simple and take up all the space yourself. But not too much space.
Celebrities. As a last fun experiment, I tried to run the ConvNet on a few famous celebrity selfies, and sorted the results with the continuum visualization, where the best selfies are on the top and the ConvNet score decreases to the right and then towards the bottom:
Celebrity selfies as judged by a Convolutional Neural Network. Most attractive selfies: top left, then decreasing in quality first to the right and then towards the bottom. Right click > Open Image in New Tab on this image to see it in higher resolution.
Another one of our rules of thumb (no males) is confidently defied by Chris Pratt's body (also 2nd row), and honorable mentions go to Justin Bieber's raised eyebrows and the Stephen Colbert / Jimmy Fallon duo (3rd row). James Franco's selfie shows quite a lot more skin than Chris', but the ConvNet is not very impressed (4th row). Neither was I.
Lastly, notice again the importance of style. There are several uncontroversially-good-looking people who still appear on the bottom of the list, due to bad framing (e.g. head too large possibly for J Lo), bad lighting, etc.
Exploring the #selfie space
Another fun visualization we can try is to lay out the selfies with t-SNE. t-SNE is a wonderful algorithm that I like to run on nearly anything I can, because it's both very general and very effective: it takes some number of things (e.g. images in our case) and lays them out in such a way that nearby things are similar. You can in fact lay out many things with t-SNE, such as Netflix movies, words, Twitter profiles, ImageNet images, or really anything where you have some number of things and a way of comparing how similar two things are. In our case we will lay out selfies based on how similar the ConvNet perceives them to be. In technical terms, we do this based on the L2 distances between the fc7 activations (the last fully-connected layer) of each pair of selfies. Here is the visualization:
Selfie t-SNE visualization. Here is a link to a higher-resolution version. (9MB)
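If you'd like to reproduce something like this, a minimal sketch with scikit-learn is below. It assumes `features` is an N x 4096 array of fc7 activations, one row per selfie (extracting those from Caffe is omitted here, and the random array is only a stand-in).

```python
import numpy as np
from sklearn.manifold import TSNE

features = np.random.rand(500, 4096).astype(np.float32)   # stand-in for fc7 codes, one row per selfie

# t-SNE maps the 4096-D codes down to 2-D so that selfies with small L2 distance
# between their codes end up close together in the plane.
xy = TSNE(n_components=2, metric="euclidean", init="random").fit_transform(features)
print(xy.shape)   # (500, 2): an (x, y) position per selfie, ready for pasting thumbnails into a grid
```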
Finding the Optimal Crop for a selfie
Another fun experiment we can run is to use the ConvNet to automatically find the best selfie crops. That is, we will take an image, randomly try out many different possible crops and then select the one that the ConvNet thinks looks best. Below are four examples of the process, where I show the original selfies on the left, and the ConvNet-cropped selfies on the right:
Each of the four pairs shows the original image (left) and the crop that was selected by the ConvNet as looking best (right).
Same visualization as above, with originals on left and best crops on right. The one on the right is my favorite.
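The crop search itself is brute force. Here is a hedged sketch, with `score_crop` standing in for cropping the image and running the resulting patch through the trained ConvNet to get the "good selfie" probability:

```python
import random

def best_crop(img_w, img_h, score_crop, n_tries=200, min_frac=0.5):
    """Try many random crops and keep the one the ConvNet likes best.
    `score_crop(x, y, w, h)` is assumed to return p(good selfie) for that crop."""
    best, best_score = None, float("-inf")
    for _ in range(n_tries):
        w = random.randint(int(min_frac * img_w), img_w)   # don't crop below half the original size
        h = random.randint(int(min_frac * img_h), img_h)
        x = random.randint(0, img_w - w)                   # random top-left corner that keeps the crop inside
        y = random.randint(0, img_h - h)
        s = score_crop(x, y, w, h)
        if s > best_score:
            best, best_score = (x, y, w, h), s
    return best, best_score
```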
How good is yours?
Curious about what the network thinks of your selfies? I've packaged the network into a Twitter bot so that you can easily find out. (The bot turns out to be only ~150 lines of Python, including all the Caffe/Tweepy code.) Attach your image to a tweet (or include a link) and mention the bot @deepselfie anywhere in the tweet. The bot will take a look at your selfie and then pitch in with its opinion! For best results link to a square image, otherwise the bot will have to squish it to a square, which degrades the results. The bot should reply within a minute; if it doesn't, something went wrong (try again later).
Example interaction with the Selfie Bot (@deepselfie).
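The bot is mostly glue code around the same scoring network. A hedged sketch of the Tweepy side is below; the credentials are placeholders and `score_selfie` stands in for downloading the attached image and running it through Caffe.

```python
import time
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")   # placeholder credentials
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def score_selfie(image_url):
    return 0.5   # stand-in for fetching the image and running the ConvNet forward pass

last_seen = None
while True:
    for tweet in api.mentions_timeline(since_id=last_seen):
        last_seen = max(last_seen or tweet.id, tweet.id)
        media = tweet.entities.get("media", [])
        if not media:
            continue                                             # no image attached, nothing to judge
        score = score_selfie(media[0]["media_url"])
        api.update_status(
            status="@%s your selfie scores %.0f%%!" % (tweet.user.screen_name, 100 * score),
            in_reply_to_status_id=tweet.id)
    time.sleep(60)                                               # poll for new mentions once a minute
```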
Conclusion
I hope I've given you a taste of how powerful Convolutional Neural Networks are. You give them example images with some labels, they learn to recognize those things automatically, and it all works very well and is very fast (at least at test time, once they're trained). Of course, we've only barely scratched the surface: ConvNets are used as a basic building block in many Neural Networks, not just to classify images/videos but also to segment, detect, and describe, both in the cloud and on robots.

If you'd like to learn more, the best place to start for a beginner right now is probably Michael Nielsen's tutorials. From there I would encourage you to first look at Andrew Ng's Coursera class, and then to go through the course notes/assignments for CS231n. This is a class specifically on ConvNets that I taught together with Fei-Fei at Stanford last Winter quarter. We will also be offering the class again starting January 2016 and you're free to follow along. For more advanced material I would look into Hugo Larochelle's Neural Networks class or the Deep Learning book currently being written by Yoshua Bengio, Ian Goodfellow and Aaron Courville.
Of course, you'll learn much more by doing than by reading, so I'd recommend that you play with 101 Kaggle Challenges, or that you develop your own side projects, in which case I warmly recommend that you not only do but also write about it, and post it in places for all of us to read, for example on /r/machinelearning, which has accumulated a nice community. As for recommended tools, the three common options right now are:
- Caffe (C++, Python/Matlab wrappers), which I used in this post. If you're looking to do basic Image Classification then Caffe is the easiest way to go, in many cases requiring you to write no code, just invoking included scripts.
- Theano-based Deep Learning libraries (Python) such as Keras or Lasagne, which allow more flexibility.
- Torch (C++, Lua), which is what I currently use in my research. I'd recommend Torch for the most advanced users, as it offers a lot of freedom, flexibility, speed, all with quite simple abstractions.
Lastly, there are a few companies out there that aspire to bring Deep Learning to the masses. One example is MetaMind, who offer a web interface that allows you to drag and drop images and train a ConvNet (they handle all of the details in the cloud). MetaMind and Clarifai also offer ConvNet REST APIs.
That's it, see you next time!