http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html
Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)
Deep Neural Networks, especially Convolutional Neural Networks (CNNs),
allow computational models composed of multiple processing layers to
learn representations of data with multiple levels of abstraction. These
methods have dramatically improved the state of the art in visual object
recognition, object detection, text recognition and many other domains
such as drug discovery and genomics.
In addition, many solid papers have been published on this topic, and
some high-quality open-source CNN software packages have been made
available. There are also well-written CNN tutorials and CNN software
manuals. However, there is still a lack of a recent and comprehensive
summary of the details of how to implement an excellent deep
convolutional neural network from scratch. Thus, we collected and
summarized many implementation details for DCNNs. Here we will introduce these extensive implementation details, i.e., tricks or tips, for building and training your own deep networks.
Introduction
We assume you already have basic knowledge of deep learning, and
here we will present the implementation details (tricks or tips) of Deep
Neural Networks, especially CNNs for image-related tasks, mainly in
eight aspects:
1) data augmentation;
2) pre-processing on images;
3) initializations of Networks;
4) some tips during training;
5) selections of activation functions;
6) diverse regularizations;
7) some insights found from figures and finally
8) methods of ensemble multiple deep networks.
Additionally, the
corresponding slides are available at
[slide].
If there are any problems/mistakes in these materials and slides, or if
there is something important/interesting you think should be added,
please feel free to contact me.
Sec. 1: Data Augmentation
Since deep networks need to be trained on a huge number of training
images to achieve satisfactory performance, if the original image data
set contains only a limited number of training images, it is better to do
data augmentation to boost performance. Indeed, data augmentation has
become something you must do when training a deep network.
- There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering.
Moreover, you could try combinations of multiple different processing steps,
e.g., doing rotation and random scaling at the same time. In
addition, you can try to raise the saturation and value (the S and V components
of the HSV color space) of all pixels to a power between 0.25 and 4
(the same power for all pixels within a patch), multiply these values by a factor
between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. Also,
you could add a value between [-0.1, 0.1] to the hue (the H component of
HSV) of all pixels in the image/patch. (A short sketch of this HSV jittering is given after this list.)
- Krizhevsky et al. [1] proposed fancy PCA when training the famous AlexNet
in 2012. Fancy PCA alters the intensities of the RGB channels in
training images. In practice, you first perform PCA on the set of
RGB pixel values throughout your training images. Then, for each
training image, you add the following quantity to each RGB pixel
$I_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T$:
$[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3]^T$,
where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3\times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Please note that each $\alpha_i$
is drawn only once for all the pixels of a particular training image
until that image is used for training again. That is to say, when the
model meets the same training image again, it will randomly draw
another $\alpha_i$ for data augmentation. In [1], they claimed that fancy
PCA "approximately captures an important property of natural
images, namely, that object identity is invariant to changes in the
intensity and color of the illumination". Regarding classification performance, this scheme reduced the top-1 error rate by over 1% in the ImageNet 2012 competition. (A sketch of fancy PCA appears below.)
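A minimal sketch of the HSV color jittering described above, assuming images are float RGB arrays with values in [0, 1] and using matplotlib's color-space helpers; the function name and parameter ranges simply follow the description in the first bullet:

import numpy as np
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb

def hsv_jitter(img, rng=np.random):
    # img: H x W x 3 RGB array with values in [0, 1]
    hsv = rgb_to_hsv(img)
    power = rng.uniform(0.25, 4.0)      # same exponent for all pixels of the patch
    scale = rng.uniform(0.7, 1.4)
    shift = rng.uniform(-0.1, 0.1)
    # raise S and V to a power, rescale, and shift
    hsv[..., 1:] = np.clip(hsv[..., 1:] ** power * scale + shift, 0.0, 1.0)
    # shift the hue and wrap it back into [0, 1]
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-0.1, 0.1)) % 1.0
    return hsv_to_rgb(hsv)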
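And a short sketch of fancy PCA, assuming X_train is an N x 3 matrix of RGB pixel values collected from the training images (the variable names are ours, not from [1]):

import numpy as np

# 1) PCA on the RGB values of the whole training set (done once)
pixels = X_train - X_train.mean(axis=0)      # N x 3 matrix of zero-centered RGB values
cov = np.cov(pixels, rowvar=False)           # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i and p_i (as columns)

def fancy_pca(img, rng=np.random):
    # 2) per image: draw alpha_i ~ N(0, 0.1) and add p_i * alpha_i * lambda_i to every pixel
    alphas = rng.normal(0.0, 0.1, size=3)
    offset = eigvecs @ (alphas * eigvals)    # 3-vector added to each RGB pixel
    return img + offset                      # broadcasts over an H x W x 3 image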
Sec. 2: Pre-Processing
Now we have obtained a large number of training samples
(images/crops), but please do not hurry! Actually, it is necessary to do
pre-processing on these images/crops. In this section, we will
introduce several approaches for pre-processing.
The first and simplest pre-processing approach is to
zero-center the data and then
normalize it, which can be done with two lines of Python code as follows:
>>> X -= np.mean(X, axis = 0)
>>> X /= np.std(X, axis = 0)
where X is the input data (NumIns×NumDim). Another form of this
pre-processing normalizes each dimension so that the min and max along
that dimension are -1 and 1, respectively. It only makes sense to apply
this pre-processing if you have a reason to believe that different input
features have different scales (or units), but they should be of
approximately equal importance to the learning algorithm. In case of
images, the relative scales of pixels are already approximately equal
(and in range from 0 to 255), so it is not strictly necessary to perform
this additional pre-processing step.
Another pre-processing approach similar to the first one is
PCA Whitening.
In this process, the data is first centered as described above. Then,
you can compute the covariance matrix that tells us about the
correlation structure in the data:
>>> X -= np.mean(X, axis = 0)
>>> cov = np.dot(X.T, X) / X.shape[0]
After that, you decorrelate the data by projecting the original (but zero-centered) data into the eigenbasis:
>>> U,S,V = np.linalg.svd(cov)
>>> Xrot = np.dot(X, U)
The last transformation is whitening, which takes the data in the
eigenbasis and divides every dimension by the square root of its
eigenvalue (its standard deviation) to normalize the scale:
>>> Xwhite = Xrot / np.sqrt(S + 1e-5)
Note that a small constant (1e-5 here) is added to prevent division
by zero. One weakness of this transformation is that it can greatly
exaggerate the noise in the data, since it stretches all dimensions
(including the irrelevant dimensions of tiny variance that are mostly
noise) to be of equal size in the input. This can in practice be
mitigated by stronger smoothing (i.e., increasing 1e-5 to be a larger
number).
Please note that we describe these pre-processing approaches here just for
completeness. In practice, these transformations are not used with
Convolutional Neural Networks. However, it is still very important to
zero-center the data, and it is common to see
normalization of every pixel as well.
Sec. 3: Initializations
Now the data is ready. However, before you begin training the network, you have to initialize its parameters.
All Zero Initialization
In the ideal situation, with proper data normalization it is
reasonable to assume that approximately half of the weights will be
positive and half of them will be negative. A reasonable-sounding idea
then might be to set
all the initial weights to zero, which you
expect to be the “best guess” in expectation. But, this turns out to be a
mistake, because if every neuron in the network computes the same
output, then they will also all compute the same gradients during
back-propagation and undergo the exact same parameter updates. In other
words, there is no source of asymmetry between neurons if their weights
are initialized to be the same.
Initialization with Small Random Numbers
Thus, you still want the weights to be very close to zero, but not
identically zero. To this end, you can initialize the neurons' weights
with small random numbers that are very close to zero; this is referred to as
symmetry breaking.
The idea is that the neurons are all random and unique in the
beginning, so they will compute distinct updates and integrate
themselves as diverse parts of the full network. The implementation for
the weights might simply look like $w = a \times N(0, 1)$ with a small
constant $a$ (e.g., 0.01), where $N(0, 1)$
is a zero-mean, unit-standard-deviation Gaussian. It is also possible
to use small numbers drawn from a uniform distribution, but this seems
to have relatively little impact on the final performance in practice.
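In the same NumPy style as the other snippets, a minimal example (the scale 0.01 is just one common illustrative choice):
>>> w = 0.01 * np.random.randn(n)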
Calibrating the Variances
One problem with the above suggestion is that the distribution of the
outputs from a randomly initialized neuron has a variance that grows
with the number of inputs. It turns out that you can normalize the
variance of each neuron's output to 1 by scaling its weight vector by
the square root of its
fan-in (i.e., its number of inputs), which is as follows:
>>> w = np.random.randn(n) / sqrt(n)
where “randn” is the aforementioned Gaussian and “n” is the number of
its inputs. This ensures that all neurons in the network initially have
approximately the same output distribution and empirically improves the
rate of convergence. The detailed derivations can be found on pages
18 to 23 of the slides. Please note that the derivations do
not consider the influence of ReLU neurons.
Current Recommendation
As mentioned above, the previous initialization, which calibrates the
variances of neurons, does not take ReLUs into account. A more recent paper
on this topic by He
et al.
[4]
derives an initialization specifically for ReLUs, reaching the
conclusion that the variance of the weights of a neuron with n inputs should be $2.0/n$, i.e.:
>>> w = np.random.randn(n) * sqrt(2.0/n)
which is the current recommendation for use in practice, as discussed in
[4].
Sec. 4: During Training
Now, everything is ready. Let’s start to train deep networks!
- Filters and pooling size. During training, the size of input images is preferably a power of 2, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., the commonly used ImageNet crop), 384 or 512, etc. Moreover, it is important to employ small filters (e.g., $3\times 3$)
and small strides (e.g., 1) with zero-padding, which not only reduces
the number of parameters, but also improves the accuracy of the whole
deep network. Meanwhile, a special case of the above, i.e.,
$3\times 3$ filters with stride 1 (and padding 1), preserves the spatial size of
images/feature maps. For the pooling layers, the commonly used pooling
size is $2\times 2$. (A quick size check is given after the table caption at the end of this section.)
- Learning rate. In addition, as described in a blog post by Ilya Sutskever [2],
he recommends dividing the gradients by the mini-batch size. Thus, you
should not always change the learning rate (LR) when you change the mini-batch
size. To find an appropriate LR, using the validation
set is an effective way. Usually, a typical value of the LR at the beginning
of training is 0.1. In practice, if you see that you have stopped
making progress on the validation set, divide the LR by 2 (or by 5) and
keep going, which might give you a surprise.
- Fine-tune on pre-trained models. Nowadays, many state-of-the-art deep networks are released by famous research groups, e.g., in the Caffe Model Zoo and by the VGG Group.
Thanks to the wonderful generalization abilities of pre-trained deep
models, you can employ these pre-trained models for your own
applications directly. To further improve the classification
performance on your data set, a very simple yet effective approach is to
fine-tune the pre-trained models on your own data. As shown in the
following table, the two most important factors are the size of the new
data set (small or big) and its similarity to the original data set.
Different fine-tuning strategies can be used in different
situations. For instance, a good case is when your new data set is very
similar to the data used for training the pre-trained models. In that case,
if you have very little data, you can just train a linear classifier on
the features extracted from the top layers of the pre-trained models. If
you have quite a lot of data at hand, please fine-tune a few top layers
of the pre-trained models with a small learning rate. If your own
data set is quite different from the data used for the pre-trained models but
you have enough training images, a larger number of layers should be
fine-tuned on your data, also with a small learning rate, to improve
performance. However, if your data set not only contains little data,
but is also very different from the data used for the pre-trained models, you will
be in trouble. Since the data is limited, it seems better to only train
a linear classifier. Since the data set is very different, it might not
be best to train the classifier on features from the top of the network, which
contain more dataset-specific features. Instead, it might work better
to train an SVM classifier on activations/features from somewhere
earlier in the network. (A minimal feature-extraction sketch is given after the table caption below.)
Fine-tune your data on pre-trained models. Different
strategies of fine-tuning are utilized in different situations. For
example, Caltech-101 is similar to ImageNet, as both are object-centric image data sets, while the Places Database is different from ImageNet, as one is scene-centric and the other is object-centric.
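As a quick sanity check of the filter/stride/padding remark above, the standard output-size formula for a convolutional (or pooling) layer is $(W - F + 2P)/S + 1$; the small arithmetic sketch below (names are ours) verifies that a 3×3 filter with stride 1 and padding 1 keeps the spatial size unchanged, while 2×2 pooling halves it:

def conv_output_size(w, f, s, p):
    # w: input width/height, f: filter size, s: stride, p: zero-padding
    return (w - f + 2 * p) // s + 1

print(conv_output_size(224, f=3, s=1, p=1))   # 224: spatial size preserved
print(conv_output_size(224, f=2, s=2, p=0))   # 112: a 2x2 pooling layer halves it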
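And a minimal sketch of the "train a linear classifier on frozen deep features" strategy, using scikit-learn; the random matrices below are placeholders for the activations you would actually extract from an intermediate layer of a pre-trained network (earlier layers when your data is very different from the original training data):

import numpy as np
from sklearn.svm import LinearSVC

# Placeholder "deep features": in practice these would be activations extracted
# from a pre-trained model for your training and validation images.
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(200, 4096), rng.randint(0, 5, 200)
X_val, y_val = rng.randn(50, 4096), rng.randint(0, 5, 50)

clf = LinearSVC(C=1.0)          # linear SVM on the frozen deep features
clf.fit(X_train, y_train)
print("validation accuracy:", clf.score(X_val, y_val))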
Sec. 5: Activation Functions
One of the crucial factors in deep networks is the
activation function, which brings
non-linearity
into the networks. Here we will introduce the details and characteristics of
some popular activation functions and give advice later in this
section.
Sigmoid
The sigmoid non-linearity has the mathematical form $\sigma(x) = 1/(1 + e^{-x})$.
It takes a real-valued number and "squashes" it into a range between 0
and 1. In particular, large negative numbers become 0 and large positive
numbers become 1. The sigmoid function has seen frequent use
historically since it has a nice interpretation as the firing rate of a
neuron: from not firing at all (0) to fully-saturated firing at an
assumed maximum frequency (1).
In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks:
- Sigmoids saturate and kill gradients. A very undesirable
property of the sigmoid neuron is that when the neuron's activation
saturates at either tail of 0 or 1, the gradient at these regions is
almost zero. Recall that during back-propagation, this (local) gradient
will be multiplied by the gradient of the whole objective with respect to
this gate's output. Therefore, if the local gradient is very small, it will
effectively “kill” the gradient and almost no signal will flow through
the neuron to its weights and recursively to its data. Additionally, one
must pay extra caution when initializing the weights of sigmoid neurons
to prevent saturation. For example, if the initial weights are too
large then most neurons would become saturated and the network will
barely learn.
- Sigmoid outputs are not zero-centered. This is undesirable
since neurons in later layers of processing in a Neural Network (more
on this soon) would be receiving data that is not zero-centered. This
has implications for the dynamics during gradient descent, because if the
data coming into a neuron is always positive (e.g., $x > 0$ element-wise in $f = w^T x + b$), then during back-propagation the gradient on the weights $w$ will become either all positive or all negative (depending on the gradient of the whole expression $f$).
This could introduce undesirable zig-zagging dynamics in the gradient
updates for the weights. However, notice that once these gradients are
added up across a batch of data the final update for the weights can
have variable signs, somewhat mitigating this issue. Therefore, this is
an inconvenience but it has less severe consequences compared to the
saturated activation problem above.
tanh(x)
The tanh non-linearity squashes a real-valued number
to the range [-1, 1]. Like the sigmoid neuron, its activations
saturate, but unlike the sigmoid neuron its output is zero-centered.
Therefore, in practice the tanh non-linearity is always preferred to the
sigmoid non-linearity.
Rectified Linear Unit
The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function $f(x) = \max(0, x)$, i.e., the activation is simply thresholded at zero.
There are several pros and cons to using the ReLUs:
- (Pros) Compared to sigmoid/tanh neurons that involve
expensive operations (exponentials, etc.), the ReLU can be implemented
by simply thresholding a matrix of activations at zero. Meanwhile, ReLUs
do not suffer from saturation.
- (Pros) It was found to greatly accelerate (e.g., a factor of 6 in [1])
the convergence of stochastic gradient descent compared to the
sigmoid/tanh functions. It is argued that this is due to its linear,
non-saturating form.
- (Cons) Unfortunately, ReLU units can be fragile during
training and can “die”. For example, a large gradient flowing through a
ReLU neuron could cause the weights to update in such a way that the
neuron will never activate on any datapoint again. If this happens, then
the gradient flowing through the unit will forever be zero from that
point on. That is, the ReLU units can irreversibly die during training
since they can get knocked off the data manifold. For example, you may
find that as much as 40% of your network can be “dead” (i.e., neurons
that never activate across the entire training dataset) if the learning
rate is set too high. With a proper setting of the learning rate this is
less frequently an issue.
Leaky ReLU
Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when $x < 0$, a leaky ReLU has a small negative slope (of 0.01, or so). That is, the function computes $f(x) = \alpha x$ if $x < 0$ and $f(x) = x$ if $x \geq 0$, where $\alpha$
is a small constant. Some people report success with this form of
activation function, but the results are not always consistent.
Parametric ReLU
Nowadays, a broader class of activation functions, namely the
rectified unit family, has been proposed. In the following, we will talk about the variants of ReLU.
ReLU, Leaky ReLU, PReLU and RReLU. In these figures, the negative slope $\alpha$ is learned for PReLU and fixed for Leaky ReLU. For RReLU, $\alpha$ is a random variable that keeps being sampled within a given range during training, and remains fixed during testing.
The first variant is called the
parametric rectified linear unit (PReLU)
[4]. In PReLU, the slope of the negative part is learned from the data rather than pre-defined. He
et al.
[4] claimed that PReLU is a key factor in surpassing human-level performance on the
ImageNet
classification task. The back-propagation and update process of PReLU
is very straightforward and similar to that of the traditional ReLU, as shown
on page 43 of the slides.
Randomized ReLU
The second variant is called the
randomized rectified linear unit (RReLU).
In RReLU, the slopes of the negative parts are randomized within a given range
during training, and then fixed during testing. As mentioned in
[5], in a recent Kaggle
National Data Science Bowl (NDSB)
competition, it was reported that RReLU could reduce overfitting due to
its randomized nature. Moreover, as suggested by the NDSB competition
winner, the random slope during training is sampled from $1/U(3, 8)$,
and at test time it is fixed as its expectation, i.e.,
$2/(3 + 8) = 2/11$.
In
[5], the authors
evaluated classification performance of two state-of-the-art CNN
architectures with different activation functions on the
CIFAR-10,
CIFAR-100 and
NDSB data sets, which are shown in the following tables.
Please note that, for these two networks, each convolutional layer is followed by an activation function. The $a$ reported in these tables actually indicates $1/a$, where $a$ is the aforementioned slope parameter.
From these tables, we can see that the performance of ReLU is not the
best on any of the three data sets. For Leaky ReLU, a larger slope
achieves better accuracy. PReLU is easy to overfit on small
data sets (its training error is the smallest, while its testing error is
not satisfactory), but it still outperforms ReLU. In addition, RReLU is
significantly better than the other activation functions on NDSB, which
shows that RReLU can alleviate overfitting, because this data set has less
training data than CIFAR-10/CIFAR-100.
In conclusion, all three ReLU variants consistently outperform the original
ReLU on these three data sets, and PReLU and RReLU seem to be the better choices.
Moreover, He et al. reported similar conclusions in [4].
Sec. 6: Regularizations
There are several ways of controlling the capacity of Neural Networks to prevent overfitting:
- L2 regularization is perhaps the most common form of
regularization. It can be implemented by penalizing the squared
magnitude of all parameters directly in the objective. That is, for
every weight $w$ in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2\lambda w$.
The L2 regularization has the intuitive interpretation of heavily
penalizing peaky weight vectors and preferring diffuse weight vectors.
- L1 regularization is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda |w|$ to the objective. It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1 |w| + \lambda_2 w^2$ (this is called Elastic net regularization).
The L1 regularization has the intriguing property that it leads the
weight vectors to become sparse during optimization (i.e. very close to
exactly zero). In other words, neurons with L1 regularization end up
using only a sparse subset of their most important inputs and become
nearly invariant to the “noisy” inputs. In comparison, final weight
vectors from L2 regularization are usually diffuse, small numbers. In
practice, if you are not concerned with explicit feature selection, L2
regularization can be expected to give superior performance over L1.
- Max norm constraints. Another form of regularization is to
enforce an absolute upper bound on the magnitude of the weight vector
for every neuron and use projected gradient descent to enforce the
constraint. In practice, this corresponds to performing the parameter
update as normal, and then enforcing the constraint by clamping the
weight vector $\vec{w}$ of every neuron to satisfy $\|\vec{w}\|_2 < c$. Typical values of $c$
are on the order of 3 or 4. Some people report improvements when using
this form of regularization. One of its appealing properties is that the
network cannot "explode" even when the learning rates are set too high,
because the updates are always bounded.
- Dropout is an extremely effective, simple regularization technique recently introduced by Srivastava et al. in [6]
that complements the other methods (L1, L2, max norm). During training,
dropout can be interpreted as sampling a sub-network within the full
Neural Network, and only updating the parameters of the sampled network
based on the input data. (However, the exponential number of possible
sampled networks are not independent because they share parameters.)
During testing no dropout is applied, which can be interpreted as
evaluating an averaged prediction across the exponentially-sized
ensemble of all sub-networks (more about ensembles in the next section).
In practice, a dropout ratio of $p = 0.5$ is a reasonable default, but this can be tuned on validation data. (A minimal sketch follows the figure caption below.)
The most widely used regularization technique, dropout [6]. During training, dropout is implemented by keeping a neuron active only with some probability $p$ (a hyper-parameter), and setting it to zero otherwise. In addition, Google applied for a US patent for dropout in 2014.
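A minimal NumPy sketch of (inverted) dropout for one layer's activations; rescaling the surviving units at training time means nothing special has to be done at test time. This is one common way to implement the idea, not necessarily the exact formulation of [6]:

import numpy as np

p = 0.5  # probability of keeping a unit active; higher means less dropout

def dropout_forward(h, train=True, rng=np.random):
    # h: activations of one layer
    if not train:
        return h                            # test time: no dropout, no extra scaling
    mask = (rng.rand(*h.shape) < p) / p     # drop units and rescale the survivors
    return h * mask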
Sec. 7: Insights from Figures
Finally, with the tips above, you can obtain satisfactory settings
(e.g., data processing, architecture choices and details, etc.) for
your own deep networks. During training, you can draw some figures
to monitor your networks' training effectiveness (a small plotting
sketch is given at the end of this section).
- As we know, the learning rate is very sensitive. As shown in Fig. 1
below, a very high learning rate will cause a quite strange
loss curve. A low learning rate will make your training loss decrease
very slowly even after a large number of epochs. In contrast, a (somewhat) high
learning rate will make the training loss decrease quickly at the beginning,
but it will also drop into a local minimum, so your network might
not achieve satisfactory results in that case. For a good learning
rate, as shown by the red line in Fig. 1, the loss curve decreases smoothly
and it finally achieves the best performance.
- Now let's zoom in on the loss curve. An epoch is one pass over the
whole training data, so there are multiple
mini-batches in each epoch. If we plot the classification loss for every
training batch, the curve looks like Fig. 2. Similar to Fig. 1, if
the trend of the loss curve looks too linear, it indicates that your
learning rate is low; if the loss does not decrease much, it tells you that
the learning rate might be too high. Moreover, the "width" of the curve
is related to the batch size. If the "width" looks too wide, that is to
say the variance between batches is too large, which indicates that you
should increase the batch size.
- Another tip comes from the accuracy curve. As shown in Fig. 3,
the red line is the training accuracy and the green line is the
validation accuracy. When the validation accuracy converges, the gap between
the red line and the green one shows the effectiveness of your deep
networks. If the gap is big, it indicates that your network achieves good
accuracy on the training data but only low accuracy on
the validation set. It is obvious that your deep model overfits the
training set, so you should increase the regularization strength of the
deep network. However, no gap together with a low accuracy level is not a
good thing either: it shows that your deep model has limited learning
capability. In that case, it is better to increase the model capacity for better results.
[Fig. 1: training loss curves under different learning rates; Fig. 2: per-batch classification loss; Fig. 3: training (red) vs. validation (green) accuracy curves.]
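If you log the loss and accuracy at the end of every epoch, a few lines of matplotlib are enough to produce curves like those in Figs. 1–3; train_loss, train_acc and val_acc below are hypothetical lists that you record yourself during training:

import matplotlib.pyplot as plt

def plot_training_curves(train_loss, train_acc, val_acc):
    # train_loss, train_acc, val_acc: one value per epoch
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(train_loss)
    ax1.set_xlabel("epoch"); ax1.set_ylabel("training loss")
    ax2.plot(train_acc, "r", label="train accuracy")
    ax2.plot(val_acc, "g", label="validation accuracy")   # a large gap suggests overfitting
    ax2.set_xlabel("epoch"); ax2.legend()
    plt.show()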
Sec. 8: Ensemble
In machine learning, ensemble methods
[8]
which train multiple learners and then combine them, are a
state-of-the-art kind of learning approach. It is well known that an ensemble
is usually significantly more accurate than a single learner, and
ensemble methods have already achieved great success in many real-world
tasks. In practical applications, especially in challenges or competitions,
almost all of the first-place and second-place winners used ensemble
methods.
Here we introduce several skills for ensemble in the deep learning scenario.
- Same model, different initialization. Use cross-validation
to determine the best hyperparameters, then train multiple models with
the best set of hyperparameters but with different random
initialization. The danger with this approach is that the variety is
only due to initialization.
- Top models discovered during cross-validation. Use
cross-validation to determine the best hyperparameters, then pick the
top few (e.g., 10) models to form the ensemble. This improves the
variety of the ensemble but has the danger of including suboptimal
models. In practice, this can be easier to perform since it does not
require additional retraining of models after cross-validation.
Actually, you could directly select several state-of-the-art deep models
from the Caffe Model Zoo to form an ensemble.
- Different checkpoints of a single model. If training is
very expensive, some people have had limited success taking different
checkpoints of a single network over time (for example, after every
epoch) and using those to form an ensemble. Clearly, this suffers from
some lack of variety, but it can still work reasonably well in practice.
The advantage of this approach is that it is very cheap. (A sketch of prediction averaging is given after this list.)
- Some practical examples. If your vision task is related
to high-level image semantics, e.g., event recognition from still images,
a better ensemble method is to employ multiple deep models trained on
different data sources to extract different and complementary deep
representations. For example, in the Cultural Event Recognition challenge associated with ICCV'15, we utilized five different deep models trained on images from ImageNet, the Places Database and the cultural event images supplied by the competition organizers.
After that, we extracted five complementary deep features and treated
them as multi-view data. Combining the "early fusion" and "late fusion"
strategies described in [7], we achieved one of the best performances and ranked 2nd place in that challenge. Similar to our work, [9] presented the Stacked NN framework to fuse more deep networks at the same time.
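A minimal sketch of the simplest way to combine ensemble members at test time: average their class-probability (softmax) outputs. The models list is a hypothetical stand-in for whatever checkpoints or networks you have; each is assumed to return an N x C matrix of class probabilities for a batch X:

import numpy as np

def ensemble_predict(models, X):
    # models: list of callables, each mapping a batch X to an N x C matrix of
    # class probabilities (e.g., different checkpoints or different architectures)
    probs = np.mean([m(X) for m in models], axis=0)   # average the softmax outputs
    return np.argmax(probs, axis=1)                   # final predicted class per sample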
Miscellaneous
In real world applications, the data is usually
class-imbalanced:
some classes have a large number of images/training instances, while
others have only a very limited number of images. As discussed in a recent
technical report
[10],
when deep CNNs are trained on these imbalanced training sets, the
results show that imbalanced training data can potentially have a
severely negative impact on overall performance in deep networks. For
this issue, the simplest method is to balance the training data by
directly up-sampling and down-sampling the imbalanced data, which is
shown in
[10]. Another interesting solution is a special kind of crop processing used in our challenge solution
[7].
Because the original cultural event images were imbalanced, we
extracted crops only from the classes with a small number of training
images, which on one hand supplies diverse data sources, and on the
other hand alleviates the class-imbalance problem. In addition, you can
adjust the fine-tuning strategy to overcome class imbalance. For
example, you can divide your data set into two parts: one containing
the classes with a large number of training samples
(images/crops); the other containing the classes with only a limited number of
samples. In each part, the class-imbalance problem will not be very
serious. When fine-tuning on your data set, you first
fine-tune on the classes with a large number of training samples
(images/crops), and then continue fine-tuning on the classes
with a limited number of samples. (A small resampling sketch follows.)
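A minimal sketch of the "balance by resampling" idea under a simple assumption: labels is a 1-D integer array of class labels, and every class is up-sampled to the size of the largest class by drawing its indices with replacement (the function name is ours):

import numpy as np

def oversample_indices(labels, rng=np.random):
    # labels: 1-D array of integer class labels for the training set
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()                    # up-sample every class to the largest class size
    balanced = []
    for c in classes:
        idx = np.where(labels == c)[0]
        balanced.append(rng.choice(idx, size=target, replace=True))
    return np.concatenate(balanced)          # indices of a class-balanced training set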
References & Source Links
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012
- A Brief Overview of Deep Learning, which is a guest post by Ilya Sutskever.
- CS231n: Convolutional Neural Networks for Visual Recognition of Stanford University, held by Prof. Fei-Fei Li and Andrej Karpathy.
- K. He, X. Zhang, S. Ren, and J. Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In ICCV, 2015.
- B. Xu, N. Wang, T. Chen, and M. Li. Empirical Evaluation of Rectified Activations in Convolution Network. In ICML Deep Learning Workshop, 2015.
- N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15(Jun):1929−1958, 2014.
- X.-S. Wei, B.-B. Gao, and J. Wu. Deep Spatial Pyramid Ensemble for Cultural Event Recognition. In ICCV ChaLearn Looking at People Workshop, 2015.
- Z.-H. Zhou. Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: Chapman & Hall/CRC, 2012. (ISBN 978-1-4398-3003-1)
- M. Mohammadi, and S. Das. S-NN: Stacked Neural Networks. Project in Stanford CS231n Winter Quarter, 2015.
- P. Hensman, and D. Masko. The Impact of Imbalanced Training Data for Convolutional Neural Networks. Degree Project in Computer Science, DD143X, 2015.