Summary: I learn best with toy code that I can play with. This tutorial teaches DeepMind's Neural Stack machine via a very simple toy example and a short Python implementation. I will also explain my thought process along the way for reading and implementing research papers from scratch, which I hope you will find useful.
I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the feedback!
Developing a Face Recognition System Using Convolutional Neural Network
By Ivan Ozhiganov on May 14, 2015
Artificial neural networks have become
an integral part of our lives and are actively being used in many areas
where traditional algorithmic solutions don’t work well or don’t work at
all. Neural networks are commonly used for text recognition, automated
email spam detection, stock market prediction, contextual online
advertising, and more. One of the most significant and promising areas in which this
technology is rapidly evolving is security. For instance, neural
networks can be used to monitor suspicious banking transactions, as well
as in video surveillance systems or CCTV. Azoft R&D team
has experience with such technology: we are currently working on facial
recognition software. According to our client's requirements, the facial recognition system should be sufficiently invariant to a range of lighting conditions, face position in space, and changing facial expressions.
The system works with a security camera and a relatively small
number (dozens) of people, and should be able to consistently distinguish
strangers from the people it has been trained to recognize. Our work on
the solution consisted of three main phases: data preparation, neural
network training, and recognition.
During the preparation phase, we recorded 300 original images of each
person that the system should recognize as "known". Images with
different facial expressions and head positions were transformed to a
normalized view (to minimize position and rotation differences). Based on
this data, we generated a larger set of samples using semi-randomized
variations of distortion and color filters. As a result, there were
1500-2000 samples per person for neural network training. Whenever the
system receives a face sample for recognition, it transforms that
image to match the uniform appearance of the samples that were used for
training.
We tried several ways of normalizing the available data. Our
system detects the facial features (eyes, nose, chin, cheeks) using the
Haar classifier. By analyzing the angle between the eyes, we can
compensate for head rotation and select the area of interest.
Information about head inclination from the previous frame makes it
possible to hold the face region while the subject's head tilts
continuously, minimizing the iterations of pre-rotating the image received
from the camera and sending it to the classifier.
Face Alignment
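As an illustration of this alignment step, here is a minimal sketch using OpenCV's bundled Haar cascades; the cascade file names, detection parameters, and helper name are our own illustrative assumptions, not Azoft's production code.

# Illustrative sketch only: detect a face, estimate head rotation from the
# eye line, and rotate the face region upright. Parameters are assumptions.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def align_face(gray):
    """Return the detected face region, rotated so the eyes are level."""
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
    if len(eyes) < 2:
        return roi
    (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = eyes[:2]
    p1 = (ex1 + ew1 / 2.0, ey1 + eh1 / 2.0)
    p2 = (ex2 + ew2 / 2.0, ey2 + eh2 / 2.0)
    if p1[0] > p2[0]:
        p1, p2 = p2, p1
    # The angle between the eyes gives the in-plane head rotation to undo.
    angle = np.degrees(np.arctan2(p2[1] - p1[1], p2[0] - p1[0]))
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return cv2.warpAffine(roi, M, (w, h))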
After experimenting with color filters, we settled on a method that
normalizes brightness and contrast, finds the average histogram of our
dataset and applies it to each image individually (including the images
used for recognition). Moreover, distortions such as small rotations,
stretching, displacement, and specular (mirror-like) reflection can also
be applied to our images. This approach, combined with normalization,
significantly reduces the system's sensitivity to changes in face
position.
Distorted Images, Reduced to a Single Histogram
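The histogram part of this normalization can be sketched roughly as follows. This is our own illustration of histogram specification against a dataset-average histogram; the function names and binning are chosen for clarity rather than taken from the project.

# Illustrative sketch: compute the average grayscale histogram of a set of
# face crops, then remap each image so its histogram matches that average.
import numpy as np

def average_histogram(images):
    """images: list of uint8 grayscale arrays."""
    hist = np.zeros(256, dtype=np.float64)
    for img in images:
        h, _ = np.histogram(img, bins=256, range=(0, 256))
        hist += h / img.size
    return hist / len(images)

def match_to_histogram(img, target_hist):
    """Remap pixel values of img so its histogram approximates target_hist."""
    src_hist, _ = np.histogram(img, bins=256, range=(0, 256))
    src_cdf = np.cumsum(src_hist).astype(np.float64)
    src_cdf /= src_cdf[-1]
    tgt_cdf = np.cumsum(target_hist)
    tgt_cdf /= tgt_cdf[-1]
    # For each source intensity, pick the target intensity with the closest CDF value.
    lut = np.searchsorted(tgt_cdf, src_cdf).clip(0, 255).astype(np.uint8)
    return lut[img]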
As the tool for the actual recognition process, we used our internal
implementation of a convolutional neural network, one of the most relevant
options available today, which has already proven itself in the field
of image classification and symbol recognition.
There are several advantages to using Convolutional Neural Networks
for our project. First, this type of network is invariant to small
movements. Second, the network extracts a set
of facial characteristics (feature maps) for each class during
training, keeping their relative position in space. We can
change the architecture of the convolutional network, controlling the number
of layers, their size, and the number of feature maps for each layer.
Structure of Our Neural Network
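For readers who want something concrete to play with, a network of this general shape can be sketched in a few lines. The layer counts, kernel sizes, and input resolution below are illustrative guesses written with Keras, not the exact architecture shown above.

# Illustrative sketch only: a small CNN with one output per "known" person
# plus an extra "unknown" class. All sizes here are assumptions.
from tensorflow.keras import layers, models

def build_face_net(num_known_people, input_shape=(64, 64, 1)):
    model = models.Sequential([
        layers.Conv2D(16, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_known_people + 1, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model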
To observe and debug the influence of these parameters on the network's
performance, we implemented output of the images generated by the neural
network during recognition. This approach can be extended by applying a
de-convolutional neural network, a technique that makes it possible to
visualize the contribution of different parts of the input image to
the feature maps.
Fragment of Debug-Output
Our current face recognition system prototype is trained to recognize
faces that are "unknown" to the system, as well as 10 company
employees.
As a training set of "unknown" faces, we used the Labeled Faces in the
Wild database, as well as 6 sets of faces of employees. Testing was
carried out using D-Link DCS 2230 cameras from a distance of 1-4 meters.
According to testing results, our prototype consistently recognizes
“known” faces and successfully distinguishes "strangers" in good
lighting conditions.
However, the neural network is still sensitive to changes in light
direction and overall low light conditions. For more advanced light
compensation, we are planning to generate a 3D face model using a
smaller set of photos and the positions of the main facial features.
The produced set would include renders of this model in a wide variety of
poses and lighting conditions (surface and local shadows) for each
subject, without actually taking thousands of photos.
The Azoft team also plans to improve the technique of face localization
(right now we are using standard OpenCV cascades), as well as to test
a different camera whose recording quality and speed will not
adversely affect facial feature detection and recognition stability.
Another obstacle we need to overcome is the long time necessary to
train a convolutional neural network. Even for modest-sized sets of
data, it may take a whole day to complete the neural network training.
To overcome this shortcoming, we plan to transfer network training to
the GPU. This will significantly reduce the time it takes to access and
analyze results of changes in the architecture and data. This way, we’ll
be able to conduct more experiments in order to determine the optimum
data format and network architecture.
Convolutional Neural Networks for Object Detection
By Ivan Ozhiganov on February 25, 2016
These days there are so many photo and video surveillance systems in
public areas that you would be hard pressed to find someone who hasn’t
heard about them. "Big Brother" is watching us: on the roads, in
airports, at train stations, when we shop in supermarkets or ride the
underground. Remote monitoring technologies and photo and video
capture programs are widespread in everyday life, but they are also
intended for military and other purposes.
This requires constant support and development of
new methods for automatically processing the captured visual data. In
particular, digital image processing often focuses on object detection
in the picture, including localization and recognition of a particular
class of objects. The Azoft R&D team has extensive experience in dealing with similar challenges. Specifically, we have implemented a face recognition system and worked on a project for road sign recognition.
Today we are going to share our experience on how to do object
detection in images using convolutional neural networks. This type of
neural network has successfully proven itself in our past projects.
Project Overview
In the past, the Haar cascade classifier and the LBP-based classifier
were the best tools for detecting objects in images. However, with the introduction of convolutional neural networks and their proven successful application in computer vision, these cascade classifiers have become a second-best alternative.
We decided to test in practice the effectiveness of convolutional
neural networks for object detection in images. As the object of our
research, we chose license plate and road sign pictures.
The goal of our project was to develop a
convolutional neural network model that allows recognition of objects in
images with a higher quality and performance than cascade classifiers.
The task is broken into four stages:
1. License plate keypoints detection using a convolutional neural network
2. Road signs keypoints detection using a convolutional neural network
3. Road signs detection using a fully convolutional neural network
4. Comparing cascade classifiers and a convolutional neural network for the purpose of license plate detection
We implemented the first three stages simultaneously. Each of them
was the responsibility of a single engineer. At the end, we compared the
final convolutional neural network model from the first stage with
cascade classifier methods according to specific parameters.
The performance of various recognition algorithms is particularly
important for contemporary mobile devices. For this reason, we decided
to test neural networks and cascade classifiers using smartphones.
Implementation
Stage 1. License plate keypoints detection using a convolutional neural network
Over the years, we have used machine learning for several research projects, and for image recognition we’ve often used a dataset of license plate numbers
as the learning base. Seeing as our new experiment required the
detection of specific identical objects in images, our license plate
database was perfectly suited to this task.
First, we decided to train the convolutional neural network to find
keypoints of license plates, framing it as a regression
problem. In our case, the pixels of the image were the independent input
parameters, while the coordinates of the object keypoints were the dependent
output parameters. An example of keypoint detection is shown in
Image 1.
Image 1: Detecting keypoints of license plates
Training a convolutional neural network to find keypoints requires a
dataset with a large number of images of the needed object (no fewer than
1000). The coordinates of the keypoints have to be annotated and listed in
the same order for every image.
Our dataset included several hundred images; however, this wasn't
enough for training the network. Therefore, we decided to enlarge the
dataset via augmentation of the available data. Before starting augmentation,
we annotated the keypoints in the images and divided our dataset into
training and control parts.
We applied the following transformations to the initial image during its augmentation:
Shifts
Resizing
Rotations relative to the center
Mirroring
Affine transformations (they allowed us to rotate and stretch a picture).
Besides this, we resized all the images to 320×240 pixels. Take a look
at an example of augmentation with these transformations in Image 2.
Image 2: Data augmentation
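A rough sketch of such an augmentation step is shown below, assuming OpenCV; the transform ranges and the helper name are illustrative, not the exact values used in the project.

# Illustrative sketch: random shift/rotation/scale/mirror applied to an image
# and its keypoints, followed by resizing to the fixed network input size.
import random
import cv2
import numpy as np

def augment(img, keypoints, out_size=(320, 240)):
    """img: H x W image; keypoints: N x 2 array of (x, y) pixel coordinates."""
    h, w = img.shape[:2]
    angle = random.uniform(-10, 10)                               # small rotation, degrees
    scale = random.uniform(0.9, 1.1)                              # slight stretch
    tx, ty = random.uniform(-10, 10), random.uniform(-10, 10)     # shift, pixels
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    M[0, 2] += tx
    M[1, 2] += ty
    img = cv2.warpAffine(img, M, (w, h))
    pts = np.hstack([keypoints, np.ones((len(keypoints), 1))])    # homogeneous coordinates
    keypoints = (M @ pts.T).T
    if random.random() < 0.5:                                     # mirror
        img = cv2.flip(img, 1)
        keypoints[:, 0] = w - keypoints[:, 0]
    # Resize to the fixed input size and rescale the keypoints accordingly.
    img = cv2.resize(img, out_size)
    keypoints *= np.array([out_size[0] / w, out_size[1] / h])
    return img, keypoints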
We chose the Caffe framework
for the first stage because it is one of the most flexible and fastest
frameworks for experiments with convolutional neural networks.
One way to solve a regression task in Caffe is to use the HDF5 file
format. After normalizing pixel values to the range 0 to 1 and
coordinate values to the range -1 to 1, we packed the images and keypoint
coordinates into HDF5. You can find more details in our tutorial (see
below).
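For reference, packing the data this way can be sketched with h5py as follows; the dataset names "data" and "label" follow Caffe's HDF5 data layer convention, while the file name and array shapes are illustrative assumptions.

# Illustrative sketch: write images and normalized keypoint coordinates to HDF5.
import h5py
import numpy as np

def write_hdf5(images, keypoints, path="train.h5"):
    """images: (N, H, W) uint8; keypoints: (N, K, 2) pixel coordinates."""
    n, h, w = images.shape
    # Caffe expects N x C x H x W; normalize pixels to [0, 1].
    data = images.reshape(n, 1, h, w).astype(np.float32) / 255.0
    # Normalize coordinates to [-1, 1] and flatten to one label vector per image.
    labels = keypoints.astype(np.float32).copy()
    labels[..., 0] = labels[..., 0] / (w / 2.0) - 1.0
    labels[..., 1] = labels[..., 1] / (h / 2.0) - 1.0
    labels = labels.reshape(n, -1)
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data)
        f.create_dataset("label", data=labels)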
At the beginning, we applied large network architectures (from 4 to 6
convolutional layers and a large number of convolution kernels). The
models with big architectures demonstrated good results but very low
performance. For this reason, we decided to set up a simpler neural
network architecture while keeping the quality at the same level.
The final architecture of the convolutional neural network for detecting keypoints of license plates was the following:
Image 3: The architecture of convolutional neural network for detecting the keypoints of license plates
While training the neural network we used the optimization method called Adam.
Compared to Nesterov's accelerated gradient descent, which demonstrated a high
loss value even after thorough selection of the momentum and learning
rate parameters, Adam showed the best results. Using the Adam method,
the convolutional neural network was trained with higher quality and
speed.
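As a rough illustration, an Adam solver configuration can be generated from Python like this; the field names follow caffe.proto, but the hyperparameter values and file names are placeholders, not the settings used in the experiments.

# Illustrative sketch: build a Caffe solver definition that uses Adam.
from caffe.proto import caffe_pb2

solver = caffe_pb2.SolverParameter()
solver.net = "keypoint_net.prototxt"    # hypothetical network definition file
solver.type = "Adam"
solver.base_lr = 0.001
solver.momentum = 0.9                   # Adam's beta1
solver.momentum2 = 0.999                # Adam's beta2
solver.lr_policy = "fixed"
solver.max_iter = 100000
solver.snapshot_prefix = "keypoint_adam"

with open("solver.prototxt", "w") as f:
    f.write(str(solver))                # protobuf text format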
We got a neural network that finds the key points of license plates quite well if the plates are not very close to the borders.
Image 4: Detecting the key points of license plates with the obtained convolutional neural network model
For a deeper understanding of how the convolutional neural network
works, we studied the learned convolution kernels and feature maps
on different layers of the network. In the final model, the
convolution kernels showed that the network had learned to respond to
sharp swings in brightness, which appear at the borders and symbols
of a license plate. The feature maps of the final model are shown in the
images below.
The trained kernels on the first and second layers are in pictures 5 and 6.
Image 5: Obtained kernel of the first convolutional layer 7х7
Image 6: Obtained kernel of the second convolutional layer 5х5
Regarding the feature maps of the final model, we took the car
picture (Image 7), converted it to grayscale, and looked
at the obtained maps after the first (Image 8) and the
second (Image 9) convolutional layers.
Image 7: The car picture that is given to the network for viewing feature maps
Image 8: The feature map after the first convolutional layer
Image 9: The feature maps after the second convolutional layer
Finally, we marked the predicted coordinates of the key points and got the desired image (Image 10).
Image 10: Designated picture after direct passage of convolutional neural network
The convolutional neural network was very effective in detecting the
keypoints of license plates. In the majority of cases, the key points of
the license plate borders were recognized correctly. Therefore, we can
speak highly of the performance of the convolutional neural network. An
example of license plate detection using an iPhone is available in the
video.
Stage 2. Road sign keypoint detection using a convolutional neural network
While training a convolutional neural network to find license plates,
we simultaneously worked on training a similar network to find road
signs with speed limits. We also ran these experiments on the
Caffe framework. For training this model we used a dataset of
nearly 1700 pictures of signs, which were augmented to 35000.
In doing this, we successfully trained a neural network to find road signs in images with a size of 200×200 pixels.
A problem appeared when we tried to detect road signs of different
sizes in the image. For this reason, we ran experiments based
on simplified conditions: the network had to find a white circle against a
dark backdrop. We kept the same image size of 200x200 pixels and
allowed the circle size to vary by up to a factor of 3.
In other words, the minimum and maximum radius of the circles in our
dataset differed from each other by a factor of 3.
Finally, we achieved an acceptable outcome only when the radius of
the circle changed by no more than 10%. Experiments with gray circles (the
gray level ranging from 0.1 to 1) demonstrated the same result.
Thus, how to implement object detection when the objects have
different sizes remains an open question.
Examples from the last successfully tested model are shown in Image 11
as a group of pictures. As we can see in the pictures, the network
learned to distinguish between similar signs. If the network pointed at
the image center, it means it did not find an object.
Image 11: Results of the convolutional neural network learning
Regarding the detection of road signs, the convolutional neural
network demonstrated good results. However, different sizes of objects
became an unexpected obstacle. We plan to come back to searching for a
solution to this problem in future projects, as it requires detailed
research.
Stage 3. Road sign detection using a fully convolutional neural network
Another model that we decided to train to find road signs was a fully
convolutional neural network without fully-connected layers. The fully
convolutional network takes an image of a specific size as input
and transforms it into a smaller image at the output.
In effect, the network is a non-linear filter with resizing. In other
words, the network increases sharpness by removing
noise from particular image areas without smearing edges, but at the cost
of reducing the size of the input image.
In the output image, the brightness value of a pixel is equal to 1
where an object is present, and pixels outside the object have a
brightness value of 0. The brightness value of a pixel
in the output image can therefore be read as the probability that the pixel
belongs to the object.
It is important to consider that the output image is smaller than the
input image. That’s why the object coordinates have to be scaled in
accordance with the output image size.
We chose the following architecture to train the network: 4
convolutional layers, with max-pooling (downsampling by taking the
maximum value) after the first layer. We trained the network using a dataset
based on images of road signs with speed limits. When we
finished training and applied the network to our dataset, we binarized
the output and found the connected components. Every component is a
hypothesis about the sign location. Here are the results:
Image 12: Examples of successfully found road signs
We divided images into 4 groups of 3 pictures (see Image 12). The
left side of the image is the input. The right side of the image is the
output. We labeled the central image, which shows the result we would
like to get ideally.
We applied binarization to the output image with a threshold of
0.5 and found the connected components. As a result, we found the
desired rectangle that designates the location of a road sign. The
rectangle is clearly visible in the left image, with the coordinates scaled
back to the input image size.
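This post-processing step can be sketched as follows, assuming OpenCV; the function name, the small-area cutoff, and the scaling arguments are our own illustrative choices.

# Illustrative sketch: turn the network's probability map into sign boxes.
import cv2
import numpy as np

def extract_boxes(prob_map, scale_x, scale_y, threshold=0.5):
    """prob_map: float array in [0, 1] output by the fully convolutional net.
    scale_x/scale_y: ratio of input size to output size, for rescaling boxes."""
    binary = (prob_map > threshold).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, num):          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < 4:                 # drop tiny components as noise (illustrative cutoff)
            continue
        # Each connected component is a hypothesis about a sign location,
        # scaled back to input-image coordinates.
        boxes.append((int(x * scale_x), int(y * scale_y),
                      int(w * scale_x), int(h * scale_y)))
    return boxes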
Nevertheless, this method demonstrated both good results and false positives:
Image 13: Examples of false positives
Overall, we evaluate this model positively. The fully convolutional
neural network correctly found about 80% of the signs in the
independent test sample. A special advantage of this model is that
it can find two identical signs and label each with a rectangle. If
there are no signs in the picture, the network won't mark anything.
Stage 4. Comparing cascade classifiers and a convolutional neural network for the purpose of license plate detection
Through our earlier experiments we came to the conclusion that
convolutional neural networks are fully comparable with cascade
classifiers and even outperform them for some parameters.
To evaluate the quality and performance of different methods for
detecting objects in images, we used the following characteristics:
• Level of precision and recall
Both the convolutional neural network and the Haar classifier
demonstrate a high level of precision and recall for detecting objects
in images. At the same time, the LBP classifier shows a high level of
recall (finding the object quite regularly) but also has a high rate of
false positives and low precision (see the sketch after this list for how
these two metrics are computed).
• Scale invariance
Whereas the Haar cascade classifier and the LBP cascade classifier
demonstrate strong invariance to changes in the scale of the object in the
image, the convolutional neural network cannot manage this in some cases,
which shows its low scale invariance.
• Number of attempts before getting a working model
With cascade classifiers, you need only a few attempts to get a working
model for object detection. Convolutional neural networks do not
give a result so quickly; to achieve the goal, you need to perform dozens
of experiments.
• Time for processing
A convolutional neural network does not require much time for
processing, and the LBP classifier also doesn't need a lot of processing
time. As for the Haar classifier, it takes significantly longer to
process an image.
The average time spent on processing one picture in seconds (not counting the time for capturing and displaying video):
• Consistency with tilting objects
Another great advantage of the convolutional neural network is its
consistency with tilted objects. Neither cascade classifier is
consistent with objects that are tilted in the image.
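As promised above, here is a minimal sketch of how precision and recall are computed for a detector from counts of true positives, false positives, and missed objects; the example numbers are placeholders, not measurements from this comparison.

# Illustrative sketch: precision and recall from detection counts.
def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example (placeholder numbers): a detector that finds 90 of 100 plates
# and raises 10 false alarms.
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=10)
print("precision = %.2f, recall = %.2f" % (p, r))  # precision = 0.90, recall = 0.90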
To summarize, we can say that convolutional neural networks equal
or even outperform cascade classifiers on some parameters. However,
the conditions are that a significant number of
experiments will be required to train the neural network and that the object
scale can't change a lot.
Conclusion
During the process of solving the problem of detecting specific
objects in images, we applied two models based on convolutional neural
networks. The first one is finding the object’s keypoints in images
using a convolutional neural network. The second one is finding the
objects in images via a fully convolutional neural network.
Each of the experiments was very labor-intensive and time-consuming.
Once more we were convinced that the process of training convolutional
neural networks is complicated and requires significant investment to obtain
reliable results of high quality and performance.
The comparative analysis of cascade classifier methods and the
trained convolutional neural network model confirmed our main
hypothesis. The convolutional neural network allows localizing objects
faster and with higher quality than cascade classifiers, provided the object
does not change in scale very much. To solve the problem of low scale
invariance, we will try to increase the number of convolutional layers
in future projects and use a more representative dataset.
If you are interested in the topic of our research project, take a
look at the tutorial we made. Using our tips, you can train a
convolutional neural network to detect the objects in images. We will be
glad to hear from you about your own experiments.
At Instacart, we are revolutionizing the way people buy
groceries. We give busy professionals, parents, and the elderly back valuable
time they don’t have to spend shopping. We also give flexible work
opportunities to thousands of personal shoppers, and we extend the reach
and sales volume for our hundreds of retail partners.
We work incredibly hard to make Instacart easy to use. Our site and
app are intuitive – you fill your shopping cart, pick the hour you want
delivery to occur in, and then the groceries are handed to you at your
doorstep. But achieving this simplicity cost effectively at scale
requires an enormous investment in engineering and data science.
What are a few of the teams where data science plays a critical role at Instacart?
Fulfillment
At its core, Instacart is a real-time logistics platform. We are in
the business of moving goods from A (a store) to B (your front door) as
efficiently and predictably as we can. At any given time, for every
market we operate in, we can have thousands of customers expecting
delivery in the coming hours. We also can have thousands of shoppers
actively working or waiting to be dispatched through our mobile shopper
application.
Our fulfillment algorithm decides in real time how to route those
shoppers to store locations to pick groceries and deliver them to
customers’ doorsteps in as little as one hour. We re-compute this
optimization every minute, because the world is constantly changing. We
have to balance speed (some shoppers shop faster, some stores are less
busy) with efficiency (can we deliver multiple orders simultaneously)
with quality (does the customer get the exact groceries they want) and
timeliness (is the order delivered within the hour it is due – no
earlier, no later).
Example of near-optimal combinations of orders for our drivers to deliver (noise added to addresses to protect privacy)
Optimizing multiple objectives while routing thousands of shoppers
every minute to fulfill millions of orders is a tremendous data science
challenge.
Supply & Demand
Instacart operates a very dynamic and complex fulfillment
marketplace. Our consumers place orders (demand) and our shoppers
fulfill those orders (supply) in as little as an hour. If supply exceeds
demand in a market, we lose money and reduce shopper happiness due to
shoppers sitting idle. If instead demand exceeds supply in a market, we
lose revenue and customers due to limited availability and busy pricing.
Our shoppers work with us to make money, and so they will only be happy
if they’re able to keep busy. On the other side, our customers change
their lifestyles because of our product, and so we need to be there for
them when they want us.
Jeremy and Morgane discussing demand forecasting
Balancing supply and demand requires sophisticated systems for
forecasting customer and shopper behavior down to individual store
locations by hour of day many days into the future. We then create
staffing plans that blend multiple different labor role types to
optimize our efficiency while ensuring high availability for our
customers. This is made even more challenging by the many different
queues we must manage across stores and division of labor. Then in real
time, we have to estimate our capacity for orders every time a user
visits our site or one of our apps, and then dynamically control
availability and busy pricing to smooth demand and create the optimal
customer experience.
These systems operate over multiple time horizons, have to solve for
multiple competing objectives, and control for many erratic sources of
variation (shopper behavior, weather, special events, etc.). We will
always have huge opportunities to make improvements here.
Search & Personalization
Instacart isn’t just grocery delivery, we’re creating a better
grocery shopping experience. A majority of grocery shopping is about
finding the food on your list. In a traditional grocery store, the
search engine is the customer’s two feet. At Instacart, it’s a massive
algorithm that can mine billions of searches to ensure every product a
customer wants is at the edge of their fingertips.
At a physical grocery store, you have to discover new products on
your own. But at Instacart, we can curate the experience for you through
personalization. What could be more personal than food? We have an
intimate relationship with it every day – we put it in our bodies!
As much as movie recommendations were critical to the success of
Netflix, so too are product recommendations critical to Instacart.
The search team – Sharath, Vincent, Raj and Jon from left
Our consumers order large basket sizes of diverse foods over and over
and over again from us. We have more density on our user behavior than
any e-commerce company I have ever seen. We are just beginning to use
that data to provide incredibly valuable personalized experiences for
our users on Instacart in search, in product discovery and in
suggestions we make to our users. We A/B test everything, and are
thinking really hard about the long term impacts of our changes.
Through investments in search and personalization, Instacart has the
opportunity to go beyond convenience in shopping online, and into a
future where everyone finds more food they love faster.
How does data science work at Instacart?
We have made the conscious decision to embed our data scientists into
our product teams, side-by-side with their engineers, designers and
product managers and reporting into the engineering leader for the team.
So to answer this question, you first have to understand how
engineering works at Instacart.
At Instacart, we place a high value on ownership, speed and
ultimately shipping products that have a huge measurable impact. In
engineering, we have organized to optimize for these values. We have
many product teams, each of which have full-stack and/or mobile
developers, designers, analysts, product managers and engineering
leaders dedicated to them. Some teams are only 3 people, others are up
to 10. Each team completely owns their ‘product’, and defines their key
metrics and sets their roadmap weekly.
We align all of these teams to a small (three or fewer) set of
company wide goals that are updated whenever they are achieved or
exceeded. These company goals are concise, measurable and time-bound
objectives set by our board and executive team that the entire company
is committed to. We are obsessively focused on them, and are incredibly
transparent about our status and progress on these goals – our CEO sends
detailed notes on each weekly.
So every product team answers the question every week “what can we do
to have the biggest impact on our company’s goals this week?”. They are
then empowered to do whatever they need to within their
product to achieve those goals. It’s their ideas, their creativity,
their collaboration, their resourcefulness, and their hard work that
really moves the needle.
Jeremy presenting on visualization at a Friday engineering lunch
For technology companies, data science can either be an integral
component to huge value creation, or an expensive and distracting hobby.
Many factors determine the outcome, but how you organize your data
scientists is one of the biggest contributing factors. By embedding our
data scientists into product teams, we’ve ensured that they are as
integral a part of their teams as they can be. As the VP of data
science, it’s my job to make sure that the data scientists stay
connected, have the mentorship they need, and are having the biggest
impact they can within their teams.
The data scientists have a tremendous amount of traction in this
model. Their ideas can directly shape not only product innovation, but
also data collection and infrastructure innovation to fuel future
product ideas. They work directly with their team to bring their
products the ‘last mile’ to production. This lets data scientists put
new ideas into production in days (from inception), and to rapidly
iterate on those ideas as they receive feedback from their consumers.
This also gives data scientists a holistic view of their product, and
helps to ensure they are optimizing for the right objectives as
effectively as possible.
What are some areas Instacart is expecting to invest in data science in the near future?
Shoppers
Our shoppers are very important to our company. They shop for our
customers in the stores, communicate with them live to resolve any
issues, and bring the food to their doorstep thousands of times every
hour. We can use data science to optimize how we on-board these shoppers
and ensure they are successful. We can also optimize and personalize
our shopper application to ensure our shoppers can do their jobs quickly
and effectively.
Partners
In many companies, advertising is a necessary evil. At Instacart, we
have been able to integrate advertising in a way that is a clear win for
the advertiser, for the customer and Instacart! Our Deals
program lets consumer packaged goods companies offer discounts to our
customers (they love them!). Ensuring that customers see the deals they
would be most interested in, and that the advertisers get a high ROI for
their spend is a huge data science opportunity for Instacart.
What do you look for in Data Scientists when recruiting?
Our organizational structure works because we have amazing talent.
You can’t move as fast as we do, with as much distributed ownership as
we have, all while solving challenges like ours without the right
people.
Bala, Mathieu and Sherin discussing batching (from left)
Our values form the cornerstone of our culture, and these in particular are key for hiring data scientists:
Customer Focus
“Everything we do is in service to our customers. We will work
tirelessly to gain the trust of our customers, and to improve their
lives. This is the first priority for everyone at Instacart.”
We seek to understand the problems we work on as holistically as we
can, and to reason through the physics of the system and how our many
constituents (consumers, shoppers, our partners) will experience the
changes we drive. We look for candidates that naturally think about
problems from a “first principles” basis for what is best for the end
user.
Take Ownership
“We will take full ownership of our projects. We take pride in our
work and relentlessly execute to get things completely finished.”
In data science, this means improving algorithms and analyzing data
are never enough. We own the problem, the solution, the implementation
and the measurement – along with everyone else on our team. Simply put,
until the desired impact has been measured, our work isn’t done. We look
for candidates who crave this opportunity for impact.
Sense of Urgency
“We work extremely fast to drive our projects to completion and we will not rest until they are done.”
Many data science teams think about impact in quarters, months or
weeks. Our teams regularly iterate on hard problems in a matter of days –
from R&D to implementation and measurement. We look for candidates
with a bias towards action, and the fortitude to pursue aggressive goals
relentlessly.
Highest Standards
“We put our heart and soul into the projects to deliver the highest
quality work product. We only produce work that we are proud of.”
With ownership and a mandate for urgency comes a great responsibility
– we must maintain the highest standards possible for the work we
produce, as it has the potential to impact millions of consumers,
thousands of shoppers and hundreds of retail partners. We look for
exceptional candidates who can do amazing work, and are always seeking
better ways – be they new algorithms, new processes or new
implementations.
Humility
“We appreciate that great ideas can come from anywhere and we will be
humble and open minded in considering the ideas of others.”
Many of our best data science ideas have come from Instacart
employees in the field – working directly with our shoppers in our
stores, or interacting directly with our customers. Ensuring our eyes
are wide open to these ideas, and that we collaborate openly within our
teams and are always open to questioning our biases and assumptions is
critically important. We look for candidates who are conscious of their
limitations, and always open to the ideas of others – wherever those
ideas may come from.
What Data Science roles is Instacart recruiting for?
We are looking for data scientists with expertise in forecasting, predictive modeling, ads optimization, search and recommendations. We are also looking for operations research scientists
with expertise in planning, logistics and real time control systems.
Our team uses Python, R, SQL (Postgres & Redshift) and Spark
extensively, so mastery of some of those tools and technologies is also
helpful.
Posted by Noah Fiedel, Software Engineer
Machine learning powers many Google product features, from speech recognition in the Google app to Smart Reply in Inbox to search in Google Photos.
While decades of experience have enabled the software industry to
establish best practices for building and supporting products, doing so
for services based upon machine learning introduces new and interesting challenges.
Today, we announce the release of TensorFlow Serving,
designed to address some of these challenges. TensorFlow Serving is a
high performance, open source serving system for machine learning
models, designed for production environments and optimized for TensorFlow.
TensorFlow Serving is ideal for running multiple models, at large scale, that change over time based on real-world data, enabling:
model lifecycle management
experiments with multiple algorithms
efficient use of GPU resources
TensorFlow Serving makes the process of taking a model into
production easier and faster. It allows you to safely deploy new models
and run experiments
while keeping the same server architecture and APIs. Out of the box it
provides integration with TensorFlow, but it can be extended to serve
other types of models.
Here’s how it works. In the simplified, supervised training pipeline
shown below, training data is fed to the learner, which outputs a model:
Once a new model version becomes available, upon validation, it is ready to be deployed to the serving system, as shown below.
TensorFlow
Serving uses the (previously trained) model to perform inference -
predictions based on new data presented by its clients. Since clients
typically communicate with the serving system using a remote procedure call (RPC) interface, TensorFlow Serving comes with a reference front-end implementation based on gRPC, a high performance, open source RPC framework from Google.
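As a rough illustration of such a client, here is a minimal Python sketch against TensorFlow Serving's Predict RPC using the tensorflow-serving-api package; the server address, model name, and input tensor name "x" are assumptions for the example, not details from the announcement.

# Illustrative sketch: a gRPC client sending one Predict request.
# Model name and input tensor name are hypothetical.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                 # hypothetical model name
request.inputs["x"].CopyFrom(
    tf.make_tensor_proto(np.zeros((1, 784), dtype=np.float32)))

response = stub.Predict(request, timeout=5.0)        # RPC to the serving system
print(response.outputs)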
It is quite common to launch and iterate on your model over time, as new
data becomes available, or as you improve the model. In fact, at
Google, many pipelines run continuously, producing new model versions as
new data becomes available.
TensorFlow
Serving is written in C++ and it supports Linux. TensorFlow Serving
introduces minimal overhead. In our benchmarks we recorded ~100,000
queries per second (QPS) per core on a 16 vCPU Intel Xeon E5 2.6 GHz machine, excluding gRPC and the TensorFlow inference processing time.
We are excited to share this important component of TensorFlow today
under the Apache 2.0 open source license. We would love to hear your questions and feature requests on Stack Overflow and GitHub respectively. To get started quickly, clone the code from github.com/tensorflow/serving and check out this tutorial.
You can expect to keep hearing more about TensorFlow as we continue to
develop what we believe to be one of the best machine learning toolboxes
in the world. If you'd like to stay up to date, follow @googleresearch or +ResearchatGoogle, and keep an eye out for Jeff Dean's keynote address at GCP Next 2016 in March.
Researchers
at the University of Zurich, the Università della Svizzera italiana,
and the University of Applied Sciences and Arts of Southern Switzerland
have developed software enabling drones to autonomously detect and
follow forest paths. With the new drones, missing persons can be found
and rescued quickly in forests and mountain areas.
Every year,
thousands of people lose their way in forests and mountain areas. In
Switzerland alone, emergency centers respond to around 1,000 calls
annually from injured and lost hikers. But drones can effectively
complement the work of rescue service teams. Because they are
inexpensive and can be rapidly deployed in large numbers, they
substantially reduce the response time and the risk of injury to missing
persons and rescue teams alike.
A group of researchers from the Dalle Molle Institute for Artificial Intelligence and the University of Zurich has developed artificial intelligence
software to teach a small quadrocopter to autonomously recognize and
follow forest trails. A premiere in the fields of artificial
intelligence and robotics, this success means drones could soon be used
in parallel with rescue teams to accelerate the search for people lost
in the wild.
Breakthrough: Drone Flies Autonomously in Demanding Terrain
"While drones flying at high altitudes are already being used
commercially, drones cannot yet fly autonomously in complex
environments, such as dense forests. In these environments, any little
error may result in a crash, and robots need a powerful brain in order
to make sense of the complex world around them," says Prof. Davide
Scaramuzza from the University of Zurich.
The drone used by the Swiss researchers observes the environment
through a pair of small cameras, similar to those used in smartphones.
Instead of relying on sophisticated sensors, their drone uses very
powerful artificial-intelligence algorithms to interpret the images to
recognize man-made trails. If a trail is visible, the software steers
the drone in the corresponding direction. "Interpreting an image taken
in a complex environment such as a forest is incredibly difficult for a
computer," says Dr. Alessandro Giusti from the Dalle Molle Institute for
Artificial Intelligence. "Sometimes even humans struggle to find the
trail!"
Successful Deep Neural Network Application
The Swiss team solved the problem using a so-called Deep Neural
Network, a computer algorithm that learns to solve complex tasks from a
set of "training examples," much like a brain learns from experience. In
order to gather enough data to "train" their algorithms, the team hiked
several hours along different trails in the Swiss Alps and took more
than 20 thousand images of trails using cameras attached to a helmet.
The effort paid off: When tested on a new, previously unseen trail, the deep neural network
was able to find the correct direction in 85% of cases; in comparison,
humans faced with the same task guessed correctly 82% of the time.
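To make the idea concrete, here is a minimal sketch of the kind of three-way direction classifier described (turn left, go straight, turn right), written with Keras; the layer sizes and input resolution are illustrative assumptions, not the network from the paper.

# Illustrative sketch only: a small CNN that maps a trail image to one of
# three steering directions. All sizes below are assumptions.
from tensorflow.keras import layers, models

def build_trail_net(input_shape=(101, 101, 3)):
    model = models.Sequential([
        layers.Conv2D(32, (4, 4), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(200, activation="relu"),
        # Three outputs: the trail continues to the left, straight ahead, or right.
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model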
Professor Juergen Schmidhuber, Scientific Director at the Dalle Molle
Institute for Artificial Intelligence says: "Our lab has been working
on deep learning in neural networks since the early 1990s. Today I am
happy to find our lab's methods not only in numerous real-world
applications such as speech recognition on smartphones, but also in
lightweight robots such as drones. Robotics will see an explosion of
applications of deep neural networks in coming years."
The research team warns that much work is still needed before a fully
autonomous fleet will be able to swarm forests in search of missing
people. Professor Luca Maria Gambardella, director of the "Dalle Molle
Institute for Artificial Intelligence" in Lugano remarks: "Many
technological issues must be overcome before the most ambitious
applications can become a reality. But small flying robots are
incredibly versatile, and the field is advancing at an unseen pace. One
day robots will work side by side with human rescuers to make our lives
safer." Prof. Davide Scaramuzza from the University of Zurich adds: "Now
that our drones have learned to recognize and follow forest trails, we must teach them to recognize humans."
More information:
Alessandro Giusti et al. A
Machine Learning Approach to Visual Perception of Forest Trails for
Mobile Robots, IEEE Robotics and Automation Letters (2015). DOI: 10.1109/LRA.2015.2509024
Hearing Heartbeat in Audio and Video: A Deep Learning Project
Monday, February 1, 2016
Last term, I ended up taking three machine learning courses, along
with a couple of others. Out of all of the stuff I did last term, the
biggest and best thing I did was work with a few friends on a group
project. This project was:
Automatic Estimation of Heart rate from Audio and Video
Now this might sound like nonsense to most people: "how can you
hear my heart rate? My heart doesn't beat that loudly! And you can
detect my heart rate from video? Witchcraft!?" In fact, you would be almost right.
Several studies like this one
have shown that it is completely possible. In short: Your heartbeat
affects the nerve that operates your voicebox, and the change in
coloration of a person's face does indeed match a heartbeat (your face
becomes more red when your heart pumps blood to it).
So we weren't trying to analyse facial expression or physical
exertion to estimate heart rate, but rather the physical waveform
in your speech and the facial coloration changes that correspond to a
person's heartbeat.
Why Heartrate?
1. Not Much Work Done in the Area
Currently, all of the heavyweight research into this area of machine
learning comes from companies who want to read their customers better,
rather than help them. This means that detecting emotional state has
already had extensive research and is already being used in software. Microsoft is already selling a service which detects the emotion of people in video or images.
2. Potential Benefit
There are lots more theoretical (medical) uses for analysing speech and video of people's faces. Some people theorise that depression could be detected this way (or that it could at least help doctors with a diagnosis), along with heart arrhythmia and potentially a lot more.
Imagine combining these abilities into one product. In the future a
doctor could use something like this to help detect subtle symptoms in
the way a person presents themselves. A whole new quantification of
emotional wellbeing (which could be accurately recorded) would
completely change how so many areas of medicine work.
Being able to detect heart rate (and other stuff) from speech or video of a person's face has a whole bunch of benefits:
Remote: No one needs to be next to the patient to take measurements.
Unintrusive: No device needs to be placed on the patient's body.
Retroactive: It would be possible to look at past
audio and video recordings (that we have, like, a ton of) to help in
diagnosis. Videos of a patient as a child might reveal signs of a
symptom, or could provide new evidence in a court case.
What We Did
The final product was a website where you could upload a video (with a
person talking in it) and have it streamed back to you in real time
with the heart beat of the person in the video. We used both the audio
and video methods of estimating heart rate to get as good an estimate as
possible. Unfortunately we are not hosting it at the moment, as it
would require some fairly powerful hardware to run the server.
Here is a screenshot of our video streaming service. In retrospect I
think superimposing a big red square over the person's face looks a bit
tacky.
We were given a dataset of people talking about
embarrassing/happy/sad/funny moments of their lives, with their pulse
measured. Naturally this meant that the heart rate of the subject would
change during the session, as they would get more tense or relaxed
depending on what they were saying. Our job was to use this dataset to
train our models.
Most of my work on the project was video analysis, data preparation
and a little bit of help with the neural network training, so I'll show
you the stuff I know about in the project.
Enough talk, let's get to the interesting stuff that we did.
Face Tracker
For video analysis, the aim was to monitor the bloodflow in someone's
face. We did this by looking at the green intensity of the subjects'
face (based on this paper).
In order to do this, you need to track the position of a person's
face so you know which pixels belong to it. This is a bit
more tricky than it sounds: most face trackers out there fail really,
really softly. What that means is that if the face tracker is having a
hard time finding the subject in the video, it usually starts
tracking some blip of dust on the wall instead, which would completely
contaminate our data (watch a face tracker for long enough and you will see this happen).
Also, many of these face trackers are quite processor heavy, which
makes the dream of using this in a real time streaming service very
difficult. Some of these trackers could run at around 10fps on a desktop
machine, which is just way too slow.
So instead we used the fast and crappy face tracker in OpenCV. The
problem with this face tracker is that it is completely memoryless. This
means that it doesn't take into account the previous frames to find the
face, which makes it a very wobbly tracker. Also, even when it looked
like the face tracker was tracking the subject perfectly, it would drop
an odd frame, which would flag up the subject as difficult to track, and
move on to the next one.
To solve the wobbliness, we added a layer on top of the tracker which
uses moving averages for the size and position of the face to smooth
out the tracking. The frame loss was solved by adding a ten-frame drop
limit: if the tracker loses the face in the next frame, it re-uses the
position from the previous frame for collecting the green intensity,
for up to ten consecutive frames.
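As a rough sketch (not our actual code), here is what that smoothing layer could look like on top of OpenCV's Haar cascade detector. The class name, the smoothing weight and the green-intensity helper are all made up for illustration.

import cv2
import numpy as np

class SmoothedFaceTracker:
    def __init__(self, alpha=0.3, max_dropped_frames=10):
        cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
        self.detector = cv2.CascadeClassifier(cascade_path)
        self.alpha = alpha                    # weight of the newest detection
        self.max_dropped_frames = max_dropped_frames
        self.box = None                       # smoothed (x, y, w, h)
        self.dropped = 0

    def update(self, frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = self.detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) > 0:
            newest = np.asarray(faces[0], dtype=float)
            # Exponential moving average smooths out the per-frame wobble.
            self.box = newest if self.box is None else \
                self.alpha * newest + (1 - self.alpha) * self.box
            self.dropped = 0
        elif self.box is not None and self.dropped < self.max_dropped_frames:
            # Re-use the previous position for up to ten consecutive misses.
            self.dropped += 1
        else:
            self.box = None                   # give up: face genuinely lost
        return None if self.box is None else tuple(self.box.astype(int))

def mean_green_intensity(frame_bgr, box):
    # Average green channel (index 1 in BGR) inside the smoothed face box.
    x, y, w, h = box
    return frame_bgr[y:y + h, x:x + w, 1].mean()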
Here you can see an example of the face tracker working on a subject
in our training data. Note that the green square is where the face tracker
thinks the face is, and the cyan rectangle is the region of pixels,
positioned relative to the tracked face, from which the green intensity
is extracted.
Neural Nets
We used Keras for
building and running our deep neural nets. If you are curious to try some
deep learning, it gives you all the benefits of high-performance CUDA in a
nicely made Python framework.
These were the four different approaches to audio we attempted. For video, we only used the CNN-RNN architecture.
First Idea
We initially tried to apply a similar strategy to Dieleman and Schrauwen, who used spectrograms
of subjects speaking to detect heart rate. For video, we used the time
series of the green intensity of someone's face to create the
spectrograms; here is what the spectrograms looked like for audio and
video.
The idea is that convolutional neural nets are very good at image
classification, so applying those techniques to a spectrogram
might be a good way of detecting heart rate.
This was not very successful for us. Our best result was a product-moment correlation coefficient of 0.68.
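To make the idea concrete, here is a sketch of how a spectrogram can be computed from a green-intensity time series with SciPy. The 30 fps frame rate and the synthetic signal are assumptions for illustration, not our actual data or code.

import numpy as np
from scipy.signal import spectrogram

fps = 30.0                                   # video frame rate (assumed)
t = np.arange(0, 60, 1 / fps)                # one minute of samples
# Synthetic stand-in for the measured green intensity: a 1.2 Hz "pulse"
# (72 bpm) buried in noise.
green = 0.5 * np.sin(2 * np.pi * 1.2 * t) + np.random.randn(t.size)

freqs, times, power = spectrogram(green, fs=fps, nperseg=256, noverlap=192)
# `power` is a (frequency x time) image that a CNN can be trained on; the
# heart rate should show up as a bright band around 1-2 Hz.
print(power.shape, freqs[power[:, 0].argmax()])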
Recurrent Nets
We then tried recurrent neural networks, which proved to be much
better at detecting heart rate, but still not great. For audio we sliced
the spectrograms into small chunks, and for video, we just used the
time series of green intensity of the face. These turned out to be much
more promising, but still not as good as we hoped.
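For flavour, a minimal Keras sketch of this kind of recurrent model is shown below. The window length, layer sizes and synthetic data are illustrative assumptions, not the architecture we actually trained.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

window = 300                                  # 10 s of video at 30 fps (assumed)
x_train = np.random.randn(512, window, 1)     # stand-in for real green-intensity windows
y_train = np.random.uniform(55, 95, size=(512, 1))  # stand-in bpm labels

model = Sequential([
    LSTM(64, input_shape=(window, 1)),        # summarise the sequence
    Dense(1),                                 # predict beats per minute
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)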
Thoughts
Can you detect heart rate from a video of a person?
Sometimes in research based projects the answer can be as simple as a
yes or no, but in our case, we can't really tell. There are so many
factors which could have contaminated the data a little bit, such as
some of the subjects wearing make-up, which would block coloration of
the face coming through. I do believe that this is an avenue of research
that could really do with more people interested in it; it has the
potential to have a lot of commercial and medical impact.
This glossary is work in progress and I am planning to
continuously update it. If you find a mistake or think an important term
is missing, please let me know in the comments or via email.
Deep Learning terminology can be quite overwhelming to newcomers.
This glossary tries to define commonly used terms and link to original
references and additional resources to help readers dive deeper into a
specific topic.
The boundary between what is Deep Learning vs. “general” Machine
Learning terminology is quite fuzzy. I am trying to keep the glossary
specific to Deep Learning, but these decisions are somewhat arbitrary.
For example, I am not including “cross-validation” here because it’s a
generic technique used all across Machine Learning. However, I’ve
decided to include terms such as softmax or word2vec because they are often associated with Deep Learning even though they are not Deep Learning techniques.
Activation Function
To allow Neural Networks to learn complex decision boundaries, we
apply a nonlinear activation function to some of its layers. Commonly
used functions include sigmoid, tanh, ReLU (Rectified Linear Unit) and variants of these.
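For reference, the common activation functions written out in NumPy:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x))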
Adadelta
Adadelta is a gradient descent-based learning algorithm that adapts
the learning rate per parameter over time. It was proposed as an
improvement over Adagrad, which is more sensitive to hyperparameters and may decrease the learning rate too aggressively. Adadelta is similar to rmsprop and can be used instead of vanilla SGD.
Adagrad
Adagrad is an adaptive learning rate algorithm that keeps track of
the squared gradients over time and automatically adapts the learning
rate per parameter. It can be used instead of vanilla SGD and is
particularly helpful for sparse data, where it assigns a higher learning
rate to infrequently updated parameters.
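A bare-bones sketch of the Adagrad update rule in NumPy (not a drop-in optimizer) makes the per-parameter adaptation concrete:

import numpy as np

def adagrad_update(params, grads, cache, lr=0.01, eps=1e-8):
    cache += grads ** 2                              # accumulate squared gradients
    params -= lr * grads / (np.sqrt(cache) + eps)    # bigger steps for rarely-updated params
    return params, cache

params = np.zeros(3)
cache = np.zeros(3)
for _ in range(100):
    grads = np.array([1.0, 0.1, 0.0])                # one frequent, one rare, one silent feature
    params, cache = adagrad_update(params, grads, cache)
print(params)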
Adam
Adam is an adaptive learning rate algorithm similar to rmsprop, but updates are
directly estimated using a running average of the first and second
moment of the gradient and also include a bias correction term.
Affine Layer
A fully-connected layer in a Neural Network. Affine means that each
neuron in the previous layer is connected to each neuron in the current
layer. In many ways, this is the “standard” layer of a Neural Network.
Affine layers are often added on top of the outputs of Convolutional Neural Networks or Recurrent Neural Networks before making a final prediction. An affine layer is typically of the form y = f(Wx + b), where x are the layer inputs, W the parameters, b a bias vector, and f a nonlinear activation function.
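The formula above, spelled out in NumPy with ReLU as the nonlinearity f (shapes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))         # batch of 32 inputs, 128 features each
W = rng.normal(size=(128, 64)) * 0.01  # weights mapping 128 -> 64 units
b = np.zeros(64)                       # bias vector

y = np.maximum(0.0, x @ W + b)         # f(Wx + b) with f = ReLU
print(y.shape)                         # (32, 64)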
Attention Mechanism
Attention Mechanisms are inspired by human visual attention, the
ability to focus on specific parts of an image. Attention mechanisms can
be incorporated in both Language Processing and Image Recognition
architectures to help the network learn what to “focus” on when making
predictions.
Alexnet
Alexnet is the name of the Convolutional Neural Network architecture
that won the ILSVRC 2012 competition by a large margin and was
responsible for a resurgence of interest in CNNs for Image Recognition.
It consists of five convolutional layers, some of which are followed by
max-pooling layers, and three fully-connected layers with a final
1000-way softmax. Alexnet was introduced in ImageNet Classification with Deep Convolutional Neural Networks.
Autoencoder
An Autoencoder is a Neural Network model whose goal is to predict the
input itself, typically through a “bottleneck” somewhere in the
network. By introducing a bottleneck, we force the network to learn a
lower-dimensional representation of the input, effectively compressing
the input into a good representation. Autoencoders are related to PCA
and other dimensionality reduction techniques, but can learn more
complex mappings due to their nonlinear nature. A wide range of
autoencoder architectures exist, including Denoising Autoencoders, Variational Autoencoders, or Sequence Autoencoders.
Average-Pooling
Average-Pooling is a pooling
technique used in Convolutional Neural Networks for Image Recognition.
It works by sliding a window over patches of features, such as pixels,
and taking the average of all values within the window. It compresses
the input representation into a lower-dimensional representation.
Backpropagation
Backpropagation is an algorithm to efficiently calculate the
gradients in a Neural Network, or more generally, a feedforward
computational graph. It boils down to applying the chain rule of
differentiation starting from the network output and propagating the
gradients backward. The first uses of backpropagation go back to Vapnik
in the 1960’s, but Learning representations by back-propagating errors is often cited as the source.
Backpropagation Through Time (paper) is the Backpropagation algorithm
applied to Recurrent Neural Networks (RNNs). BPTT can be seen as the
standard backpropagation algorithm applied to an RNN, where each time
step represents a layer and the parameters are shared across layers.
Because an RNN shares the same parameters across all time steps, the
errors at one time step must be backpropagated “through time” to all
previous time steps, hence the name. When dealing with long sequences
(hundreds of inputs), a truncated version of BPTT is often used to
reduce the computational cost. Truncated BPTT stops backpropagating the
errors after a fixed number of steps.
Batch Normalization
Batch Normalization is a technique that normalizes layer inputs per
mini-batch. It speeds up training, allows for the use of higher learning
rates, and can act as a regularizer. Batch Normalization has been found
to be very effective for Convolutional and Feedforward Neural Networks
but hasn’t been successfully applied to Recurrent Neural Networks.
Bidirectional RNN
A Bidirectional Recurrent Neural Network is a type of Neural Network that contains two RNNs
going into different directions. The forward RNN reads the input
sequence from start to end, while the backward RNN reads it from end to
start. The two RNNs are stacked on top of each other and their states
are typically combined by appending the two vectors. Bidirectional RNNs
are often used in Natural Language problems, where we want to take the
context from both before and after a word into account before making a
prediction.
Caffe
is a deep learning framework developed by the Berkeley Vision and
Learning Center. Caffe is particularly popular and performant for vision
tasks and CNN models.
Categorical Cross-Entropy Loss
The categorical cross-entropy loss is also known as the negative log
likelihood. It is a popular loss function for categorization problems
and measures the similarity between two probability distributions,
typically the true labels and the predicted labels. It is given by L = -sum(y * log(y_prediction)) where y is the probability distribution of true labels (typically a one-hot vector) and y_prediction is the probability distribution of the predicted labels, often coming from a softmax.
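For a single example with a one-hot label, the loss looks like this in NumPy:

import numpy as np

y_true = np.array([0.0, 1.0, 0.0])            # one-hot label: class 1
y_pred = np.array([0.2, 0.7, 0.1])            # softmax output of the network
loss = -np.sum(y_true * np.log(y_pred))       # L = -sum(y * log(y_prediction))
print(loss)                                   # -log(0.7), roughly 0.357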
Channel
Input data to Deep Learning models can have multiple channels. The
canonical examples are images, which have red, green and blue color
channels. An image can be represented as a 3-dimensional Tensor with the
dimensions corresponding to channel, height, and width. Natural Language
data can also have multiple channels, in the form of different types of
embeddings for example.
Convolutional Neural Network (CNN, ConvNet)
A CNN uses convolutions to extract features from local regions of an input. Most CNNs contain a combination of convolutional, pooling and affine
layers. CNNs have gained popularity particularly through their
excellent performance on visual recognition tasks, where they have been
setting the state of the art for several years.
Deep Belief Network (DBN)
DBNs are a type of probabilistic graphical model that learn a
hierarchical representation of the data in an unsupervised manner. DBNs
consist of multiple hidden layers with connections between neurons in
each successive pair of layers. DBNs are built by stacking multiple RBMs on top of each other and training them one by one.
DeepDream
A technique invented by Google that tries to distill the knowledge
captured by a deep Convolutional Neural Network. The technique can
generate new images, or transform existing images and give them a
dreamlike flavor, especially when applied recursively.
Dropout
Dropout is a regularization technique for Neural Networks that
prevents overfitting. It prevents neurons from co-adapting by randomly
setting a fraction of them to 0 at each training iteration. Dropout can
be interpreted in various ways, such as randomly sampling from an
exponential number of different networks. Dropout layers first gained
popularity through their use in CNNs, but have since been applied to other layers, including input embeddings or recurrent networks.
Embedding
An embedding maps an input representation, such as a word or
sentence, into a vector. A popular type of embedding is word embeddings
such as word2vec or GloVe.
We can also embed sentences, paragraphs or images. For example, by
mapping images and their textual descriptions into a common embedding
space and minimizing the distance between them, we can match labels with
images. Embeddings can be learned explicitly, such as in word2vec,
or as part of a supervised task, such as Sentiment Analysis. Often, the
input layer of a network is initialized with pre-trained embeddings,
which are then fine-tuned to the task at hand.
Exploding Gradient Problem
The Exploding Gradient Problem is the opposite of the Vanishing Gradient Problem.
In Deep Neural Networks gradients may explode during backpropagation,
resulting in number overflows. A common technique to deal with exploding
gradients is to perform Gradient Clipping.
Fine-Tuning
Fine-Tuning refers to the technique of initializing a network with
parameters from another task (such as an unsupervised training task),
and then updating these parameters based on the task at hand. For
example, NLP architectures often use pre-trained word embeddings like word2vec, and these word embeddings are then updated during training for a specific task like Sentiment Analysis.
Gradient Clipping
Gradient Clipping is a technique to prevent exploding gradients
in very deep networks, typically Recurrent Neural Networks. There exist
various ways to perform gradient clipping, but a common one is to
normalize the gradients of a parameter vector when its L2 norm exceeds a
certain threshold according to new_gradients = gradients * threshold / l2_norm(gradients).
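The clipping rule above in NumPy:

import numpy as np

def clip_by_norm(gradients, threshold=5.0):
    # Rescale the gradient vector only if its L2 norm exceeds the threshold.
    norm = np.linalg.norm(gradients)
    if norm > threshold:
        gradients = gradients * threshold / norm
    return gradients

g = np.array([3.0, 4.0, 12.0])                # L2 norm is 13
print(clip_by_norm(g, threshold=5.0))         # rescaled so the norm becomes 5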
GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations (embeddings)
for words. GloVe vectors serve the same purpose as word2vec but have
different vector representations due to being trained on co-occurrence
statistics.
GoogLeNet
The name of the Convolutional Neural Network architecture that won the ILSVRC 2014 challenge. The network uses Inception modules to reduce the parameters and improve the utilization of the computing resources inside the network.
GRU
The Gated Recurrent Unit is a simplified version of an LSTM unit with
fewer parameters. Just like an LSTM cell, it uses a gating mechanism to
allow RNNs to efficiently learn long-range dependencies by preventing the
vanishing gradient problem.
The GRU consists of a reset and update gate that determine which part
of the old memory to keep vs. update with new values at the current time
step.
A Highway Layer (paper)
is a type of Neural Network layer that uses a gating mechanism to
control the information flow through a layer. Stacking multiple Highway
Layers allows for training of very deep networks. Highway Layers work by
learning a gating function that chooses which parts of the inputs to
pass through and which parts to pass through a transformation function,
such as a standard affine layer for example. The basic formulation of a Highway Layer is T * h(x) + (1 - T) * x, where T is the learned gating function with values between 0 and 1, h(x) is an arbitrary input transformation and x is the input. Note that all of these must have the same size.
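A NumPy sketch of the formulation above, with a sigmoid gate T and a tanh transformation h (weights are random for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 16                                        # input, transform and gate all share this size
x = rng.normal(size=d)
W_h, b_h = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W_t, b_t = rng.normal(size=(d, d)) * 0.1, np.full(d, -1.0)  # bias the gate towards "carry"

T = sigmoid(x @ W_t + b_t)                    # learned gate with values in (0, 1)
h = np.tanh(x @ W_h + b_h)                    # arbitrary transformation of the input
y = T * h + (1 - T) * x                       # highway layer output
print(y.shape)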
ILSVRC
The ImageNet Large Scale Visual Recognition Challenge
evaluates algorithms for object detection and image classification at
large scale. It is the most popular academic challenge in computer
vision. Over the past years, Deep Learning techniques have led to a
significant reduction in error rates, from 30% to less than 5%, beating
human performance on several classification tasks.
Inception Module
Inception Modules are used in Convolutional Neural Networks to allow
for more efficient computation and deeper networks through a
dimensionality reduction with stacked 1×1 convolutions.
Keras
is a Python-based Deep Learning library that includes many high-level
building blocks for deep Neural Networks. It can run on top of either TensorFlow or Theano.
LSTM
Long Short-Term Memory networks were invented to prevent the vanishing gradient problem
in Recurrent Neural Networks by using a memory gating mechanism. Using
LSTM units to calculate the hidden state in an RNN helps the network to
efficiently propagate gradients and learn long-range dependencies.
Max-Pooling
A pooling operation typically used in Convolutional Neural Networks. A
max-pooling layer selects the maximum value from a patch of features.
Just like a convolutional layer, pooling layers are parameterized by a
window (patch) size and stride size. For example, we may slide a window
of size 2×2 over a 10×10 feature matrix using stride size 2, selecting
the max across all 4 values within each window, resulting in a new 5×5
feature matrix. Pooling layers help to reduce the dimensionality of a
representation by keeping only the most salient information, and in the
case of image inputs, they provide basic invariance to translation (the
same maximum values will be selected even if the image is shifted by a
few pixels). Pooling layers are typically inserted between successive
convolutional layers.
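The 2×2, stride-2 example from above in NumPy:

import numpy as np

features = np.arange(100, dtype=float).reshape(10, 10)   # a 10x10 feature map
# Reshape into 5x5 blocks of 2x2 and keep the maximum of each block.
pooled = features.reshape(5, 2, 5, 2).max(axis=(1, 3))
print(pooled.shape)                                      # (5, 5)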
MNIST
The MNIST data set
is perhaps the most commonly used Image Recognition dataset. It
consists of 60,000 training and 10,000 test examples of handwritten
digits. Each image is 28×28 pixels large. State of the art models
typically achieve accuracies of 99.5% or higher on the test set.
Momentum
Momentum is an extension to the Gradient Descent Algorithm that
accelerates or damps the parameter updates. In practice, including a
momentum term in the gradient descent updates leads to better
convergence rates in Deep Networks.
Multilayer Perceptron (MLP)
A Multilayer Perceptron is a Feedforward Neural Network with multiple fully-connected layers that use nonlinear activation functions
to deal with data which is not linearly separable. An MLP is the most
basic form of a multilayer Neural Network, or a deep Neural Network if
it has more than two layers.
Neural Machine Translation (NMT)
An NMT system uses Neural Networks to translate between languages,
such as English and French. NMT systems can be trained end-to-end using
bilingual corpora, which differs from traditional Machine Translation
systems that require hand-crafted features and engineering. NMT systems
are typically implemented using encoder and decoder recurrent neural
networks that encode a source sentence and produce a target sentence,
respectively.
Neural Turing Machine (NTM)
NTMs are Neural Network architectures that can infer simple
algorithms from examples. For example, an NTM may learn a sorting
algorithm through example inputs and outputs. NTMs typically learn some
form of memory and attention mechanism to deal with state during program
execution.
Noise-Contrastive Estimation (NCE)
Noise-contrastive estimation is a sampling loss typically used to
train classifiers with a large output vocabulary. Calculating the softmax
over a large number of possible classes is prohibitively expensive.
Using NCE, we can reduce the problem to a binary classification problem by
training the classifier to discriminate between samples from the “real”
distribution and an artificially generated noise distribution.
Restricted Boltzmann Machine (RBM)
RBMs are a type of probabilistic graphical model that can be
interpreted as a stochastic artificial neural network. RBMs learn a
representation of the data in an unsupervised manner. An RBM consists of
a visible and a hidden layer, and connections between binary neurons in
each of these layers. RBMs can be efficiently trained using Contrastive Divergence, an approximation of gradient descent.
Recurrent Neural Network (RNN)
An RNN models sequential interactions through a hidden state, or
memory. It can take up to N inputs and produce up to N outputs. For
example, an input sequence may be a sentence with the outputs being the
part-of-speech tag for each word (N-to-N). An input could be a sentence,
and the output a sentiment classification of the sentence (N-to-1). An
input could be a single image, and the output could be a sequence of
words corresponding to the description of an image (1-to-N). At each
time step, an RNN calculates a new hidden state (“memory”) based on the
current input and the previous hidden state. The “recurrent” stems from
the fact that at each step the same parameters are used and the network
performs the same calculations based on different inputs.
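A single step of a vanilla RNN in NumPy, to make the recurrence concrete (sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # New hidden state from the current input and the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(20, input_dim)):  # a sequence of 20 inputs
    h = rnn_step(x_t, h)                      # the same parameters at every step
print(h.shape)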
Recursive Neural Network
Recursive Neural Networks are a generalization of Recurrent Neural Networks
to a tree-like structure. The same weights are applied at each
recursion. Just like RNNs, Recursive Neural Networks can be trained
end-to-end using backpropagation. While it is possible to learn the tree
structure as part of the optimization problem, Recursive Neural
Networks are often applied to problems that already have a predefined
structure, like a parse tree in Natural Language Processing.
ReLU
Short for Rectified Linear Unit(s). ReLUs are often used as activation functions in Deep Neural Networks. They are defined by f(x) = max(0, x). The advantages of ReLUs over functions like tanh include that they tend to be sparse (their activation can easily be set to 0), and that they suffer less from the vanishing gradient problem.
ReLUs are the most commonly used activation function in Convolutional
Neural Networks. There exist several variations of ReLUs, such as Leaky ReLUs, Parametric ReLU (PReLU) or a smoother softplus approximation.
ResNet
Deep Residual Networks won the ILSVRC 2015 challenge. These networks
work by introducing shortcut connections across stacks of layers,
allowing the optimizer to learn “easier” residual mappings instead of
the more complicated original mappings. These shortcut connections are
similar to Highway Layers,
but they are data-independent and don’t introduce additional parameters
or training complexity. ResNets achieved a 3.57% error rate on the
ImageNet test set.
RMSProp
RMSProp is a gradient-based optimization algorithm. It is similar to Adagrad, but introduces an additional decay term to counteract Adagrad’s rapid decrease in learning rate.
Sequence-to-Sequence (Seq2Seq)
A Sequence-to-Sequence model reads a sequence (such as a sentence) as
an input and produces another sequence as an output. It differs from a
standard RNN
in that the input sequence is completely read before the network starts
producing any output. Typically, seq2seq models are implemented using
two RNNs, functioning as encoders and decoders. Neural Machine Translation is a typical example of a seq2seq model.
Stochastic Gradient Descent (Wikipedia)
is a gradient-based optimization algorithm that is used to learn
network parameters during the training phase. The gradients are
typically calculated using the backpropagation
algorithm. In practice, people use the minibatch version of SGD, where
the parameter updates are performed based on a batch instead of a single
example, increasing computational efficiency. Many extensions to
vanilla SGD exist, including Momentum, Adagrad, rmsprop, Adadelta or Adam.
Softmax
The softmax function
is typically used to convert a vector of raw scores into class
probabilities at the output layer of a Neural Network used for
classification. It normalizes the scores by exponentiating and dividing
by a normalization constant. If we are dealing with a large number of
classes, a large vocabulary in Machine Translation for example, the
normalization constant is expensive to compute. There exist various
alternatives to make the computation more efficient, including Hierarchical Softmax or using a sampling-based loss such as NCE.
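A NumPy version of the basic softmax (subtracting the max keeps the exponentials from overflowing):

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # for numerical stability
    exps = np.exp(shifted)              # exponentiate the raw scores
    return exps / exps.sum()            # divide by the normalization constant

print(softmax(np.array([2.0, 1.0, 0.1])))   # class probabilities summing to 1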
TensorFlow
TensorFlow
is an open source C++/Python software library for numerical computation
using data flow graphs, particularly Deep Neural Networks. It was
created by Google. In terms of design, it is most similar to Theano, and lower-level than Caffe or Keras.
Theano
Theano
is a Python library that allows you to define, optimize, and evaluate
mathematical expressions. It contains many building blocks for deep
neural networks. Theano is a low-level library similar to Tensorflow. Higher-level libraries include Keras and Caffe.
Vanishing Gradient Problem
The vanishing gradient problem arises in very deep Neural Networks,
typically Recurrent Neural Networks, that use activation functions whose
gradients tend to be small (in the range of 0 to 1). Because these
small gradients are multiplied during backpropagation, they tend to
“vanish” throughout the layers, preventing the network from learning
long-range dependencies. Common ways to counter this problem are to use
activation functions like ReLUs that do not suffer from small gradients, or architectures like LSTMs that explicitly combat vanishing gradients. The opposite of this problem is called the exploding gradient problem.
VGG
VGG refers to a convolutional neural network model that secured
first and second place in the 2014 ImageNet localization and
classification tracks, respectively. The VGG model consists of 16–19
weight layers and uses small convolutional filters of size 3×3 and 1×1.
word2vec
word2vec is an algorithm and tool to learn word embeddings
by trying to predict the context of words in a document. The resulting
word vectors have some interesting properties, for example vector('queen') ~= vector('king') - vector('man') + vector('woman').
Two different objectives can be used to learn these embeddings: The
Skip-Gram objective tries to predict a context from a word, and the
CBOW objective tries to predict a word from its context.