Summary: I learn best with toy code that I can play with. This tutorial teaches DeepMind's Neural Stack machine via a very simple toy example and a short Python implementation. I will also explain my thought process along the way for reading and implementing research papers from scratch, which I hope you will find useful.
I typically tweet out new blogposts when they're complete at @iamtrask. Feel free to follow if you'd be interested in reading more in the future and thanks for all the feedback!
Developing a Face Recognition System Using Convolutional Neural Network
By Ivan Ozhiganov on May 14, 2015
Artificial neural networks have become
an integral part of our lives and are actively being used in many areas
where traditional algorithmic solutions don’t work well or don’t work at
all. Neural networks are commonly used for text recognition, automated
email spam detection, stock market prediction, contextual online
advertising, and more. One of the most significant and promising areas in which this
technology is rapidly evolving is security. For instance, neural
networks can be used to monitor suspicious banking transactions, as well
as in video surveillance systems or CCTV. Azoft R&D team
has experience with such technology: we are currently working on facial
recognition software. According to our client's requirements, the facial recognition system should be sufficiently invariant to a range of lighting conditions, face position in space, and changing facial expressions.
The system works with a security camera and a relatively small
number (dozens) of people, and should be able to consistently distinguish
strangers from the people it has been trained to recognize. Our work on
the solution consisted of three main phases: data preparation, neural
network training, and recognition.
During the preparation phase, we recorded 300 original images of each
person that the system should recognize as "known". Images with
different facial expressions and head positions were transformed to a
normalized view (to minimize position and rotation differences). Based on
this data, we generated a larger set of samples using semi-randomized
variations of distortion and color filters. As a result, there were
1500-2000 samples per person for neural network training. Whenever the
system receives a face sample for recognition, it transforms that
image to match the uniform appearance of the samples that were used for
training.
We tried several ways of normalizing the available data. Our
system detects the facial features (eyes, nose, chin, cheeks) using the
Haar classifier. By analyzing the angle between the eyes, we can
compensate for head rotation and select the area of interest.
Information about head inclination from the previous frame makes it
possible to hold the face region while the subject's head tilts
continuously, minimizing the iterations of pre-rotating the image received
from the camera and sending it to the classifier.
Face Alignment
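As an illustration of this alignment step, here is a minimal sketch using OpenCV's bundled Haar cascades; the cascade file names, detection parameters, and helper name are our own illustrative assumptions, not Azoft's production code.

# Illustrative sketch only: detect a face, estimate head rotation from the
# eye line, and rotate the face region upright. Parameters are assumptions.
import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def align_face(gray):
    """Return the detected face region, rotated so the eyes are level."""
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    roi = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
    if len(eyes) < 2:
        return roi
    (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = eyes[:2]
    p1 = (ex1 + ew1 / 2.0, ey1 + eh1 / 2.0)
    p2 = (ex2 + ew2 / 2.0, ey2 + eh2 / 2.0)
    if p1[0] > p2[0]:
        p1, p2 = p2, p1
    # The angle between the eyes gives the in-plane head rotation to undo.
    angle = np.degrees(np.arctan2(p2[1] - p1[1], p2[0] - p1[0]))
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
    return cv2.warpAffine(roi, M, (w, h))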
After experimenting with color filters, we settled on a method that
normalizes brightness and contrast, finds the average histogram of our
dataset and applies it to each image individually (including the images
used for recognition). Moreover, distortions such as small rotations,
stretching, displacement, and specular (mirror-like) reflection can also
be applied to our images. This approach, combined with normalization,
significantly reduces the system's sensitivity to changes in face
position.
Distorted Images, Reduced to a Single Histogram
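The histogram part of this normalization can be sketched roughly as follows. This is our own illustration of histogram specification against a dataset-average histogram; the function names and binning are chosen for clarity rather than taken from the project.

# Illustrative sketch: compute the average grayscale histogram of a set of
# face crops, then remap each image so its histogram matches that average.
import numpy as np

def average_histogram(images):
    """images: list of uint8 grayscale arrays."""
    hist = np.zeros(256, dtype=np.float64)
    for img in images:
        h, _ = np.histogram(img, bins=256, range=(0, 256))
        hist += h / img.size
    return hist / len(images)

def match_to_histogram(img, target_hist):
    """Remap pixel values of img so its histogram approximates target_hist."""
    src_hist, _ = np.histogram(img, bins=256, range=(0, 256))
    src_cdf = np.cumsum(src_hist).astype(np.float64)
    src_cdf /= src_cdf[-1]
    tgt_cdf = np.cumsum(target_hist)
    tgt_cdf /= tgt_cdf[-1]
    # For each source intensity, pick the target intensity with the closest CDF value.
    lut = np.searchsorted(tgt_cdf, src_cdf).clip(0, 255).astype(np.uint8)
    return lut[img]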
As the tool for the actual recognition process, we used our internal
implementation of a convolutional neural network, one of the most relevant
options available today, which has already proven itself in the field
of image classification and symbol recognition.
There are several advantages to using Convolutional Neural Networks
for our project. First, this type of network is invariant to small
movements. Second, the network extracts a set
of facial characteristics (feature maps) for each class during
training, keeping their relative position in space. We can
change the architecture of the convolutional network, controlling the number
of layers, their size, and the number of feature maps for each layer.
Structure of Our Neural Network
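For readers who want something concrete to play with, a network of this general shape can be sketched in a few lines. The layer counts, kernel sizes, and input resolution below are illustrative guesses written with Keras, not the exact architecture shown above.

# Illustrative sketch only: a small CNN with one output per "known" person
# plus an extra "unknown" class. All sizes here are assumptions.
from tensorflow.keras import layers, models

def build_face_net(num_known_people, input_shape=(64, 64, 1)):
    model = models.Sequential([
        layers.Conv2D(16, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_known_people + 1, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model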
To observe and debug the influence of these parameters on the network's
performance, we implemented output of the images generated by the neural
network during recognition. This approach can be extended by applying a
de-convolutional neural network, a technique that makes it possible to
visualize the contribution of different parts of the input image to
the feature maps.
Fragment of Debug-Output
Our current face recognition system prototype is trained to recognize
faces that are "unknown" to the system, as well as 10 company
employees.
As a training set of "unknown" faces, we used the Labeled Faces in the
Wild database, as well as 6 sets of faces of employees. Testing was
carried out using D-Link DCS 2230 cameras from a distance of 1-4 meters.
According to testing results, our prototype consistently recognizes
“known” faces and successfully distinguishes "strangers" in good
lighting conditions.
However, the neural network is still sensitive to changes in light
direction and overall low light conditions. For more advanced light
compensation, we are planning to generate a 3D face model using a
smaller set of photos and the positions of the main facial features.
The produced set would include renders of this model in a wide variety of
poses and lighting conditions (surface and local shadows) for each
subject, without actually taking thousands of photos.
The Azoft team also plans to improve the technique of face localization
(right now we are using standard OpenCV cascades), as well as to test
a different camera whose recording quality and speed will not
adversely affect facial feature detection and recognition stability.
Another obstacle we need to overcome is the long time necessary to
train a convolutional neural network. Even for modest-sized sets of
data, it may take a whole day to complete the neural network training.
To overcome this shortcoming, we plan to transfer network training to
the GPU. This will significantly reduce the time it takes to access and
analyze results of changes in the architecture and data. This way, we’ll
be able to conduct more experiments in order to determine the optimum
data format and network architecture.
Convolutional Neural Networks for Object Detection
By Ivan Ozhiganov on February 25, 2016
These days there are so many photo and video surveillance systems in
public areas that you would be hard pressed to find someone who hasn’t
heard about them. "Big Brother" is watching us: on the roads, in
airports, at train stations, when we shop in supermarkets or ride the
underground. Remote monitoring technologies and photo and video
capture programs are widespread in everyday life, but they are also
intended for military and other purposes.
This requires constant support and development of
new methods for automatically processing the captured visual data. In
particular, digital image processing often focuses on object detection
in the picture, including localization and recognition of a particular
class of objects. The Azoft R&D team has extensive experience in dealing with similar challenges. Specifically, we have implemented a face recognition system and worked on a project for road sign recognition.
Today we are going to share our experience on how to do object
detection in images using convolutional neural networks. This type of
neural network has successfully proven itself in our past projects.
Project Overview
In the past, the Haar cascade classifier and the LBP-based classifier
were the best tools for detecting objects in images. However, with the introduction of convolutional neural networks and their proven successful application in computer vision, these cascade classifiers have become a second-best alternative.
We decided to test in practice the effectiveness of convolutional
neural networks for object detection in images. As the object of our
research, we chose license plate and road sign pictures.
The goal of our project was to develop a
convolutional neural network model that allows recognition of objects in
images with a higher quality and performance than cascade classifiers.
The task is broken into four stages:
1. License plate keypoints detection using a convolutional neural network
2. Road signs keypoints detection using a convolutional neural network
3. Road signs detection using a fully convolutional neural network
4. Comparing cascade classifiers and a convolutional neural network for the purpose of license plate detection
We implemented the first three stages simultaneously. Each of them
was the responsibility of a single engineer. At the end, we compared the
final convolutional neural network model from the first stage with
cascade classifier methods according to specific parameters.
The performance of various recognition algorithms is particularly
important for contemporary mobile devices. For this reason, we decided
to test neural networks and cascade classifiers using smartphones.
Implementation
Stage 1. License plate keypoints detection using a convolutional neural network
Over the years, we have used machine learning for several research projects, and for image recognition we’ve often used a dataset of license plate numbers
as the learning base. Seeing as our new experiment required the
detection of specific identical objects in images, our license plate
database was perfectly suited to this task.
First, we decided to train the convolutional neural network to find
keypoints of license plates, framing it as a regression
problem. In our case, the pixels of the image were the independent input
parameters, while the coordinates of the object keypoints were the dependent
output parameters. An example of keypoint detection is shown in
Image 1.
Image 1: Detecting keypoints of license plates
Training a convolutional neural network to find keypoints requires a
dataset with a large number of images of the needed object (no fewer than
1000). The coordinates of the keypoints have to be annotated and listed in
the same order for every image.
Our dataset included several hundred images; however, this wasn't
enough for training the network. Therefore, we decided to enlarge the
dataset via augmentation of the available data. Before starting augmentation,
we annotated the keypoints in the images and divided our dataset into
training and control parts.
We applied the following transformations to the initial image during its augmentation:
Shifts
Resizing
Rotations relative to the center
Mirroring
Affine transformations (they allowed us to rotate and stretch a picture).
Besides this, we resized all the images to 320×240 pixels. Take a look
at an example of augmentation with these transformations in Image 2.
Image 2: Data augmentation
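A rough sketch of such an augmentation step is shown below, assuming OpenCV; the transform ranges and the helper name are illustrative, not the exact values used in the project.

# Illustrative sketch: random shift/rotation/scale/mirror applied to an image
# and its keypoints, followed by resizing to the fixed network input size.
import random
import cv2
import numpy as np

def augment(img, keypoints, out_size=(320, 240)):
    """img: H x W image; keypoints: N x 2 array of (x, y) pixel coordinates."""
    h, w = img.shape[:2]
    angle = random.uniform(-10, 10)                               # small rotation, degrees
    scale = random.uniform(0.9, 1.1)                              # slight stretch
    tx, ty = random.uniform(-10, 10), random.uniform(-10, 10)     # shift, pixels
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, scale)
    M[0, 2] += tx
    M[1, 2] += ty
    img = cv2.warpAffine(img, M, (w, h))
    pts = np.hstack([keypoints, np.ones((len(keypoints), 1))])    # homogeneous coordinates
    keypoints = (M @ pts.T).T
    if random.random() < 0.5:                                     # mirror
        img = cv2.flip(img, 1)
        keypoints[:, 0] = w - keypoints[:, 0]
    # Resize to the fixed input size and rescale the keypoints accordingly.
    img = cv2.resize(img, out_size)
    keypoints *= np.array([out_size[0] / w, out_size[1] / h])
    return img, keypoints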
We chose the Caffe framework
for the first stage because it is one of the most flexible and fastest
frameworks for experiments with convolutional neural networks.
One way to solve a regression task in Caffe is to use the HDF5 file
format. After normalizing pixel values to the range 0 to 1 and
coordinate values to the range -1 to 1, we packed the images and keypoint
coordinates into HDF5. You can find more details in our tutorial (see
below).
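For reference, packing the data this way can be sketched with h5py as follows; the dataset names "data" and "label" follow Caffe's HDF5 data layer convention, while the file name and array shapes are illustrative assumptions.

# Illustrative sketch: write images and normalized keypoint coordinates to HDF5.
import h5py
import numpy as np

def write_hdf5(images, keypoints, path="train.h5"):
    """images: (N, H, W) uint8; keypoints: (N, K, 2) pixel coordinates."""
    n, h, w = images.shape
    # Caffe expects N x C x H x W; normalize pixels to [0, 1].
    data = images.reshape(n, 1, h, w).astype(np.float32) / 255.0
    # Normalize coordinates to [-1, 1] and flatten to one label vector per image.
    labels = keypoints.astype(np.float32).copy()
    labels[..., 0] = labels[..., 0] / (w / 2.0) - 1.0
    labels[..., 1] = labels[..., 1] / (h / 2.0) - 1.0
    labels = labels.reshape(n, -1)
    with h5py.File(path, "w") as f:
        f.create_dataset("data", data=data)
        f.create_dataset("label", data=labels)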
At the beginning, we applied large network architectures (from 4 to 6
convolutional layers and a large number of convolution kernels). The
models with big architectures demonstrated good results but very low
performance. For this reason, we decided to set up a simpler neural
network architecture while keeping the quality at the same level.
The final architecture of the convolutional neural network for detecting keypoints of license plates was the following:
Image 3: The architecture of convolutional neural network for detecting the keypoints of license plates
While training the neural network we used the optimization method called Adam.
Compared to Nesterov's accelerated gradient descent, which demonstrated a high
loss value even after thorough selection of the momentum and learning
rate parameters, Adam showed the best results. Using the Adam method,
the convolutional neural network was trained with higher quality and
speed.
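As a rough illustration, an Adam solver configuration can be generated from Python like this; the field names follow caffe.proto, but the hyperparameter values and file names are placeholders, not the settings used in the experiments.

# Illustrative sketch: build a Caffe solver definition that uses Adam.
from caffe.proto import caffe_pb2

solver = caffe_pb2.SolverParameter()
solver.net = "keypoint_net.prototxt"    # hypothetical network definition file
solver.type = "Adam"
solver.base_lr = 0.001
solver.momentum = 0.9                   # Adam's beta1
solver.momentum2 = 0.999                # Adam's beta2
solver.lr_policy = "fixed"
solver.max_iter = 100000
solver.snapshot_prefix = "keypoint_adam"

with open("solver.prototxt", "w") as f:
    f.write(str(solver))                # protobuf text format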
We got a neural network that finds the key points of license plates quite well if the plates are not very close to the borders.
Image 4: Detecting the key points of license plates with the obtained convolutional neural network model
For a deeper understanding of how the convolutional neural network
works, we studied the learned convolution kernels and feature maps
on different layers of the network. In the final model, the
convolution kernels showed that the network had learned to respond to
sharp swings in brightness, which appear at the borders and symbols
of a license plate. The feature maps of the final model are shown in the
images below.
The trained kernels on the first and second layers are in pictures 5 and 6.
Image 5: Obtained kernel of the first convolutional layer 7х7
Image 6: Obtained kernel of the second convolutional layer 5х5
Regarding the feature maps of the final model, we took the car
picture (Image 7), converted it to grayscale, and looked
at the obtained maps after the first (Image 8) and the
second (Image 9) convolutional layers.
Image 7: The car picture that is given to the network for viewing feature maps
Image 8: The feature map after the first convolutional layer
Image 9: The feature maps after the second convolutional layer
Finally, we marked the predicted coordinates of the key points and got the desired image (Image 10).
Image 10: Designated picture after direct passage of convolutional neural network
The convolutional neural network was very effective in detecting the
keypoints of license plates. In the majority of cases, the key points of
the license plate borders were recognized correctly. Therefore, we can
speak highly of the performance of the convolutional neural network. An
example of license plate detection using an iPhone is available in the
video.
Stage 2. Road sign keypoint detection using a convolutional neural network
While training a convolutional neural network to find license plates,
we simultaneously worked on training a similar network to find road
signs with speed limits. We also ran these experiments on the
Caffe framework. For training this model we used a dataset of
nearly 1700 pictures of signs, which were augmented to 35000.
In doing this, we successfully trained a neural network to find road signs in images with a size of 200×200 pixels.
A problem appeared when we tried to detect road signs of different
sizes in the image. For this reason, we ran experiments based
on simplified conditions: the network had to find a white circle against a
dark backdrop. We kept the same image size of 200x200 pixels and
allowed the circle size to vary by up to a factor of 3.
In other words, the minimum and maximum radius of the circles in our
dataset differed from each other by a factor of 3.
Finally, we achieved an acceptable outcome only when the radius of
the circle changed by no more than 10%. Experiments with gray circles (the
gray level ranging from 0.1 to 1) demonstrated the same result.
Thus, how to implement object detection when the objects have
different sizes remains an open question.
Examples from the last successfully tested model are shown in Image 11
as a group of pictures. As we can see in the pictures, the network
learned to distinguish between similar signs. If the network pointed at
the image center, it means it did not find an object.
Image 11: Results of the convolutional neural network learning
Regarding the detection of road signs, the convolutional neural
network demonstrated good results. However, different sizes of objects
became an unexpected obstacle. We plan to come back to searching for a
solution to this problem in future projects, as it requires detailed
research.
Stage 3. Road sign detection using a fully convolutional neural network
Another model that we decided to train to find road signs was a fully
convolutional neural network without fully-connected layers. The fully
convolutional network takes an image of a specific size as input
and transforms it into a smaller image at the output.
In effect, the network is a non-linear filter with resizing. In other
words, the network increases sharpness by removing
noise from particular image areas without smearing edges, but at the cost
of reducing the size of the input image.
In the output image, the brightness value of a pixel is equal to 1
where an object is present, and pixels outside the object have a
brightness value of 0. The brightness value of a pixel
in the output image can therefore be read as the probability that the pixel
belongs to the object.
It is important to consider that the output image is smaller than the
input image. That’s why the object coordinates have to be scaled in
accordance with the output image size.
We chose the following architecture to train the network: 4
convolutional layers, with max-pooling (downsampling by taking the
maximum value) after the first layer. We trained the network using a dataset
based on images of road signs with speed limits. When we
finished training and applied the network to our dataset, we binarized
the output and found the connected components. Every component is a
hypothesis about the sign location. Here are the results:
Image 12: Examples of successfully found road signs
We divided images into 4 groups of 3 pictures (see Image 12). The
left side of the image is the input. The right side of the image is the
output. We labeled the central image, which shows the result we would
like to get ideally.
We applied binarization to the output image with a threshold of
0.5 and found the connected components. As a result, we found the
desired rectangle that designates the location of a road sign. The
rectangle is clearly visible in the left image, with the coordinates scaled
back to the input image size.
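This post-processing step can be sketched as follows, assuming OpenCV; the function name, the small-area cutoff, and the scaling arguments are our own illustrative choices.

# Illustrative sketch: turn the network's probability map into sign boxes.
import cv2
import numpy as np

def extract_boxes(prob_map, scale_x, scale_y, threshold=0.5):
    """prob_map: float array in [0, 1] output by the fully convolutional net.
    scale_x/scale_y: ratio of input size to output size, for rescaling boxes."""
    binary = (prob_map > threshold).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, num):          # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < 4:                 # drop tiny components as noise (illustrative cutoff)
            continue
        # Each connected component is a hypothesis about a sign location,
        # scaled back to input-image coordinates.
        boxes.append((int(x * scale_x), int(y * scale_y),
                      int(w * scale_x), int(h * scale_y)))
    return boxes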
Nevertheless, this method demonstrated both good results and false positives:
Image 13: Examples of false positives
Overall, we evaluate this model positively. The fully convolutional
neural network correctly found about 80% of the signs in the
independent test sample. A special advantage of this model is that
it can find two identical signs and label each with a rectangle. If
there are no signs in the picture, the network won't mark anything.
Stage 4. Comparing cascade classifiers and a convolutional neural network for the purpose of license plate detection
Through our earlier experiments we came to the conclusion that
convolutional neural networks are fully comparable with cascade
classifiers and even outperform them for some parameters.
To evaluate the quality and performance of different methods for
detecting objects in images, we used the following characteristics:
• Level of precision and recall
Both the convolutional neural network and the Haar classifier
demonstrate a high level of precision and recall for detecting objects
in images. At the same time, the LBP classifier shows a high level of
recall (finding the object quite regularly) but also has a high rate of
false positives and low precision (see the sketch after this list for how
these two metrics are computed).
• Scale invariance
Whereas the Haar cascade classifier and the LBP cascade classifier
demonstrate strong invariance to changes in the scale of the object in the
image, the convolutional neural network cannot manage this in some cases,
which shows its low scale invariance.
• Number of attempts before getting a working model
With cascade classifiers, you need only a few attempts to get a working
model for object detection. Convolutional neural networks do not
give a result so quickly; to achieve the goal, you need to perform dozens
of experiments.
• Time for processing
A convolutional neural network does not require much time for
processing, and the LBP classifier also doesn't need a lot of processing
time. As for the Haar classifier, it takes significantly longer to
process an image.
The average time spent on processing one picture in seconds (not counting the time for capturing and displaying video):
• Consistency with tilting objects
Another great advantage of the convolutional neural network is its
consistency with tilted objects. Neither cascade classifier is
consistent with objects that are tilted in the image.
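As promised above, here is a minimal sketch of how precision and recall are computed for a detector from counts of true positives, false positives, and missed objects; the example numbers are placeholders, not measurements from this comparison.

# Illustrative sketch: precision and recall from detection counts.
def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Example (placeholder numbers): a detector that finds 90 of 100 plates
# and raises 10 false alarms.
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=10)
print("precision = %.2f, recall = %.2f" % (p, r))  # precision = 0.90, recall = 0.90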
To summarize, we can say that convolutional neural networks equal
or even outperform cascade classifiers on some parameters. However,
the conditions are that a significant number of
experiments will be required to train the neural network and that the object
scale can't change a lot.
Conclusion
During the process of solving the problem of detecting specific
objects in images, we applied two models based on convolutional neural
networks. The first one is finding the object’s keypoints in images
using a convolutional neural network. The second one is finding the
objects in images via a fully convolutional neural network.
Each of the experiments was very labor-intensive and time-consuming.
Once more we were convinced that the process of training convolutional
neural networks is complicated and requires significant investment to obtain
reliable results of high quality and performance.
The comparative analysis of cascade classifier methods and the
trained convolutional neural network model confirmed our main
hypothesis. The convolutional neural network allows localizing objects
faster and with higher quality than cascade classifiers, provided the object
does not change in scale very much. To solve the problem of low scale
invariance, we will try to increase the number of convolutional layers
in future projects and use a more representative dataset.
If you are interested in the topic of our research project, take a
look at the tutorial we made. Using our tips, you can train a
convolutional neural network to detect the objects in images. We will be
glad to hear from you about your own experiments.
At Instacart, we are revolutionizing the way people buy
groceries. We give busy professionals, parents, and the elderly back valuable
time they don’t have to spend shopping. We also give flexible work
opportunities to thousands of personal shoppers, and we extend the reach
and sales volume for our hundreds of retail partners.
We work incredibly hard to make Instacart easy to use. Our site and
app are intuitive – you fill your shopping cart, pick the hour you want
delivery to occur in, and then the groceries are handed to you at your
doorstep. But achieving this simplicity cost effectively at scale
requires an enormous investment in engineering and data science.
What are a few of the teams where data science plays a critical role at Instacart?
Fulfillment
At its core, Instacart is a real-time logistics platform. We are in
the business of moving goods from A (a store) to B (your front door) as
efficiently and predictably as we can. At any given time, for every
market we operate in, we can have thousands of customers expecting
delivery in the coming hours. We also can have thousands of shoppers
actively working or waiting to be dispatched through our mobile shopper
application.
Our fulfillment algorithm decides in real time how to route those
shoppers to store locations to pick groceries and deliver them to
customers’ doorsteps in as little as one hour. We re-compute this
optimization every minute, because the world is constantly changing. We
have to balance speed (some shoppers shop faster, some stores are less
busy) with efficiency (can we deliver multiple orders simultaneously)
with quality (does the customer get the exact groceries they want) and
timeliness (is the order delivered within the hour it is due – no
earlier, no later).
Example of near-optimal combinations of orders for our drivers to deliver (noise added to addresses to protect privacy)
Optimizing multiple objectives while routing thousands of shoppers
every minute to fulfill millions of orders is a tremendous data science
challenge.
Supply & Demand
Instacart operates a very dynamic and complex fulfillment
marketplace. Our consumers place orders (demand) and our shoppers
fulfill those orders (supply) in as little as an hour. If supply exceeds
demand in a market, we lose money and reduce shopper happiness due to
shoppers sitting idle. If instead demand exceeds supply in a market, we
lose revenue and customers due to limited availability and busy pricing.
Our shoppers work with us to make money, and so they will only be happy
if they’re able to keep busy. On the other side, our customers change
their lifestyles because of our product, and so we need to be there for
them when they want us.
Jeremy and Morgane discussing demand forecasting
Balancing supply and demand requires sophisticated systems for
forecasting customer and shopper behavior down to individual store
locations by hour of day many days into the future. We then create
staffing plans that blend multiple different labor role types to
optimize our efficiency while ensuring high availability for our
customers. This is made even more challenging by the many different
queues we must manage across stores and division of labor. Then in real
time, we have to estimate our capacity for orders every time a user
visits our site or one of our apps, and then dynamically control
availability and busy pricing to smooth demand and create the optimal
customer experience.
These systems operate over multiple time horizons, have to solve for
multiple competing objectives, and control for many erratic sources of
variation (shopper behavior, weather, special events, etc.). We will
always have huge opportunities to make improvements here.
Search & Personalization
Instacart isn’t just grocery delivery, we’re creating a better
grocery shopping experience. A majority of grocery shopping is about
finding the food on your list. In a traditional grocery store, the
search engine is the customer’s two feet. At Instacart, it’s a massive
algorithm that can mine billions of searches to ensure every product a
customer wants is at the edge of their fingertips.
At a physical grocery store, you have to discover new products on
your own. But at Instacart, we can curate the experience for you through
personalization. What could be more personal than food? We have an
intimate relationship with it every day – we put it in our bodies!
As much as movie recommendations were critical to the success of
Netflix, so too are product recommendations critical to Instacart.
The search team – Sharath, Vincent, Raj and Jon from left
Our consumers order large basket sizes of diverse foods over and over
and over again from us. We have more density on our user behavior than
any e-commerce company I have ever seen. We are just beginning to use
that data to provide incredibly valuable personalized experiences for
our users on Instacart in search, in product discovery and in
suggestions we make to our users. We A/B test everything, and are
thinking really hard about the long term impacts of our changes.
Through investments in search and personalization, Instacart has the
opportunity to go beyond convenience in shopping online, and into a
future where everyone finds more food they love faster.
How does data science work at Instacart?
We have made the conscious decision to embed our data scientists into
our product teams, side-by-side with their engineers, designers and
product managers and reporting into the engineering leader for the team.
So to answer this question, you first have to understand how
engineering works at Instacart.
At Instacart, we place a high value on ownership, speed and
ultimately shipping products that have a huge measurable impact. In
engineering, we have organized to optimize for these values. We have
many product teams, each of which have full-stack and/or mobile
developers, designers, analysts, product managers and engineering
leaders dedicated to them. Some teams are only 3 people, others are up
to 10. Each team completely owns their ‘product’, and defines their key
metrics and sets their roadmap weekly.
We align all of these teams to a small (three or fewer) set of
company wide goals that are updated whenever they are achieved or
exceeded. These company goals are concise, measurable and time-bound
objectives set by our board and executive team that the entire company
is committed to. We are obsessively focused on them, and are incredibly
transparent about our status and progress on these goals – our CEO sends
detailed notes on each weekly.
So every product team answers the question every week “what can we do
to have the biggest impact on our company’s goals this week?”. They are
then empowered to do whatever they need to within their
product to achieve those goals. It’s their ideas, their creativity,
their collaboration, their resourcefulness, and their hard work that
really moves the needle.
Jeremy presenting on visualization at a Friday engineering lunch
For technology companies, data science can either be an integral
component to huge value creation, or an expensive and distracting hobby.
Many factors determine the outcome, but how you organize your data
scientists is one of the biggest contributing factors. By embedding our
data scientists into product teams, we’ve ensured that they are as
integral a part of their teams as they can be. As the VP of data
science, it’s my job to make sure that the data scientists stay
connected, have the mentorship they need, and are having the biggest
impact they can within their teams.
The data scientists have a tremendous amount of traction in this
model. Their ideas can directly shape not only product innovation, but
also data collection and infrastructure innovation to fuel future
product ideas. They work directly with their team to bring their
products the ‘last mile’ to production. This lets data scientists put
new ideas into production in days (from inception), and to rapidly
iterate on those ideas as they receive feedback from their consumers.
This also gives data scientists a holistic view of their product, and
helps to ensure they are optimizing for the right objectives as
effectively as possible.
What are some areas Instacart is expecting to invest in data science in the near future?
Shoppers
Our shoppers are very important to our company. They shop for our
customers in the stores, communicate with them live to resolve any
issues, and bring the food to their doorstep thousands of times every
hour. We can use data science to optimize how we on-board these shoppers
and ensure they are successful. We can also optimize and personalize
our shopper application to ensure our shoppers can do their jobs quickly
and effectively.
Partners
In many companies, advertising is a necessary evil. At Instacart, we
have been able to integrate advertising in a way that is a clear win for
the advertiser, for the customer and Instacart! Our Deals
program lets consumer packaged goods companies offer discounts to our
customers (they love them!). Ensuring that customers see the deals they
would be most interested in, and that the advertisers get a high ROI for
their spend is a huge data science opportunity for Instacart.
What do you look for in Data Scientists when recruiting?
Our organizational structure works because we have amazing talent.
You can’t move as fast as we do, with as much distributed ownership as
we have, all while solving challenges like ours without the right
people.
Bala, Mathieu and Sherin discussing batching (from left)
Our values form the cornerstone of our culture, and these in particular are key for hiring data scientists:
Customer Focus
“Everything we do is in service to our customers. We will work
tirelessly to gain the trust of our customers, and to improve their
lives. This is the first priority for everyone at Instacart.”
We seek to understand the problems we work on as holistically as we
can, and to reason through the physics of the system and how our many
constituents (consumers, shoppers, our partners) will experience the
changes we drive. We look for candidates that naturally think about
problems from a “first principles” basis for what is best for the end
user.
Take Ownership
“We will take full ownership of our projects. We take pride in our
work and relentlessly execute to get things completely finished.”
In data science, this means improving algorithms and analyzing data
are never enough. We own the problem, the solution, the implementation
and the measurement – along with everyone else on our team. Simply put,
until the desired impact has been measured, our work isn’t done. We look
for candidates who crave this opportunity for impact.
Sense of Urgency
“We work extremely fast to drive our projects to completion and we will not rest until they are done.”
Many data science teams think about impact in quarters, months or
weeks. Our teams regularly iterate on hard problems in a matter of days –
from R&D to implementation and measurement. We look for candidates
with a bias towards action, and the fortitude to pursue aggressive goals
relentlessly.
Highest Standards
“We put our heart and soul into the projects to deliver the highest
quality work product. We only produce work that we are proud of.”
With ownership and a mandate for urgency comes a great responsibility
– we must maintain the highest standards possible for the work we
produce, as it has the potential to impact millions of consumers,
thousands of shoppers and hundreds of retail partners. We look for
exceptional candidates who can do amazing work, and are always seeking
better ways – be they new algorithms, new processes or new
implementations.
Humility
“We appreciate that great ideas can come from anywhere and we will be
humble and open minded in considering the ideas of others.”
Many of our best data science ideas have come from Instacart
employees in the field – working directly with our shoppers in our
stores, or interacting directly with our customers. Ensuring our eyes
are wide open to these ideas, and that we collaborate openly within our
teams and are always open to questioning our biases and assumptions is
critically important. We look for candidates who are conscious of their
limitations, and always open to the ideas of others – wherever those
ideas may come from.
What Data Science roles is Instacart recruiting for?
We are looking for data scientists with expertise in forecasting, predictive modeling, ads optimization, search and recommendations. We are also looking for operations research scientists
with expertise in planning, logistics and real time control systems.
Our team uses Python, R, SQL (Postgres & Redshift) and Spark
extensively, so mastery of some of those tools and technologies is also
helpful.
Posted by Noah Fiedel, Software Engineer
Machine learning powers many Google product features, from speech recognition in the Google app to Smart Reply in Inbox to search in Google Photos.
While decades of experience have enabled the software industry to
establish best practices for building and supporting products, doing so
for services based upon machine learning introduces new and interesting challenges.
Today, we announce the release of TensorFlow Serving,
designed to address some of these challenges. TensorFlow Serving is a
high performance, open source serving system for machine learning
models, designed for production environments and optimized for TensorFlow.
TensorFlow Serving is ideal for running multiple models, at large scale, that change over time based on real-world data, enabling:
model lifecycle management
experiments with multiple algorithms
efficient use of GPU resources
TensorFlow Serving makes the process of taking a model into
production easier and faster. It allows you to safely deploy new models
and run experiments
while keeping the same server architecture and APIs. Out of the box it
provides integration with TensorFlow, but it can be extended to serve
other types of models.
Here’s how it works. In the simplified, supervised training pipeline
shown below, training data is fed to the learner, which outputs a model:
Once a new model version becomes available, upon validation, it is ready to be deployed to the serving system, as shown below.
TensorFlow
Serving uses the (previously trained) model to perform inference -
predictions based on new data presented by its clients. Since clients
typically communicate with the serving system using a remote procedure call (RPC) interface, TensorFlow Serving comes with a reference front-end implementation based on gRPC, a high performance, open source RPC framework from Google.
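As a rough illustration of such a client, here is a minimal Python sketch against TensorFlow Serving's Predict RPC using the tensorflow-serving-api package; the server address, model name, and input tensor name "x" are assumptions for the example, not details from the announcement.

# Illustrative sketch: a gRPC client sending one Predict request.
# Model name and input tensor name are hypothetical.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                 # hypothetical model name
request.inputs["x"].CopyFrom(
    tf.make_tensor_proto(np.zeros((1, 784), dtype=np.float32)))

response = stub.Predict(request, timeout=5.0)        # RPC to the serving system
print(response.outputs)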
It is quite common to launch and iterate on your model over time, as new
data becomes available, or as you improve the model. In fact, at
Google, many pipelines run continuously, producing new model versions as
new data becomes available.
TensorFlow
Serving is written in C++ and it supports Linux. TensorFlow Serving
introduces minimal overhead. In our benchmarks we recorded ~100,000
queries per second (QPS) per core on a 16 vCPU Intel Xeon E5 2.6 GHz machine, excluding gRPC and the TensorFlow inference processing time.
We are excited to share this important component of TensorFlow today
under the Apache 2.0 open source license. We would love to hear your questions and feature requests on Stack Overflow and GitHub respectively. To get started quickly, clone the code from github.com/tensorflow/serving and check out this tutorial.
You can expect to keep hearing more about TensorFlow as we continue to
develop what we believe to be one of the best machine learning toolboxes
in the world. If you'd like to stay up to date, follow @googleresearch or +ResearchatGoogle, and keep an eye out for Jeff Dean's keynote address at GCP Next 2016 in March.
Researchers
at the University of Zurich, the Università della Svizzera italiana,
and the University of Applied Sciences and Arts of Southern Switzerland
have developed software enabling drones to autonomously detect and
follow forest paths. With the new drones, missing persons can be found
and rescued quickly in forests and mountain areas.
Every year,
thousands of people lose their way in forests and mountain areas. In
Switzerland alone, emergency centers respond to around 1,000 calls
annually from injured and lost hikers. But drones can effectively
complement the work of rescue service teams. Because they are
inexpensive and can be rapidly deployed in large numbers, they
substantially reduce the response time and the risk of injury to missing
persons and rescue teams alike.
A group of researchers from the Dalle Molle Institute for Artificial Intelligence and the University of Zurich has developed artificial intelligence
software to teach a small quadrocopter to autonomously recognize and
follow forest trails. A premiere in the fields of artificial
intelligence and robotics, this success means drones could soon be used
in parallel with rescue teams to accelerate the search for people lost
in the wild.
Breakthrough: Drone Flies Autonomously in Demanding Terrain
"While drones flying at high altitudes are already being used
commercially, drones cannot yet fly autonomously in complex
environments, such as dense forests. In these environments, any little
error may result in a crash, and robots need a powerful brain in order
to make sense of the complex world around them," says Prof. Davide
Scaramuzza from the University of Zurich.
The drone used by the Swiss researchers observes the environment
through a pair of small cameras, similar to those used in smartphones.
Instead of relying on sophisticated sensors, their drone uses very
powerful artificial-intelligence algorithms to interpret the images to
recognize man-made trails. If a trail is visible, the software steers
the drone in the corresponding direction. "Interpreting an image taken
in a complex environment such as a forest is incredibly difficult for a
computer," says Dr. Alessandro Giusti from the Dalle Molle Institute for
Artificial Intelligence. "Sometimes even humans struggle to find the
trail!"
Successful Deep Neural Network Application
The Swiss team solved the problem using a so-called Deep Neural
Network, a computer algorithm that learns to solve complex tasks from a
set of "training examples," much like a brain learns from experience. In
order to gather enough data to "train" their algorithms, the team hiked
several hours along different trails in the Swiss Alps and took more
than 20 thousand images of trails using cameras attached to a helmet.
The effort paid off: When tested on a new, previously unseen trail, the deep neural network
was able to find the correct direction in 85% of cases; in comparison,
humans faced with the same task guessed correctly 82% of the time.
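To make the idea concrete, here is a minimal sketch of the kind of three-way direction classifier described (turn left, go straight, turn right), written with Keras; the layer sizes and input resolution are illustrative assumptions, not the network from the paper.

# Illustrative sketch only: a small CNN that maps a trail image to one of
# three steering directions. All sizes below are assumptions.
from tensorflow.keras import layers, models

def build_trail_net(input_shape=(101, 101, 3)):
    model = models.Sequential([
        layers.Conv2D(32, (4, 4), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (4, 4), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(200, activation="relu"),
        # Three outputs: the trail continues to the left, straight ahead, or right.
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model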
Professor Juergen Schmidhuber, Scientific Director at the Dalle Molle
Institute for Artificial Intelligence says: "Our lab has been working
on deep learning in neural networks since the early 1990s. Today I am
happy to find our lab's methods not only in numerous real-world
applications such as speech recognition on smartphones, but also in
lightweight robots such as drones. Robotics will see an explosion of
applications of deep neural networks in coming years."
The research team warns that much work is still needed before a fully
autonomous fleet will be able to swarm forests in search of missing
people. Professor Luca Maria Gambardella, director of the "Dalle Molle
Institute for Artificial Intelligence" in Lugano remarks: "Many
technological issues must be overcome before the most ambitious
applications can become a reality. But small flying robots are
incredibly versatile, and the field is advancing at an unseen pace. One
day robots will work side by side with human rescuers to make our lives
safer." Prof. Davide Scaramuzza from the University of Zurich adds: "Now
that our drones have learned to recognize and follow forest trails, we must teach them to recognize humans."
More information:
Alessandro Giusti et al. A
Machine Learning Approach to Visual Perception of Forest Trails for
Mobile Robots, IEEE Robotics and Automation Letters (2015). DOI: 10.1109/LRA.2015.2509024
Hearing Heartbeat in Audio and Video: A Deep Learning Project
Monday, February 1, 2016
Last term, I ended up taking three machine learning courses, along
with a couple of others. Out of all of the stuff I did last term, the
biggest and best thing I did was work with a few friends on a group
project. This project was:
Automatic Estimation of Heart rate from Audio and Video
Now this might sound like nonsense to most people: "how can you
hear my heart rate? My heart doesn't beat that loudly! And you can
detect my heart rate from video? Witchcraft!?" In fact, you would be almost right.
Several studies like this one
have shown that it is completely possible. In short: Your heartbeat
affects the nerve that operates your voicebox, and the change in
coloration of a person's face does indeed match a heartbeat (your face
becomes more red when your heart pumps blood to it).
So we weren't trying to analyse facial expression or physical
exertion to estimate heart rate, but rather the physical waveform
in your speech and the facial coloration changes that correspond to a
person's heartbeat.
Why Heartrate?
1. Not Much Work Done in the Area
Currently, all of the heavyweight research into this area of machine
learning comes from companies who want to read their customers better,
rather than help them. This means that detecting emotional state has
already had extensive research and is already being used in software. Microsoft is already selling a service which detects the emotion of people in video or images.
2. Potential Benefit
There are lots more theoretical (medical) uses for analysing speech and video of people's faces. Some people theorise that depression could be detected this way (or that it could at least help doctors with a diagnosis), along with heart arrhythmia and potentially a lot more.
Imagine combining these abilities into one product. In the future a
doctor could use something like this to help detect subtle symptoms in
the way a person presents themselves. A whole new quantification of
emotional wellbeing (which could be accurately recorded) would
completely change how so many areas of medicine work.
Being able to detect heart rate (and other stuff) from speech or video of a person's face has a whole bunch of benefits:
Remote: No one needs to be next to the patient to take measurements.
Unintrusive: No device needs to be placed on the patient's body.
Retroactive: It would be possible to look at past
audio and video recordings (that we have, like, a ton of) to help in
diagnosis. Videos of a patient as a child might reveal signs of a
symptom, or could provide new evidence in a court case.
What We Did
The final product was a website where you could upload a video (with a
person talking in it) and have it streamed back to you in real time
with the heart beat of the person in the video. We used both the audio
and video methods of estimating heart rate to get as good an estimate as
possible. Unfortunately we are not hosting it at the moment, as it
would require some fairly powerful hardware to run the server.
Here is a screenshot of our video streaming service. In retrospect I
think superimposing a big red square over the person's face looks a bit
tacky.
We were given a dataset of people talking about
embarrassing/happy/sad/funny moments of their lives, with their pulse
measured. Naturally this meant that the heart rate of the subject would
change during the session, as they would get more tense or relaxed
depending on what they were saying. Our job was to use this dataset to
train our models.
Most of my work on the project was video analysis, data preparation
and a little bit of help with the neural network training, so I'll show
you the stuff I know about in the project.
Enough talk, let's get to the interesting stuff that we did.
Face Tracker
For video analysis, the aim was to monitor the bloodflow in someone's
face. We did this by looking at the green intensity of the subjects'
face (based on this paper).
In order to do this, you need to track the position of a person's
face so you know which pixels belong to it. This is a bit
more tricky than it sounds: most face trackers out there fail really,
really softly. What that means is that if the face tracker is having a
hard time finding the subject in the video, it usually starts
tracking some blip of dust on the wall instead, which would completely
contaminate our data (watch a face tracker for long enough and you will see this happen).
Also, many of these face trackers are quite processor heavy, which
makes the dream of using this in a real time streaming service very
difficult. Some of these trackers could run at around 10fps on a desktop
machine, which is just way too slow.
So instead we used the fast and crappy face tracker in OpenCV. The
problem with this face tracker is that it is completely memoryless. This
means that it doesn't take into account the previous frames to find the
face, which makes it a very wobbly tracker. Also, even when it looked
like the face tracker was tracking the subject perfectly, it would drop
an odd frame, which would flag up the subject as difficult to track, and
move on to the next one.
To solve the wobbliness, we added a layer on top of the tracker which
uses moving averages for the size and position of the face to smooth
out the tracking. The frame loss was solved by adding a ten-frame drop
limit: if the tracker loses the face in the next frame, it re-uses the
position from the previous frame for collecting the green intensity,
for up to ten consecutive frames.
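As a rough sketch (not our actual code), here is what that smoothing layer could look like on top of OpenCV's Haar cascade detector. The class name, the smoothing weight and the green-intensity helper are all made up for illustration.

import cv2
import numpy as np

class SmoothedFaceTracker:
    def __init__(self, alpha=0.3, max_dropped_frames=10):
        cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
        self.detector = cv2.CascadeClassifier(cascade_path)
        self.alpha = alpha                    # weight of the newest detection
        self.max_dropped_frames = max_dropped_frames
        self.box = None                       # smoothed (x, y, w, h)
        self.dropped = 0

    def update(self, frame_bgr):
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        faces = self.detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) > 0:
            newest = np.asarray(faces[0], dtype=float)
            # Exponential moving average smooths out the per-frame wobble.
            self.box = newest if self.box is None else \
                self.alpha * newest + (1 - self.alpha) * self.box
            self.dropped = 0
        elif self.box is not None and self.dropped < self.max_dropped_frames:
            # Re-use the previous position for up to ten consecutive misses.
            self.dropped += 1
        else:
            self.box = None                   # give up: face genuinely lost
        return None if self.box is None else tuple(self.box.astype(int))

def mean_green_intensity(frame_bgr, box):
    # Average green channel (index 1 in BGR) inside the smoothed face box.
    x, y, w, h = box
    return frame_bgr[y:y + h, x:x + w, 1].mean()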
Here you can see an example of the face tracker working on a subject
in our training data. Note that the green square is where the face tracker
thinks the face is, and the cyan rectangle is the region of pixels,
positioned relative to the tracked face, from which the green intensity
is extracted.
Neural Nets
We used Keras for
building and running our deep neural nets. If you are curious to try some
deep learning, it gives you all the benefits of high-performance CUDA in a
nicely made Python framework.
These were the four different approaches to audio we attempted. For video, we only used the CNN-RNN architecture.
First Idea
We initially tried to apply a similar strategy to Dieleman and Schrauwen, who used spectrograms
of subjects speaking to detect heart rate. For video, we used the time
series of the green intensity of someone's face to create the
spectrograms; here is what the spectrograms looked like for audio and
video.
The idea is that convolutional neural nets are very good at image
classification, so applying those techniques to a spectrogram
might be a good way of detecting heart rate.
This was not very successful for us. Our best result was a product-moment correlation coefficient of 0.68.
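To make the idea concrete, here is a sketch of how a spectrogram can be computed from a green-intensity time series with SciPy. The 30 fps frame rate and the synthetic signal are assumptions for illustration, not our actual data or code.

import numpy as np
from scipy.signal import spectrogram

fps = 30.0                                   # video frame rate (assumed)
t = np.arange(0, 60, 1 / fps)                # one minute of samples
# Synthetic stand-in for the measured green intensity: a 1.2 Hz "pulse"
# (72 bpm) buried in noise.
green = 0.5 * np.sin(2 * np.pi * 1.2 * t) + np.random.randn(t.size)

freqs, times, power = spectrogram(green, fs=fps, nperseg=256, noverlap=192)
# `power` is a (frequency x time) image that a CNN can be trained on; the
# heart rate should show up as a bright band around 1-2 Hz.
print(power.shape, freqs[power[:, 0].argmax()])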
Recurrent Nets
We then tried recurrent neural networks, which proved to be much
better at detecting heart rate, but still not great. For audio we sliced
the spectrograms into small chunks, and for video, we just used the
time series of green intensity of the face. These turned out to be much
more promising, but still not as good as we hoped.
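For flavour, a minimal Keras sketch of this kind of recurrent model is shown below. The window length, layer sizes and synthetic data are illustrative assumptions, not the architecture we actually trained.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

window = 300                                  # 10 s of video at 30 fps (assumed)
x_train = np.random.randn(512, window, 1)     # stand-in for real green-intensity windows
y_train = np.random.uniform(55, 95, size=(512, 1))  # stand-in bpm labels

model = Sequential([
    LSTM(64, input_shape=(window, 1)),        # summarise the sequence
    Dense(1),                                 # predict beats per minute
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)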
Thoughts
Can you detect heart rate from a video of a person?
Sometimes in research based projects the answer can be as simple as a
yes or no, but in our case, we can't really tell. There are so many
factors which could have contaminated the data a little bit, such as
some of the subjects wearing make-up, which would block coloration of
the face coming through. I do believe that this is an avenue of research
that could really do with more people interested in it; it has the
potential to have a lot of commercial and medical impact.
This glossary is work in progress and I am planning to
continuously update it. If you find a mistake or think an important term
is missing, please let me know in the comments or via email.
Deep Learning terminology can be quite overwhelming to newcomers.
This glossary tries to define commonly used terms and link to original
references and additional resources to help readers dive deeper into a
specific topic.
The boundary between what is Deep Learning vs. “general” Machine
Learning terminology is quite fuzzy. I am trying to keep the glossary
specific to Deep Learning, but these decisions are somewhat arbitrary.
For example, I am not including “cross-validation” here because it’s a
generic technique used all across Machine Learning. However, I’ve
decided to include terms such as softmax or word2vec because they are often associated with Deep Learning even though they are not Deep Learning techniques.
Activation Function
To allow Neural Networks to learn complex decision boundaries, we
apply a nonlinear activation function to some of its layers. Commonly
used functions include sigmoid, tanh, ReLU (Rectified Linear Unit) and variants of these.
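For reference, the common activation functions written out in NumPy:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    # Rectified Linear Unit: zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x))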
Adadelta
Adadelta is a gradient descent-based learning algorithm that adapts
the learning rate per parameter over time. It was proposed as an
improvement over Adagrad, which is more sensitive to hyperparameters and may decrease the learning rate too aggressively. Adadelta is similar to rmsprop and can be used instead of vanilla SGD.
Adagrad
Adagrad is an adaptive learning rate algorithm that keeps track of
the squared gradients over time and automatically adapts the learning
rate per parameter. It can be used instead of vanilla SGD and is
particularly helpful for sparse data, where it assigns a higher learning
rate to infrequently updated parameters.
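A bare-bones sketch of the Adagrad update rule in NumPy (not a drop-in optimizer) makes the per-parameter adaptation concrete:

import numpy as np

def adagrad_update(params, grads, cache, lr=0.01, eps=1e-8):
    cache += grads ** 2                              # accumulate squared gradients
    params -= lr * grads / (np.sqrt(cache) + eps)    # bigger steps for rarely-updated params
    return params, cache

params = np.zeros(3)
cache = np.zeros(3)
for _ in range(100):
    grads = np.array([1.0, 0.1, 0.0])                # one frequent, one rare, one silent feature
    params, cache = adagrad_update(params, grads, cache)
print(params)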
Adam
Adam is an adaptive learning rate algorithm similar to rmsprop, but updates are
directly estimated using a running average of the first and second
moment of the gradient and also include a bias correction term.
Affine Layer
A fully-connected layer in a Neural Network. Affine means that each
neuron in the previous layer is connected to each neuron in the current
layer. In many ways, this is the “standard” layer of a Neural Network.
Affine layers are often added on top of the outputs of Convolutional Neural Networks or Recurrent Neural Networks before making a final prediction. An affine layer is typically of the form y = f(Wx + b), where x are the layer inputs, W the parameters, b a bias vector, and f a nonlinear activation function.
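The formula above, spelled out in NumPy with ReLU as the nonlinearity f (shapes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))         # batch of 32 inputs, 128 features each
W = rng.normal(size=(128, 64)) * 0.01  # weights mapping 128 -> 64 units
b = np.zeros(64)                       # bias vector

y = np.maximum(0.0, x @ W + b)         # f(Wx + b) with f = ReLU
print(y.shape)                         # (32, 64)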
Attention Mechanism
Attention Mechanisms are inspired by human visual attention, the
ability to focus on specific parts of an image. Attention mechanisms can
be incorporated in both Language Processing and Image Recognition
architectures to help the network learn what to “focus” on when making
predictions.
Alexnet
Alexnet is the name of the Convolutional Neural Network architecture
that won the ILSVRC 2012 competition by a large margin and was
responsible for a resurgence of interest in CNNs for Image Recognition.
It consists of five convolutional layers, some of which are followed by
max-pooling layers, and three fully-connected layers with a final
1000-way softmax. Alexnet was introduced in ImageNet Classification with Deep Convolutional Neural Networks.
Autoencoder
An Autoencoder is a Neural Network model whose goal is to predict the
input itself, typically through a “bottleneck” somewhere in the
network. By introducing a bottleneck, we force the network to learn a
lower-dimensional representation of the input, effectively compressing
the input into a good representation. Autoencoders are related to PCA
and other dimensionality reduction techniques, but can learn more
complex mappings due to their nonlinear nature. A wide range of
autoencoder architectures exist, including Denoising Autoencoders, Variational Autoencoders, or Sequence Autoencoders.
Average-Pooling
Average-Pooling is a pooling
technique used in Convolutional Neural Networks for Image Recognition.
It works by sliding a window over patches of features, such as pixels,
and taking the average of all values within the window. It compresses
the input representation into a lower-dimensional representation.
Backpropagation
Backpropagation is an algorithm to efficiently calculate the
gradients in a Neural Network, or more generally, a feedforward
computational graph. It boils down to applying the chain rule of
differentiation starting from the network output and propagating the
gradients backward. The first uses of backpropagation go back to Vapnik
in the 1960’s, but Learning representations by back-propagating errors is often cited as the source.
Backpropagation Through Time (paper) is the Backpropagation algorithm
applied to Recurrent Neural Networks (RNNs). BPTT can be seen as the
standard backpropagation algorithm applied to an RNN, where each time
step represents a layer and the parameters are shared across layers.
Because an RNN shares the same parameters across all time steps, the
errors at one time step must be backpropagated “through time” to all
previous time steps, hence the name. When dealing with long sequences
(hundreds of inputs), a truncated version of BPTT is often used to
reduce the computational cost. Truncated BPTT stops backpropagating the
errors after a fixed number of steps.
Batch Normalization
Batch Normalization is a technique that normalizes layer inputs per
mini-batch. It speeds up training, allows for the use of higher learning
rates, and can act as a regularizer. Batch Normalization has been found
to be very effective for Convolutional and Feedforward Neural Networks
but hasn’t been successfully applied to Recurrent Neural Networks.
Bidirectional RNN
A Bidirectional Recurrent Neural Network is a type of Neural Network that contains two RNNs
going into different directions. The forward RNN reads the input
sequence from start to end, while the backward RNN reads it from end to
start. The two RNNs are stacked on top of each other and their states
are typically combined by appending the two vectors. Bidirectional RNNs
are often used in Natural Language problems, where we want to take the
context from both before and after a word into account before making a
prediction.
Caffe
is a deep learning framework developed by the Berkeley Vision and
Learning Center. Caffe is particularly popular and performant for vision
tasks and CNN models.
Categorical Cross-Entropy Loss
The categorical cross-entropy loss is also known as the negative log
likelihood. It is a popular loss function for categorization problems
and measures the similarity between two probability distributions,
typically the true labels and the predicted labels. It is given by L = -sum(y * log(y_prediction)) where y is the probability distribution of true labels (typically a one-hot vector) and y_prediction is the probability distribution of the predicted labels, often coming from a softmax.
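For a single example with a one-hot label, the loss looks like this in NumPy:

import numpy as np

y_true = np.array([0.0, 1.0, 0.0])            # one-hot label: class 1
y_pred = np.array([0.2, 0.7, 0.1])            # softmax output of the network
loss = -np.sum(y_true * np.log(y_pred))       # L = -sum(y * log(y_prediction))
print(loss)                                   # -log(0.7), roughly 0.357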
Channel
Input data to Deep Learning models can have multiple channels. The
canonical examples are images, which have red, green and blue color
channels. An image can be represented as a 3-dimensional Tensor with the
dimensions corresponding to channel, height, and width. Natural Language
data can also have multiple channels, in the form of different types of
embeddings for example.
Convolutional Neural Network (CNN, ConvNet)
A CNN uses convolutions to extract features from local regions of an input. Most CNNs contain a combination of convolutional, pooling and affine
layers. CNNs have gained popularity particularly through their
excellent performance on visual recognition tasks, where they have been
setting the state of the art for several years.
Deep Belief Network (DBN)
DBNs are a type of probabilistic graphical model that learn a
hierarchical representation of the data in an unsupervised manner. DBNs
consist of multiple hidden layers with connections between neurons in
each successive pair of layers. DBNs are built by stacking multiple RBMs on top of each other and training them one by one.
DeepDream
A technique invented by Google that tries to distill the knowledge
captured by a deep Convolutional Neural Network. The technique can
generate new images, or transform existing images and give them a
dreamlike flavor, especially when applied recursively.
Dropout
Dropout is a regularization technique for Neural Networks that
prevents overfitting. It prevents neurons from co-adapting by randomly
setting a fraction of them to 0 at each training iteration. Dropout can
be interpreted in various ways, such as randomly sampling from an
exponential number of different networks. Dropout layers first gained
popularity through their use in CNNs, but have since been applied to other layers, including input embeddings or recurrent networks.
Embedding
An embedding maps an input representation, such as a word or
sentence, into a vector. A popular type of embedding is word embeddings
such as word2vec or GloVe.
We can also embed sentences, paragraphs or images. For example, by
mapping images and their textual descriptions into a common embedding
space and minimizing the distance between them, we can match labels with
images. Embeddings can be learned explicitly, such as in word2vec,
or as part of a supervised task, such as Sentiment Analysis. Often, the
input layer of a network is initialized with pre-trained embeddings,
which are then fine-tuned to the task at hand.
Exploding Gradient Problem
The Exploding Gradient Problem is the opposite of the Vanishing Gradient Problem.
In Deep Neural Networks gradients may explode during backpropagation,
resulting in number overflows. A common technique to deal with exploding
gradients is to perform Gradient Clipping.
Fine-Tuning
Fine-Tuning refers to the technique of initializing a network with
parameters from another task (such as an unsupervised training task),
and then updating these parameters based on the task at hand. For
example, NLP architectures often use pre-trained word embeddings like word2vec, and these word embeddings are then updated during training for a specific task like Sentiment Analysis.
Gradient Clipping
Gradient Clipping is a technique to prevent exploding gradients
in very deep networks, typically Recurrent Neural Networks. There exist
various ways to perform gradient clipping, but a common one is to
normalize the gradients of a parameter vector when its L2 norm exceeds a
certain threshold according to new_gradients = gradients * threshold / l2_norm(gradients).
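The clipping rule above in NumPy:

import numpy as np

def clip_by_norm(gradients, threshold=5.0):
    # Rescale the gradient vector only if its L2 norm exceeds the threshold.
    norm = np.linalg.norm(gradients)
    if norm > threshold:
        gradients = gradients * threshold / norm
    return gradients

g = np.array([3.0, 4.0, 12.0])                # L2 norm is 13
print(clip_by_norm(g, threshold=5.0))         # rescaled so the norm becomes 5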
GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations (embeddings)
for words. GloVe vectors serve the same purpose as word2vec but have
different vector representations due to being trained on co-occurrence
statistics.
GoogLeNet
The name of the Convolutional Neural Network architecture that won the ILSVRC 2014 challenge. The network uses Inception modules to reduce the parameters and improve the utilization of the computing resources inside the network.
GRU
The Gated Recurrent Unit is a simplified version of an LSTM unit with
fewer parameters. Just like an LSTM cell, it uses a gating mechanism to
allow RNNs to efficiently learn long-range dependencies by preventing the
vanishing gradient problem.
The GRU consists of a reset and update gate that determine which part
of the old memory to keep vs. update with new values at the current time
step.
A Highway Layer (paper)
is a type of Neural Network layer that uses a gating mechanism to
control the information flow through a layer. Stacking multiple Highway
Layers allows for training of very deep networks. Highway Layers work by
learning a gating function that chooses which parts of the inputs to
pass through and which parts to pass through a transformation function,
such as a standard affine layer for example. The basic formulation of a Highway Layer is T * h(x) + (1 - T) * x, where T is the learned gating function with values between 0 and 1, h(x) is an arbitrary input transformation and x is the input. Note that all of these must have the same size.
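A NumPy sketch of the formulation above, with a sigmoid gate T and a tanh transformation h (weights are random for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 16                                        # input, transform and gate all share this size
x = rng.normal(size=d)
W_h, b_h = rng.normal(size=(d, d)) * 0.1, np.zeros(d)
W_t, b_t = rng.normal(size=(d, d)) * 0.1, np.full(d, -1.0)  # bias the gate towards "carry"

T = sigmoid(x @ W_t + b_t)                    # learned gate with values in (0, 1)
h = np.tanh(x @ W_h + b_h)                    # arbitrary transformation of the input
y = T * h + (1 - T) * x                       # highway layer output
print(y.shape)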
ILSVRC
The ImageNet Large Scale Visual Recognition Challenge
evaluates algorithms for object detection and image classification at
large scale. It is the most popular academic challenge in computer
vision. Over the past years, Deep Learning techniques have led to a
significant reduction in error rates, from 30% to less than 5%, beating
human performance on several classification tasks.
Inception Module
Inception Modules are used in Convolutional Neural Networks to allow
for more efficient computation and deeper networks through a
dimensionality reduction with stacked 1×1 convolutions.
Keras
is a Python-based Deep Learning library that includes many high-level
building blocks for deep Neural Networks. It can run on top of either TensorFlow or Theano.
LSTM
Long Short-Term Memory networks were invented to prevent the vanishing gradient problem
in Recurrent Neural Networks by using a memory gating mechanism. Using
LSTM units to calculate the hidden state in an RNN helps the network to
efficiently propagate gradients and learn long-range dependencies.
Max-Pooling
A pooling operation typically used in Convolutional Neural Networks. A
max-pooling layer selects the maximum value from a patch of features.
Just like a convolutional layer, pooling layers are parameterized by a
window (patch) size and stride size. For example, we may slide a window
of size 2×2 over a 10×10 feature matrix using stride size 2, selecting
the max across all 4 values within each window, resulting in a new 5×5
feature matrix. Pooling layers help to reduce the dimensionality of a
representation by keeping only the most salient information, and in the
case of image inputs, they provide basic invariance to translation (the
same maximum values will be selected even if the image is shifted by a
few pixels). Pooling layers are typically inserted between successive
convolutional layers.
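The 2×2, stride-2 example from above in NumPy:

import numpy as np

features = np.arange(100, dtype=float).reshape(10, 10)   # a 10x10 feature map
# Reshape into 5x5 blocks of 2x2 and keep the maximum of each block.
pooled = features.reshape(5, 2, 5, 2).max(axis=(1, 3))
print(pooled.shape)                                      # (5, 5)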
MNIST
The MNIST data set
is perhaps the most commonly used Image Recognition dataset. It
consists of 60,000 training and 10,000 test examples of handwritten
digits. Each image is 28×28 pixels large. State of the art models
typically achieve accuracies of 99.5% or higher on the test set.
Momentum
Momentum is an extension to the Gradient Descent Algorithm that
accelerates or damps the parameter updates. In practice, including a
momentum term in the gradient descent updates leads to better
convergence rates in Deep Networks.
Multilayer Perceptron (MLP)
A Multilayer Perceptron is a Feedforward Neural Network with multiple fully-connected layers that use nonlinear activation functions
to deal with data which is not linearly separable. An MLP is the most
basic form of a multilayer Neural Network, or a deep Neural Network if
it has more than two layers.
Neural Machine Translation (NMT)
An NMT system uses Neural Networks to translate between languages,
such as English and French. NMT systems can be trained end-to-end using
bilingual corpora, which differs from traditional Machine Translation
systems that require hand-crafted features and engineering. NMT systems
are typically implemented using encoder and decoder recurrent neural
networks that encode a source sentence and produce a target sentence,
respectively.
Neural Turing Machine (NTM)
NTMs are Neural Network architectures that can infer simple
algorithms from examples. For example, an NTM may learn a sorting
algorithm through example inputs and outputs. NTMs typically learn some
form of memory and attention mechanism to deal with state during program
execution.
Noise-Contrastive Estimation (NCE)
Noise-contrastive estimation is a sampling loss typically used to
train classifiers with a large output vocabulary. Calculating the softmax
over a large number of possible classes is prohibitively expensive.
Using NCE, we can reduce the problem to a binary classification problem by
training the classifier to discriminate between samples from the “real”
distribution and an artificially generated noise distribution.
Restricted Boltzmann Machine (RBM)
RBMs are a type of probabilistic graphical model that can be
interpreted as a stochastic artificial neural network. RBMs learn a
representation of the data in an unsupervised manner. An RBM consists of
a visible and a hidden layer, and connections between binary neurons in
each of these layers. RBMs can be efficiently trained using Contrastive Divergence, an approximation of gradient descent.
Recurrent Neural Network (RNN)
An RNN models sequential interactions through a hidden state, or
memory. It can take up to N inputs and produce up to N outputs. For
example, an input sequence may be a sentence with the outputs being the
part-of-speech tag for each word (N-to-N). An input could be a sentence,
and the output a sentiment classification of the sentence (N-to-1). An
input could be a single image, and the output could be a sequence of
words corresponding to the description of an image (1-to-N). At each
time step, an RNN calculates a new hidden state (“memory”) based on the
current input and the previous hidden state. The “recurrent” stems from
the fact that at each step the same parameters are used and the network
performs the same calculations based on different inputs.
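A single step of a vanilla RNN in NumPy, to make the recurrence concrete (sizes are illustrative):

import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # New hidden state from the current input and the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(20, input_dim)):  # a sequence of 20 inputs
    h = rnn_step(x_t, h)                      # the same parameters at every step
print(h.shape)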
Recursive Neural Network
Recursive Neural Networks are a generalization of Recurrent Neural Networks
to a tree-like structure. The same weights are applied at each
recursion. Just like RNNs, Recursive Neural Networks can be trained
end-to-end using backpropagation. While it is possible to learn the tree
structure as part of the optimization problem, Recursive Neural
Networks are often applied to problems that already have a predefined
structure, like a parse tree in Natural Language Processing.
ReLU
Short for Rectified Linear Unit(s). ReLUs are often used as activation functions in Deep Neural Networks. They are defined by f(x) = max(0, x). The advantages of ReLUs over functions like tanh include that they tend to be sparse (their activation can easily be set to 0), and that they suffer less from the vanishing gradient problem.
ReLUs are the most commonly used activation function in Convolutional
Neural Networks. There exist several variations of ReLUs, such as Leaky ReLUs, Parametric ReLU (PReLU) or a smoother softplus approximation.
ResNet
Deep Residual Networks won the ILSVRC 2015 challenge. These networks
work by introducing shortcut connections across stacks of layers,
allowing the optimizer to learn “easier” residual mappings instead of
the more complicated original mappings. These shortcut connections are
similar to Highway Layers,
but they are data-independent and don’t introduce additional parameters
or training complexity. ResNets achieved a 3.57% error rate on the
ImageNet test set.
RMSProp
RMSProp is a gradient-based optimization algorithm. It is similar to Adagrad, but introduces an additional decay term to counteract Adagrad’s rapid decrease in learning rate.
Sequence-to-Sequence (Seq2Seq)
A Sequence-to-Sequence model reads a sequence (such as a sentence) as
an input and produces another sequence as an output. It differs from a
standard RNN
in that the input sequence is completely read before the network starts
producing any output. Typically, seq2seq models are implemented using
two RNNs, functioning as encoders and decoders. Neural Machine Translation is a typical example of a seq2seq model.
Stochastic Gradient Descent (Wikipedia)
is a gradient-based optimization algorithm that is used to learn
network parameters during the training phase. The gradients are
typically calculated using the backpropagation
algorithm. In practice, people use the minibatch version of SGD, where
the parameter updates are performed based on a batch instead of a single
example, increasing computational efficiency. Many extensions to
vanilla SGD exist, including Momentum, Adagrad, rmsprop, Adadelta or Adam.
Softmax
The softmax function
is typically used to convert a vector of raw scores into class
probabilities at the output layer of a Neural Network used for
classification. It normalizes the scores by exponentiating and dividing
by a normalization constant. If we are dealing with a large number of
classes, a large vocabulary in Machine Translation for example, the
normalization constant is expensive to compute. There exist various
alternatives to make the computation more efficient, including Hierarchical Softmax or using a sampling-based loss such as NCE.
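A NumPy version of the basic softmax (subtracting the max keeps the exponentials from overflowing):

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # for numerical stability
    exps = np.exp(shifted)              # exponentiate the raw scores
    return exps / exps.sum()            # divide by the normalization constant

print(softmax(np.array([2.0, 1.0, 0.1])))   # class probabilities summing to 1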
TensorFlow
TensorFlow
is an open source C++/Python software library for numerical computation
using data flow graphs, particularly Deep Neural Networks. It was
created by Google. In terms of design, it is most similar to Theano, and lower-level than Caffe or Keras.
Theano
Theano
is a Python library that allows you to define, optimize, and evaluate
mathematical expressions. It contains many building blocks for deep
neural networks. Theano is a low-level library similar to Tensorflow. Higher-level libraries include Keras and Caffe.
Vanishing Gradient Problem
The vanishing gradient problem arises in very deep Neural Networks,
typically Recurrent Neural Networks, that use activation functions whose
gradients tend to be small (in the range of 0 to 1). Because these
small gradients are multiplied during backpropagation, they tend to
“vanish” throughout the layers, preventing the network from learning
long-range dependencies. Common ways to counter this problem are to use
activation functions like ReLUs that do not suffer from small gradients, or architectures like LSTMs that explicitly combat vanishing gradients. The opposite of this problem is called the exploding gradient problem.
VGG
VGG refers to a convolutional neural network model that secured
first and second place in the 2014 ImageNet localization and
classification tracks, respectively. The VGG model consists of 16–19
weight layers and uses small convolutional filters of size 3×3 and 1×1.
word2vec
word2vec is an algorithm and tool to learn word embeddings
by trying to predict the context of words in a document. The resulting
word vectors have some interesting properties, for example vector('queen') ~= vector('king') - vector('man') + vector('woman').
Two different objectives can be used to learn these embeddings: The
Skip-Gram objective tries to predict a context from a word, and the
CBOW objective tries to predict a word from its context.