Sunday, July 29, 2018

Differentiable Image Parameterizations

https://distill.pub/2018/differentiable-parameterizations/

Neural networks trained to classify images have a remarkable — and surprising! — capacity to generate images. Techniques such as DeepDream [1], style transfer [2], and feature visualization [3] leverage this capacity as a powerful tool for exploring the inner workings of neural networks and to fuel a small artistic movement based on neural art. All these techniques work in roughly the same way. Neural networks used in computer vision have a rich internal representation of the images they look at. We can use this representation to describe the properties we want an image to have (e.g. style), and then optimize the input image to have those properties. This kind of optimization is possible because the networks are differentiable with respect to their inputs: we can slightly tweak the image to better fit the desired properties, and then iteratively apply such tweaks in gradient descent.
Typically, we parameterize the input image as the RGB values of each pixel, but that isn’t the only way. As long as the mapping from parameters to images is differentiable, we can still optimize alternative parameterizations with gradient descent.
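As a minimal sketch of that idea (my own toy example, not the article's code), the snippet below optimizes a low-resolution latent grid whose differentiable upsampling produces the image; the mean-squared-error objective is a placeholder for a real feature-visualization loss built from a network's activations.

```python
# A minimal sketch of optimizing an alternative image parameterization by
# gradient descent. The "image" is produced by a differentiable mapping from a
# low-resolution latent grid; the target-matching loss is only a stand-in for
# an activation-based objective.
import torch
import torch.nn.functional as F

target = torch.rand(1, 3, 64, 64)                      # placeholder objective target
latent = torch.randn(1, 3, 8, 8, requires_grad=True)   # the parameters we optimize
optimizer = torch.optim.Adam([latent], lr=0.1)

for step in range(300):
    # Differentiable mapping from parameters to an image in [0, 1].
    image = torch.sigmoid(F.interpolate(latent, size=64, mode="bilinear",
                                        align_corners=False))
    loss = F.mse_loss(image, target)     # any differentiable objective works here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```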

Friday, July 27, 2018

Thoughts On Machine Learning Accuracy

https://aws.amazon.com/blogs/aws/thoughts-on-machine-learning-accuracy/

This blog shares some brief thoughts on machine learning accuracy and bias.
Let’s start with some comments about a recent ACLU blog in which they run a facial recognition trial. Using Rekognition, the ACLU built a face database using 25,000 publicly available arrest photos and then performed facial similarity searches of that database using public photos of all current members of Congress. They found 28 incorrect matches out of 535, using an 80% confidence level; this is a 5% misidentification (sometimes called ‘false positive’) rate and a 95% accuracy rate. The ACLU has not published its data set, methodology, or results in detail, so we can only go on what they’ve publicly said. But, here are some thoughts on their claims:
The default confidence threshold for facial recognition APIs in Rekognition is 80%, which is good for a broad set of general use cases (such as identifying celebrities on social media or family members who look alike in a photos app), but it’s not the right one for public safety use cases. The 80% confidence threshold used by the ACLU is far too low to ensure the accurate identification of individuals; we would expect to see false positives at this level of confidence. We recommend 99% for use cases where highly accurate face similarity matches are important (as indicated in our public documentation).
1. To illustrate the impact of confidence threshold on false positives, we ran a test where we created a face collection using a dataset of over 850,000 faces commonly used in academia. We then used public photos of all members of US Congress (the Senate and House) to search against this collection in a similar way to the ACLU blog.
When we set the confidence threshold at 99% (as we recommend in our documentation), our misidentification rate dropped to 0% despite the fact that we are comparing against a larger corpus of faces (30x larger than ACLU’s tests). This illustrates how important it is for those using technology to help with public safety issues to pick appropriate confidence levels, so they have few (if any) false positives.
2. In real-world public safety and law enforcement scenarios, Amazon Rekognition is almost exclusively used to help narrow the field and allow humans to expeditiously review and consider options using their judgment (and not to make fully autonomous decisions), where it can help find lost children, fight human trafficking, or prevent crimes. Rekognition is generally only the first step in identifying an individual. In other use cases (such as social media), there isn’t the same need to double-check, so confidence thresholds can be lower.
3. In addition to setting the confidence threshold far too low, the Rekognition results can be significantly skewed by using a facial database that is not appropriately representative and is itself skewed. In this case, the ACLU used a facial database of mugshots that may have had a material impact on the accuracy of the Rekognition findings.
4. The advantage of a cloud-based machine learning application like Rekognition is that it is constantly improving as we continue to improve the algorithm with more data. Our customers immediately get the benefit of those improvements. We continue to focus on our mission of making Rekognition the most accurate and powerful tool for identifying people, objects, and scenes – and that certainly includes ensuring that the results are free of any bias that impacts accuracy.  We’ve been able to add a lot of value for customers and the world at large already with Rekognition in the fight against human trafficking, reuniting lost children with their families, reducing fraud for mobile payments, and improving security, and we’re excited about continuing to help our customers and society at large with Rekognition in the future.
5. There is a general misconception that people can match faces to photos better than machines. In fact, the National Institute of Standards and Technology (“NIST”) recently shared a study of facial recognition technologies that are at least two years behind the state of the art used in Rekognition and concluded that even those older technologies can outperform human facial recognition abilities.
A final word about the misinterpreted ACLU results. When there are new technological advances, we all have to clearly understand what’s real and what’s not. There’s a difference between using machine learning to identify a food object and using machine learning to determine whether a face match should warrant considering any law enforcement action. The latter is serious business and requires much higher confidence levels. We continue to recommend that customers do not use less than 99% confidence levels for law enforcement matches, and then only use the matches as one input among others that make sense for each agency. Machine learning is a very valuable tool to help law enforcement agencies, and while we should be concerned that it is applied correctly, we should not throw away the oven because the temperature could be set wrong and burn the pizza. It is a very reasonable idea, however, for the government to weigh in and specify what temperature (or confidence levels) it wants law enforcement agencies to meet to assist in their public safety work.
Dr. Matt Wood


Neural Networks Are Essentially Polynomial Regression

https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/

You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a postdoc in biology at Stanford; and Pete Mohanty, a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.
A summary of the paper is:
  • We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
  • NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
  • One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
  • NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
  • We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.
  • We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.
Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NN, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.
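The authors' polyreg package is written in R; as a minimal sketch of the same comparison in Python (not their code; dataset and polynomial degree are arbitrary illustrative choices), one can pit ridge-regularized polynomial regression against a small neural network with scikit-learn:

```python
# A minimal sketch comparing polynomial regression to a small fully-connected
# neural network on the same regression task.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Degree-2 polynomial regression, ridge-regularized to tame multicollinearity.
pr = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), Ridge(alpha=1.0))
pr.fit(X_train, y_train)

# A small neural network for comparison.
nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500,
                                random_state=0))
nn.fit(X_train, y_train)

print("PR R^2:", pr.score(X_test, y_test))
print("NN R^2:", nn.score(X_test, y_test))
```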
Also available are the slides for our presentation at GRAIL on this project.

Wednesday, July 18, 2018

Foundations of Machine Learning

https://bloomberg.github.io/foml/#about

Understand the Concepts, Techniques and Mathematical Frameworks Used by Experts in Machine Learning

About This Course

Bloomberg presents "Foundations of Machine Learning," a training course that was initially delivered internally to the company's software engineers as part of its "Machine Learning EDU" initiative. This course covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers and financial professionals.
The 30 lectures in the course are embedded below, but may also be viewed in this YouTube playlist. The course includes a complete set of homework assignments, each containing a theoretical element and implementation challenge with support code in Python, which is rapidly becoming the prevailing programming language for data science and machine learning in both academia and industry. This course also serves as a foundation on which more specialized courses and further independent study can build.
Check back soon for how to register for our Piazza discussion board. Common questions from previous editions of the course are posted in our FAQ.
The first lecture, Black Box Machine Learning, gives a quick start introduction to practical machine learning and only requires familiarity with basic programming concepts.

Prerequisites

The quickest way to see if the mathematics level of the course is for you is to take a look at this mathematics assessment, which is a preview of some of the math concepts that show up in the first part of the course.
  • Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate differential calculus, probability theory, and statistics. The content of NYU's DS-GA-1002: Statistical and Mathematical Methods would be more than sufficient, for example.
  • Python programming required for most homework assignments.
  • Recommended: At least one advanced, proof-based mathematics course
  • Recommended: Computer science background up to a "data structures and algorithms" course


Thursday, July 12, 2018

What’s New in Deep Learning Research: Understanding DeepMind’s IMPALA

https://towardsdatascience.com/whats-new-in-deep-learning-research-understanding-deepmind-s-impala-4fbfa5d0ad0c

Deep reinforcement learning has rapidly become one of the hottest research areas in the deep learning ecosystem. The fascination with reinforcement learning is related to the fact that, of all the deep learning modalities, it is the one that most closely resembles how humans learn. In the last few years, no company in the world has done more to advance the state of deep reinforcement learning than Alphabet’s subsidiary DeepMind.
Since the launch of its famous AlphaGo agent, DeepMind has been at the forefront of reinforcement learning research. A few days ago, they published new research that attempts to tackle one of the most challenging aspects of reinforcement learning: multi-tasking.
From the time we are infants, multi-tasking is an intrinsic element of our cognition. The ability to perform and learn similar tasks concurrently is essential to the development of the human mind. From a neuroscientific standpoint, multi-tasking remains largely a mystery, so it is not surprising that we have had a hard time implementing artificial intelligence (AI) agents that can efficiently learn multiple domains without requiring a disproportionate amount of resources. This challenge is even more evident in the case of deep reinforcement learning models, which are based on trial-and-error exercises that can easily cross the boundaries of a single domain. Biologically speaking, you can argue that all learning is a multi-tasking exercise.
Take a classic deep reinforcement learning scenario such as self-driving vehicles. In that scenario, AI agents need to concurrently learn different aspects such as distance, memory, or navigation while operating under rapidly changing parameters such as vision quality or speed. Most reinforcement learning methods today focus on learning a single task, and the models that tackle multi-task learning are too difficult to scale to be practical.
In their recent research, the DeepMind team proposed a new architecture for deep reinforcement multi-task learning called the Importance Weighted Actor-Learner Architecture (IMPALA). Inspired by another popular reinforcement learning architecture called A3C, IMPALA leverages a topology of different actors and learners that collaborate to build knowledge across different domains. Traditionally, deep reinforcement learning models use an architecture based on a single learner combined with multiple actors: each actor generates trajectories and sends them via a queue to the learner, and before starting the next trajectory it retrieves the latest policy parameters from the learner. IMPALA instead uses an architecture in which actors collect experience that is passed to a central learner that computes gradients, resulting in a model with completely independent actors and learners. This simple architecture enables the learner(s) to be accelerated using GPUs and the actors to be easily distributed across many machines.
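To make the actor/learner split concrete, here is a toy sketch of the pattern using plain Python threads and a queue (not DeepMind's distributed implementation; the environment, policy, and "gradient" are placeholders):

```python
# A minimal sketch of decoupled actors and a single learner: actors roll out
# trajectories with a possibly stale copy of the parameters and push them onto
# a queue; the learner pulls batches and updates the shared parameters.
import queue
import threading

import numpy as np

trajectory_queue = queue.Queue(maxsize=64)
policy_params = {"w": np.zeros(4)}           # shared parameters (toy policy)
lock = threading.Lock()

def actor(actor_id, n_rollouts=100):
    for _ in range(n_rollouts):
        with lock:                            # grab the latest (possibly stale) params
            w = policy_params["w"].copy()
        # Placeholder rollout: fake states generated instead of a real environment.
        trajectory = {"states": np.random.randn(20, 4), "params_used": w}
        trajectory_queue.put(trajectory)      # hand the trajectory to the learner

def learner(n_updates=100, batch_size=4):
    for _ in range(n_updates):
        batch = [trajectory_queue.get() for _ in range(batch_size)]
        grad = np.mean([t["states"].mean(axis=0) for t in batch], axis=0)
        with lock:                            # apply a (toy) gradient step
            policy_params["w"] -= 0.01 * grad

actors = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
learn = threading.Thread(target=learner)
for t in actors + [learn]:
    t.start()
for t in actors + [learn]:
    t.join()
print("final params:", policy_params["w"])
```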
In addition to the multi-actor architecture, the IMPALA research also introduces a new algorithm called V-trace that focuses on off-policy learning. The idea of V-trace is to mitigate the lag between when actions are generated by the actors and when the learner estimates the gradient.
The DeepMind team tested IMPALA on different scenarios using its famous DMLab-30 training set, and the results were impressive. IMPALA achieved better performance than A3C variants in terms of data efficiency, stability, and final performance. This might be the first deep reinforcement learning model that has been able to efficiently operate in multi-task environments.

Tuesday, July 10, 2018

Places: A 10 million image database for scene recognition


The Places dataset is designed following principles of human visual cognition. Our goal is to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference. The semantic categories of Places are defined by their function: the labels represent the entry-level of an environment. To illustrate, the dataset has different categories of bedrooms, streets, etc., as one does not act the same way, and does not make the same predictions of what can happen next, in a home bedroom, a hotel bedroom, or a nursery.
In total, Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. Using convolutional neural networks (CNNs), the Places dataset allows learning of deep scene features for various scene recognition tasks, with the goal of establishing new state-of-the-art performance on scene-centric benchmarks. Here we provide the Places Database and the trained CNNs for academic research and education purposes.

http://places2.csail.mit.edu/

http://places2.csail.mit.edu/download.html

Monday, July 9, 2018

Big Data Small Machine

https://adamdrake.com/big-data-small-machine.html

Introduction

I was honored to be invited by DevTO to give a talk at their May meetup. The organizers were keen to have someone speak about high-performance machine learning, and I was happy to oblige.
The general thesis of the talk is that, for the purposes of machine learning, setting up large compute clusters is wholly unnecessary. Furthermore, it should generally be considered harmful as those efforts are extremely time consuming and detract from solving the actual machine learning problem at hand.
To illustrate the point, I showed an online learning approach to binary classification problems using logistic regression with adaptive learning rates. While some might dismiss this approach as too simplistic or ineffective, consider that it is not very different from what Google was (is?) using for some of their online advertising prediction systems. This was described in the wonderful paper Ad Click Prediction: a View from the Trenches.
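As a minimal sketch in the spirit of that approach (online logistic regression with per-coordinate adaptive learning rates, AdaGrad-style rather than the full FTRL-Proximal algorithm from the Google paper; feature indices and values below are made up), a single machine can process examples one at a time:

```python
# A minimal sketch of online logistic regression with per-coordinate adaptive
# learning rates over sparse features, processed as a stream on one machine.
import math
from collections import defaultdict

w = defaultdict(float)          # weights, keyed by feature index
g2 = defaultdict(float)         # sum of squared gradients per coordinate
alpha = 0.1                     # base learning rate

def predict(features):
    z = sum(w[j] * v for j, v in features.items())
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))   # clipped sigmoid

def update(features, label):
    """One online step: `features` is {index: value}, `label` is 0 or 1."""
    p = predict(features)
    for j, v in features.items():
        g = (p - label) * v                                 # log-loss gradient
        g2[j] += g * g
        w[j] -= alpha / (math.sqrt(g2[j]) + 1e-8) * g       # adaptive step size

# Example: a stream of (features, label) pairs processed one at a time.
stream = [({0: 1.0, 3: 1.0}, 1), ({1: 1.0, 3: 1.0}, 0), ({0: 1.0, 2: 1.0}, 1)]
for x, y in stream:
    update(x, y)
print(predict({0: 1.0, 3: 1.0}))
```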
As in previous summaries of my lectures, I’ll reference select slides by section header and provide the explanation that went along with the slide, including some elaboration I may not have had time for in the lecture itself.

Claims

In my lecture I made a few general claims:
  • RAM in machines used to process data is growing more quickly than the data itself
  • There are many techniques for dealing with so-called Big Data, none of which involve clusters or heavy data infrastructure components like Kafka, Hadoop, Spark, and so on
  • One machine is fine for machine learning tasks, i.e., actually training ML models

How well do IBM, Microsoft, and Face++ AI services guess the gender of a face?


http://gendershades.org/index.html

Sunday, July 8, 2018

Semantic segmentation on aerial and satellite imagery

https://github.com/mapbox/robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water


RoboSat is an end-to-end pipeline written in Python 3 for feature extraction from aerial and satellite imagery. Features can be anything visually distinguishable in the imagery, for example: buildings, parking lots, roads, or cars.
The tools RoboSat comes with can be categorized as follows:
  • data preparation: creating a dataset for training feature extraction models
  • training and modeling: segmentation models for feature extraction in images
  • post-processing: turning segmentation results into cleaned and simple geometries
The tools work with the Slippy Map tile format to abstract away geo-referenced imagery behind tiles of the same size.
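For readers unfamiliar with that tiling scheme, here is a minimal sketch (not RoboSat code) of Slippy Map tile addressing: converting a longitude/latitude pair into x/y tile indices at a given zoom level.

```python
# A minimal sketch of Slippy Map tile addressing.
import math

def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) Slippy Map tile containing (lon, lat) at `zoom`."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Example: the tile covering central Berlin at zoom level 18.
print(lonlat_to_tile(13.4050, 52.5200, 18))
```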

Friday, July 6, 2018

A Tour of The Top 10 Algorithms for Machine Learning Newbies


In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem, and it’s especially relevant for supervised learning (i.e. predictive modeling).
For example, you can’t say that neural networks are always better than decision trees or vice-versa. There are many factors at play, such as the size and structure of your dataset.
As a result, you should try many different algorithms for your problem, while using a hold-out “test set” of data to evaluate performance and select the winner.
Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn’t bust out a shovel and start digging.

The Big Principle

However, there is a common principle that underlies all supervised machine learning algorithms for predictive modeling.
Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y): Y = f(X)
This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don’t know what the function (f) looks like or its form. If we did, we would use it directly and we would not need to learn it from data using machine learning algorithms.
The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate predictions possible.
For machine learning newbies who are eager to understand the basics of machine learning, here is a quick tour of the top 10 machine learning algorithms used by data scientists.

1 — Linear Regression

Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning.
Predictive modeling is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability. We will borrow, reuse and steal algorithms from many different fields, including statistics and use them towards these ends.
The representation of linear regression is an equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y), by finding specific weightings for the input variables called coefficients (B).
Linear Regression
For example: y = B0 + B1 * x
We will predict y given the input x and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.
Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible. It is a fast and simple technique and good first algorithm to try.
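As a minimal sketch of the equation above (synthetic data, purely for illustration), the coefficients B0 and B1 can be found by ordinary least squares with NumPy:

```python
# Fit y = B0 + B1 * x by ordinary least squares on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

B1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # slope from covariance / variance
B0 = y.mean() - B1 * x.mean()                    # intercept from the means

print(f"y = {B0:.2f} + {B1:.2f} * x")
print("prediction at x=6:", B0 + B1 * 6)
```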

2 — Logistic Regression

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).
Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable. Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.
The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (e.g. IF less than 0.5 then output 1) and predict a class value.
Logistic Regression
Because of the way that the model is learned, the predictions made by logistic regression can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful for problems where you need to give more rationale for a prediction.
Like linear regression, logistic regression does work better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It’s a fast model to learn and effective on binary classification problems.
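A minimal sketch with scikit-learn (the dataset is just an illustrative choice) shows the logistic function, the learned coefficients, and the 0.5 rule in action:

```python
# Logistic regression: coefficients passed through the S-shaped logistic function.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))               # squashes any value into (0, 1)

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)             # scaling helps the solver converge
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X[:5])[:, 1]          # probability of class 1
labels = (probs >= 0.5).astype(int)               # snap to 0/1 using the 0.5 rule

# Manually: the weighted sum of inputs passed through the logistic function.
manual = logistic(X[:5] @ model.coef_.ravel() + model.intercept_[0])
print(probs, labels, np.allclose(manual, probs))
```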

3 — Linear Discriminant Analysis

Logistic Regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.
The representation of LDA is pretty straight forward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:
  1. The mean value for each class.
  2. The variance calculated across all classes.
Linear Discriminant Analysis
Predictions are made by calculating a discriminate value for each class and making a prediction for the class with the largest value. The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data before hand. It’s a simple and powerful method for classification predictive modeling problems.
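As a minimal sketch on a three-class problem (iris is just an illustrative dataset), scikit-learn exposes the per-class means and picks the class with the largest discriminant value:

```python
# Linear Discriminant Analysis on a three-class dataset.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)

print(lda.means_)            # per-class mean of each input variable
print(lda.predict(X[:3]))    # class with the largest discriminant value
```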

4 — Classification and Regression Trees

Decision Trees are an important type of algorithm for predictive modeling and machine learning.
The representation of the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).
Decision Tree
The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that leaf node.
Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.
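A minimal sketch with scikit-learn makes the binary-tree representation visible: each node is one input variable and a split point, each leaf is an output class.

```python
# A classification tree; export_text prints the learned splits and leaves.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))     # each node: one input variable and a split point
print(tree.predict(X[:3]))   # walk the splits to a leaf and output its class
```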

5 — Naive Bayes

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.
The model is comprised of two types of probabilities that can be calculated directly from your training data: 1) The probability of each class; and 2) The conditional probability for each class given each x value. Once calculated, the probability model can be used to make predictions for new data using Bayes Theorem. When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.
Bayes Theorem
Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data, nevertheless, the technique is very effective on a large range of complex problems.
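A minimal sketch of the Gaussian variant with scikit-learn shows the two kinds of probabilities being estimated from the training data and combined via Bayes' Theorem:

```python
# Gaussian Naive Bayes: class priors plus per-class Gaussian likelihoods.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)

print(nb.class_prior_)            # P(class), estimated from class frequencies
print(nb.theta_)                  # per-class mean of each feature (Gaussian fit)
print(nb.predict_proba(X[:2]))    # posterior probabilities via Bayes' Theorem
```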

6 — K-Nearest Neighbors

The KNN algorithm is very simple and very effective. The model representation for KNN is the entire training dataset. Simple right?
Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression problems, this might be the mean output variable, for classification problems this might be the mode (or most common) class value.
The trick is in how to determine the similarity between the data instances. The simplest technique if your attributes are all of the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on the differences between each input variable.
K-Nearest Neighbors
KNN can require a lot of memory or space to store all of the data, but it only performs a calculation (or learns) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.
The idea of distance or closeness can break down in very high dimensions (lots of input variables) which can negatively affect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.
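Because the "model" is just the stored training set, KNN fits in a few lines; this minimal sketch (toy data) predicts via Euclidean distance and a majority vote:

```python
# K-nearest neighbors from scratch: store the data, vote among the K closest.
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # mode of their labels

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.9, 1.1])))   # -> 0
```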

7 — Learning Vector Quantization

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
Learning Vector Quantization
The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. Once learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction. Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
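A minimal sketch of the classic LVQ1 update rule (toy data and settings; not a production implementation) shows codebook vectors being attracted to same-class points and repelled from different-class points:

```python
# LVQ1: move the best matching codebook vector toward same-class points and
# away from different-class points, decaying the learning rate over epochs.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Two codebook vectors per class, initialized from random training points.
idx = rng.choice(len(X), size=4, replace=False)
codebooks, cb_labels = X[idx].copy(), y[idx].copy()

lr = 0.1
for epoch in range(20):
    for xi, yi in zip(X, y):
        b = np.argmin(np.linalg.norm(codebooks - xi, axis=1))  # best matching unit
        step = lr * (xi - codebooks[b])
        codebooks[b] += step if cb_labels[b] == yi else -step  # attract or repel
    lr *= 0.9                                                  # decay the learning rate

def predict(x):
    return cb_labels[np.argmin(np.linalg.norm(codebooks - x, axis=1))]

print(predict(np.array([0.2, -0.1])), predict(np.array([4.1, 3.9])))
```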

8 — Support Vector Machines

Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms.
A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1. In two dimensions, you can visualize this as a line, and let’s assume that all of our input points can be completely separated by this line. The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.
Support Vector Machine
The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that has the largest margin. Only these points are relevant in defining the hyperplane and in the construction of the classifier. These points are called the support vectors. They support or define the hyperplane. In practice, an optimization algorithm is used to find the values for the coefficients that maximize the margin.
SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.
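A minimal sketch of a linear SVM with scikit-learn (dataset chosen only for illustration) exposes the support vectors and the learned hyperplane coefficients:

```python
# A linear SVM: only the support vectors define the maximum-margin hyperplane.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)       # scale features for a stable fit

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.support_vectors_.shape)           # only these points define the hyperplane
print(svm.coef_.shape, svm.intercept_)      # the learned hyperplane coefficients
```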

9 — Bagging and Random Forest

Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
The bootstrap is a powerful statistical method for estimating a quantity from a data sample, such as a mean. You take lots of samples of your data, calculate the mean of each sample, then average all of your mean values to give a better estimate of the true mean value.
In bagging, the same approach is used, but instead for estimating entire statistical models, most commonly decision trees. Multiple samples of your training data are taken then models are constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
Random Forest
Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.
If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm.
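A minimal sketch with scikit-learn (dataset chosen only for illustration) contrasts plain bagging of decision trees with a random forest, which additionally randomizes the split candidates at each node:

```python
# Bagging of decision trees vs. a random forest on the same data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print("bagging:", cross_val_score(bagged, X, y, cv=5).mean())
print("forest: ", cross_val_score(forest, X, y, cv=5).mean())
```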

10 — Boosting and AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models has been added.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially one after the other, each updating the weights on the training instances, which affect the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the performance of each tree is weighted by how accurate it was on the training data.
Because so much attention is put on correcting mistakes by the algorithm it is important that you have clean data with outliers removed.
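A minimal sketch with scikit-learn (dataset chosen only for illustration) boosts decision stumps and exposes the per-tree weights used when combining votes:

```python
# AdaBoost over short decision trees ("stumps"); later trees focus on the
# training instances the earlier trees got wrong.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(stump, n_estimators=100, learning_rate=0.5,
                         random_state=0).fit(X, y)

print(ada.score(X, y))                 # training accuracy
print(ada.estimator_weights_[:5])      # per-tree weights used when combining votes
```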

Last Takeaway

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.
Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.

Sunday, July 1, 2018

Which whale is it, anyway? Face recognition for right whales using deep learning

https://deepsense.ai/deep-learning-right-whale-recognition-kaggle/

Learning Rate Finder for Keras

https://github.com/LucasAnders1/LearningRateFinder
https://github.com/LucasAnders1/LearningRateFinder/blob/master/lr_finder_callback.py

This repository includes a Keras callback which can be used to find an optimal learning rate for a Keras model, as described in Leslie Smith's paper: https://arxiv.org/abs/1506.01186
Choosing the right learning rate for a deep network can be tricky. A low learning rate may take very long to converge to an optimal solution, while a higher learning rate converges quickly but may never find the best solution.
The fast.ai library for pytorch offers a Learning Rate Finder to quickly find a good learning rate. In Keras, a similar solution can be realised by using a callback.
The callback can be used with any Keras model and increases the learning rate while training the model. The learning rate which yields the minimal training loss is supposed to perform well in training.
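As a minimal sketch of the idea (my own, not the linked repository's API): a Keras callback that exponentially increases the learning rate each batch and records the loss, so loss can be plotted against learning rate afterwards.

```python
# A learning rate range test as a Keras callback.
import numpy as np
from tensorflow import keras

class LRRangeTest(keras.callbacks.Callback):
    def __init__(self, min_lr=1e-6, max_lr=1.0, num_steps=100):
        super().__init__()
        self.lrs = np.geomspace(min_lr, max_lr, num_steps)   # exponential ramp
        self.step = 0
        self.history = []                                    # (lr, loss) pairs

    def on_train_batch_begin(self, batch, logs=None):
        lr = self.lrs[min(self.step, len(self.lrs) - 1)]
        keras.backend.set_value(self.model.optimizer.lr, lr)

    def on_train_batch_end(self, batch, logs=None):
        lr = self.lrs[min(self.step, len(self.lrs) - 1)]
        self.history.append((float(lr), logs["loss"]))
        self.step += 1

# Usage: model.fit(x, y, epochs=1, callbacks=[LRRangeTest()]), then plot history.
```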

Estimating an Optimal Learning Rate For a Deep Neural Network

https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0

The learning rate is one of the most important hyper-parameters to tune for training deep neural networks.
In this post, I’m describing a simple and powerful way to find a reasonable learning rate that I learned from fast.ai Deep Learning course. I’m taking the new version of the course in person at University of San Francisco. It’s not available to the general public yet, but will be at the end of the year at course.fast.ai (which currently has the last year’s version).

Structuring Deep Learning Projects A Step-by-Step Guide

https://khanna.cc/blog/structuring-deep-learning-projects/

You want to train a deep neural network. You have the data. It’s labeled and wrangled into a useful format. What do you do now?
When I have a deep learning project, I follow these six steps.
Step 1. Pick a cost function.
Step 2. Pick an initial network architecture.
Step 3. Fit the training set well on the cost function.
Step 4. Fit the validation set well on the cost function.
Step 5. Verify performance on a test set.
Step 6. Verify performance in the real world.

Step 1. Pick a Cost Function

The appropriate cost function depends on the type of problem I’m trying to solve. If I’m predicting an output value given an input value, i.e., a regression problem, I use mean squared error (MSE) loss. If I’m solving a classification problem with more than two classes, I use cross-entropy loss. If there are only two classes, I use binary cross-entropy loss. These standard loss functions are available in all major deep learning frameworks.
If I have an unusual problem, such as One-Shot Learning, I may need to design and use a custom loss function.
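A minimal sketch in Keras (one example framework; the model definitions are placeholders) of matching the loss to the problem type:

```python
# Choosing a standard loss per problem type.
from tensorflow import keras

regression_model = keras.Sequential([keras.layers.Dense(1)])
regression_model.compile(optimizer="adam", loss="mse")                       # regression

multiclass_model = keras.Sequential([keras.layers.Dense(10, activation="softmax")])
multiclass_model.compile(optimizer="adam", loss="categorical_crossentropy")  # >2 classes

binary_model = keras.Sequential([keras.layers.Dense(1, activation="sigmoid")])
binary_model.compile(optimizer="adam", loss="binary_crossentropy")           # 2 classes
```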

Step 2. Pick an Initial Network Architecture

For structured learning problems like predicting sales, I’ve found one, fully-connected hidden layer to be a good starting place. The number of activations in that layer should be between the number of input neurons and the number of output neurons. The midpoint between the two usually isn’t a bad choice.
I’ve also seen the following formula, which is fine as a starting point.
Nh = Ns / (a * (Ni + No))
Ni = number of input neurons
No = number of output neurons
Ns = number of samples in the training set
a = an arbitrary scaling factor between 2 and 10
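A quick sketch of that rule of thumb on a hypothetical tabular problem (the numbers are made up for the example):

```python
# 20 input features, 1 output, 50,000 training rows, scaling factor a = 5.
n_in, n_out, n_samples, a = 20, 1, 50_000, 5
n_hidden = n_samples // (a * (n_in + n_out))
print(n_hidden)   # 476 hidden units as a starting point
```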
If I’m doing computer vision, the best place to start is one of the proven architectures like ResNet.

Step 3. Fit the Training Set Well on the Cost Function

The most important hyperparameter in fitting the training set is the learning rate (α). I don’t bother with trial-and-error and instead use a learning rate finder that I adapted from the fastai library. The learning rate finder outputs a plot of the training loss against the learning rate.
I choose a learning rate where the loss is still clearly decreasing. I tend to pick a point that is a little bit to the right of the steepest point in the plot, i.e., where the loss is still strongly decreasing and has not yet been minimized. From such a plot, I would choose a learning rate of 10^-4.
If the model is training very slowly or not very well, I replace the initial stochastic gradient descent optimization algorithm with Adam optimization.
If the model still isn’t properly fitting the training set, I’ll add learning rate decay.[1] Learning rate decay has been tried a few different ways: exponential decay, discrete “staircase” decay, and even manual approaches where the experimenter visually observes when the loss stops decreasing and reduces α accordingly.
I prefer to use a cosine-shaped learning rate decay, which decays α over the course of an epoch[2] in the shape of a cosine curve. This makes the rate of decay slowest at the beginning and end of the epoch and highest in the middle of the epoch.
I’ll usually also add “restarts”, which reset the learning rate to its non-decayed value at the beginning of each epoch (or some integer multiple of epochs).
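A minimal sketch of this schedule as a Keras callback (my own simplification, not the author's code: it decays over a multi-epoch cycle and restarts at each cycle boundary, rather than decaying within every epoch per batch):

```python
# Cosine-shaped learning rate decay with restarts, applied per epoch.
import math

from tensorflow import keras

def cosine_with_restarts(max_lr=1e-3, min_lr=1e-5, cycle_len=5):
    """Decay from max_lr to min_lr over `cycle_len` epochs, then restart."""
    def schedule(epoch, lr):
        t = (epoch % cycle_len) / cycle_len            # position within the cycle
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))
    return schedule

lr_callback = keras.callbacks.LearningRateScheduler(cosine_with_restarts())
# Usage: model.fit(x_train, y_train, epochs=20, callbacks=[lr_callback])
```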
If I’m using some form of transfer learning, I’ll try unfreezing the earlier layers of the neural network and training it with differential learning rates.
At this point, if I’m still having trouble fitting the training set, I have a few more hyperparameters I’ll tune either sequentially or by random selection in the hyperparameter space.[3]
  • number of units in hidden layer
  • minibatch size, in powers of 2: 64, 128, 256, 512. Rarely, 1024.
  • number of hidden layers
I don’t bother tuning β1, β2, or ε for the Adam optimizer, since it doesn’t seem to make much difference.
If, after all this, I still cannot fit the training data well on the cost function, I consider whether the overall objective should be refined. The training inputs simply may not have the right information to predict the output I’m trying to fit.
A good example of this is trying to predict a stock’s future price based solely on that stock’s historical prices. If a stock’s movement is purely a stochastic process as it relates to its historical price, I’m not surprised when I can’t predict its future price with a deep neural network.[4]

Step 4. Fit the Validation Set Well on Cost Function

This is the step where I mainly spend my time. Getting a deep neural network to generalize well to a validation set is often the most difficult part of the project. Your deep neural network at this point almost assuredly performs better on the training set than it does on the validation set. There is, in other words, at least some degree of overfitting to the training set.
Here are the steps I take to reduce validation set loss.

Dropout

First, I’ll try applying dropout, which randomly zeroes out activations during training with some probability p. There’s no great rule of thumb for how I set p, but I’ll try a couple of things.
I’ll apply dropout before the last linear layer at p=0.25, and I’ll run multiple experiments tuning dropout up to p=0.50. If that doesn’t help, I’ll add dropout to earlier linear layers of the neural network with a similar range of p=0.25 to p=0.50. Like I said, there isn’t a great rule of thumb here, so it ends up being some experimentation and trial-and-error to determine which combination of dropout probabilities in which layers is the most effective at reducing validation loss.
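A minimal sketch of that dropout placement in Keras (layer sizes and the input shape are placeholders):

```python
# Dropout before the last linear layer, with optional dropout on earlier layers.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),
    layers.Dropout(0.25),                   # optional dropout on an earlier layer
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.25),                   # dropout before the last linear layer
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```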

Weight Decay / L2 Regularization

Second, I’ll add some form of weight decay, such as L2 regularization. This is a classic technique in machine learning that reduces overfitting. It works by adding a term to the cost function that looks like this:
(λ / 2) * Σ_{j=1..n} w_j^2
λ = regularization hyperparameter
w_j = feature j in weight matrix w
n = number of features in weight matrix w
One of the causes of overfitting is large weights, and this form of weight decay heavily penalizes large weights in the weight matrix. By adding this term to the cost function, a gradient step that increases the size of the weights in the weight matrix must improve the network more than the increased cost associated with larger weights. The weight decay parameter λ is an important one that has to be tuned.
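A minimal sketch of adding L2 weight decay to the layers in Keras (layer sizes, input shape, and the λ value are placeholders):

```python
# L2 regularization on the layer weights; λ is the hyperparameter to tune.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)                  # λ, a hyperparameter to tune
model = keras.Sequential([
    layers.Dense(64, activation="relu", kernel_regularizer=l2, input_shape=(20,)),
    layers.Dense(1, activation="sigmoid", kernel_regularizer=l2),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```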

Normalize Inputs

The third thing I’ll do to reduce overfitting is normalize the mean and variance of the input features over the training set. That means the average of input feature x_1 over all m training examples should equal 0 and the standard deviation should be 1. Same with features x_2 ... x_n.
In math terms, the vector x^(i) containing all of the features j for a single training example i should be modified like this:
μ = (1/m) * Σ_{i=1..m} x^(i)
x := x − μ
μ is a vector with size equal to the number of input features in a single training example. x is a matrix of size n by m, where n is the number of input features in a single training example and m is the total number of training examples. The subtraction operator in the second equation broadcasts the vector μ so that it is subtracted from each of the columns of matrix x.
The formula for normalizing the standard deviations should be familiar:
σ^2 = (1/m) * Σ_{i=1..m} (x^(i))^2
x := x / σ^2
I want to reiterate that I normalize the mean and standard deviation for each feature over the training set, not over the features for each sample in the training set. It seems obvious now, but that detail tripped me up initially.
Another important detail is that I keep track of the mean and standard deviation of the inputs from my training set and use those statistics to normalize the validation and test sets. In other words, I don’t calculate a mean and standard deviation over the validation set or test set data. Rather, I use the scaling factors from the training set to scale my validation and test sets.
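A minimal sketch of that detail with NumPy (using the common divide-by-standard-deviation variant; the data is synthetic): the statistics come from the training set only and are reused for the validation and test sets.

```python
# Fit normalization statistics on the training set, reuse them everywhere else.
import numpy as np

def fit_normalizer(x_train):
    mu = x_train.mean(axis=0)              # per-feature mean over training rows
    sigma = x_train.std(axis=0) + 1e-8     # per-feature std (avoid divide-by-zero)
    return mu, sigma

def apply_normalizer(x, mu, sigma):
    return (x - mu) / sigma

x_train = np.random.randn(1000, 5) * 3.0 + 7.0
x_val = np.random.randn(200, 5) * 3.0 + 7.0

mu, sigma = fit_normalizer(x_train)
x_train_n = apply_normalizer(x_train, mu, sigma)
x_val_n = apply_normalizer(x_val, mu, sigma)   # training-set statistics, not x_val's
```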

Batch Normalization

Batch normalization is becoming an increasingly standard part of any deep neural network. Whereas the previous technique normalized the input features, the batch norm technique reduces overfitting by normalizing the mean and variances of the hidden layer activations.
Batch norm is done per-minibatch. Like the input feature normalization, I use the mean and variances from the training set to apply batch norm to the validation and test sets. But unlike input feature normalization, batch norm is done per-minibatch and not over the whole training set, so I actually use an exponentially weighted average of the mean and variances from the training set.
That can be a little mind-bending, so I’ll say it another way. In batch norm, the mean and variance is taken on an activation-by-activation basis over the training samples in the minibatch. Since that will yield as many means and variances as there are minibatches, I then apply an exponentially weighted moving average across the minibatches to obtain a final mean and variance to apply to my validation and test sets.
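A minimal sketch of inserting batch normalization in Keras (layer sizes and input shape are placeholders); the layer itself tracks the moving averages of the minibatch means and variances and reuses them at validation and test time.

```python
# Batch normalization applied to the hidden-layer activations.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, input_shape=(20,)),
    layers.BatchNormalization(),            # normalize the hidden activations
    layers.Activation("relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```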

More Training Data & Data Augmentation

In my experience, overfitting is probably best solved by adding more data to the training set and continuing to train the network. Collecting more data can be expensive and time-consuming, which is why it’s not at the top of my list.
Another approach is to use data augmentation. Data augmentation is the process of synthesizing data to increase the data available to train the network. It is an excellent option in computer vision problems where it can be done by adjusting the lighting, rotation, orientation, or other visual characteristics of the image. In this way, for every image, you could get five or more images from it by transforming it in subtle ways. After all, if you’re trying to build a network that can detect cars on the road, whether that car is oriented left or right, and whether it is sunny or rainy, there is still a car in the photo.
Data augmentation as applied to structured data or natural language processing has not really been studied.[5]
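A minimal sketch of image data augmentation in Keras (the specific ranges are arbitrary illustrative choices): each training image is randomly rotated, shifted, and flipped, so the network sees many subtly transformed versions of the same picture.

```python
# Image data augmentation with random rotations, shifts, and flips.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,          # small random rotations
    width_shift_range=0.1,      # small horizontal shifts
    height_shift_range=0.1,     # small vertical shifts
    horizontal_flip=True,       # a left-facing car is still a car
)
# Usage: model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=10)
```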

Vanishing and Exploding Gradients

Often, I’ll encounter vanishing gradients or exploding gradients. Vanishing gradients occur when the gradients become so vanishingly small that the network does not learn well with gradient descent. Exploding gradients occur when the gradients become so large that they exceed the capability of the computer to calculate.
A first pass at solving both the problem of exploding and vanishing gradients is to initialize the weight matrices in a particular way: so-called, He initialization.
He initialization is characterized by initializing the weight matrix W^[l] to a zero-mean Gaussian distribution with a standard deviation of:
sqrt(2 / n^[l-1])
where n^[l-1] is the number of units in the previous layer l-1.
Although I did not expect it, initializing the weights in this particular way has profound effects on minimizing the likelihood of vanishing or exploding gradients.
If I’m training a recurrent neural network for something like language processing, I’ll almost always default to using a Long Short-Term Memory unit in my recurrent neural network, which also operates to reduce the likelihood of vanishing or exploding gradients.
Finally, a couple of notes about exploding gradients specifically. If I start to see NaN appear in my gradients, it’s likely that I have an exploding gradient problem. A blunt but effective way of dealing with exploding gradients is to use gradient clipping. Gradient clipping sets an upper bound on the gradients. If the gradient exceeds a certain threshold, it is “clipped” to the threshold value.
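A minimal sketch of both mitigations in Keras (layer sizes, input shape, and the clipping threshold are placeholders): He initialization for the weight matrices and gradient clipping in the optimizer.

```python
# He-initialized layers plus gradient clipping in the optimizer.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation="relu", kernel_initializer="he_normal",
                 input_shape=(20,)),
    layers.Dense(64, activation="relu", kernel_initializer="he_normal"),
    layers.Dense(1, activation="sigmoid"),
])
# clipnorm caps the norm of each gradient so it cannot "explode".
model.compile(optimizer=keras.optimizers.Adam(clipnorm=1.0),
              loss="binary_crossentropy")
```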
Sometimes, the tuning of these hyperparameters does not seem to make much of a difference. No matter how we tune them, the training set loss is still acceptably low, but the validation set loss is much higher. It therefore often makes sense to take a hard look at the architecture of the neural network. Too many features in the input can cause the network to fit noise in the training set. Too many activations in the hidden layers can do the same thing. So I tinker with the number of neurons in the hidden layers, perhaps reducing them in one hidden layer but increasing them in another hidden layer. Or I will increase or reduce the number of hidden layers themselves.
This technique, like most of the techniques in reducing overfitting, is a lot of trial-and-error. I often need to experiment with many different architectures until I find one that works well.

Step 5. Verify Performance on a Test Set

If the neural network performs well on the training and validation sets, I’m feeling pretty good about it. The main risk at this point is that, in the process of optimizing the network and tuning the hyperparameters, I’ve accidentally overfit my validation set.
When I fit the validation set to the cost function in Step 4, I looked to the loss, F-score or some other optimizing metric with respect to the validation set. In essence, I tuned the hyperparameters specifically optimizing on the validation set. In that way, I may have fit the hyperparameters (and therefore the neural network) to noise in the validation set, and the network may generalize poorly to data it has not yet seen.
So, in this step, I run the network on an as-yet-unseen test set to confirm the same results I saw on the validation set. If the network performs poorly on the test set, I increase the size of the validation set, either with additional data or with data augmentation. I then repeat Step 4 and fit the larger validation set before coming back to this Step 5 to verify its performance on a test set.
Don’t tune your network’s hyperparameters with respect to the test set loss! If you do, you’ll end up with a network overfit on your training set, validation set and test set and you won’t realize it.

Step 6. Verify Performance in the Real World

Now the fun part! Make sure the network performs well in the real world. If you’ve trained a cat classifier, start feeding the network pictures of your cat. If you’ve trained a recurrent neural network sentiment classifier for corporate press releases, feed it Microsoft’s latest press releases and see how it does.
If the network performs poorly in the real world, but performs well on the training, validation and test sets, something is wrong. You may have overfit to your test set somehow. Change your validation and test sets and see if the network still performs well on those. If it does, and it still has issues performing in the real world, it may be time to re-evaluate the cost function you are using (or your overall objective).
To be honest, I have not encountered this situation. In my experience, if I successfully make it to this Step 6, the network does not have issues in the real world.

Summary Checklist

Below is a short, checklist summary of the things to do or look out for in each step.

Acknowledgements

This post is inspired by my experiences in the deeplearning.ai Coursera specialization as well as the Practical Deep Learning for Coders course. Many thanks to Andrew Ng and Jeremy Howard for their hard work as educators.

Footnotes

  1. This learning rate decay is in addition to the effective decay that is part of the Adam optimizer. 
  2. An epoch is a single full pass through all the training data. It is comprised of minibatches. 
  3. If I’m tuning multiple hyperparameters at once, I avoid grid searches in the hyperparameter space in favor of random selection. 
  4. If you think you have, you’ve almost certainly just fit noise / overfit on the training set. 
  5. I am currently studying the potential for financial data augmentation using synthetic financial data. To date, synthetic financial data has been used to improve backtesting by using historical information to generate a dataset with statistical characteristics similar to the observed data. I’m hopeful that the same principles will lend themselves well to augmentation of training data.