Saturday, December 22, 2018

UPenn and Mayo Clinic's Seizure Detection Challenge

https://www.kaggle.com/c/seizure-detection/discussion/10111
https://github.com/MichaelHills/seizure-detection/raw/master/seizure-detection.pdf
https://github.com/MichaelHills/seizure-detection

Epilepsy Ecosystem. Melbourne Seizure Prediction Data

https://www.epilepsyecosystem.org/

Epilepsyecosystem.org is a crowd-sourcing ecosystem for improving the performance of seizure prediction algorithms in order to make seizure prediction a viable treatment option for those suffering from epilepsy.
Epilepsy afflicts nearly 1% of the world's population, and is characterized by the occurrence of spontaneous seizures. For many patients, anticonvulsant medications can be given at sufficiently high doses to prevent seizures, but patients frequently suffer side effects. For 20-40% of patients with epilepsy, medications are not effective. Even after surgical removal of epileptic brain tissue, many patients continue to experience spontaneous seizures. Despite the fact that seizures occur infrequently, patients with epilepsy experience persistent anxiety due to the possibility of a seizure occurring.
Seizure forecasting systems have the potential to help patients with epilepsy lead more normal lives. In order for electrical brain activity (EEG) based seizure forecasting systems to work effectively, computational algorithms must reliably identify periods of increased probability of seizure occurrence. If these seizure-permissive brain states can be identified, devices designed to warn patients of impending seizures would be possible. Patients could avoid potentially dangerous activities like driving or swimming, and medications could be administered only when needed to prevent impending seizures, reducing overall side effects.
A Crowd-Sourcing Ecosystem for Seizure Prediction
Epilepsyecosystem.org is the evolution of a Crowd-Sourcing Ecosystem for Seizure Prediction that began with the ‘Melbourne-University AES-MathWorks-NIH Seizure Prediction Challenge’ that was hosted on Kaggle.com in 2016. The contest focused on seizure prediction using long-term electrical brain activity recordings from humans obtained from the world-first clinical trial of the implantable NeuroVista Seizure Advisory System. Over 10,000 algorithms were submitted. The top algorithms from the contest were evaluated on additional held out data and demonstrated improvements in seizure prediction performance relative to the original trial results. Epilepsyecosystem.org offers the opportunity to yield further improvements with the contest dataset. The top algorithms in the ecosystem will be invited for evaluation on the full clinical trial database, a one-of-a-kind world-class dataset, with the aim of finding the best algorithms for the widest range of patients.
Acknowledgments
Epilepsyecosystem.org is supported by the Aikenhead Centre for Medical Discovery at St. Vincent’s Hospital Melbourne, the University of Melbourne, Swinburne University of Technology and Seer Medical.
References
Kuhlmann, L., Karoly, P., Freestone, D.R., Brinkmann, B.H., Temko, A., Barachant, A., Li, F., Titericz Jr., G., Lang, B.W., Lavery, D., Roman, K., Broadhead, D., Dobson, S., Jones, G., Tang, Q., Ivanenko, I., Panichev, O., Proix, T., Náhlík, M., Grunberg, D.B., Reuben, C., Worrell, G., Litt, B., Liley, D.T.J., Grayden, D.B., & Cook, M.J. (2018) Epilepsyecosystem.org: Crowd-Sourcing Reproducible Seizure Prediction with Long-Term Human Intracranial EEG, Brain, awy210, https://doi.org/10.1093/brain/awy210
Kuhlmann, L., Lehnertz, K., Richardson, M., Schelter, B., & Zaveri, H. (2018) Seizure prediction — ready for a new era. Nature Reviews Neurology. In Press (accepted 13/07/2018).
Freestone, D. R., Karoly, P. J., Peterson, A. D., Kuhlmann, L., Lai, A., Goodarzy, F., & Cook, M. J. (2015). Seizure prediction: science fiction or soon to become reality? Current Neurology and Neuroscience Reports, 15(11), 73.
Karoly, P. J., Ung, H., Grayden, D. B., Kuhlmann, L., Leyde, K., Cook, M. J., & Freestone, D. R. (2017). The circadian profile of epilepsy improves seizure forecasting. Brain, 140(8), 2169-2182.
Cook MJ, O'Brien TJ, Berkovic SF, Murphy M, Morokoff A, Fabinyi G, D'Souza W, Yerra R, Archer J, Litewka L, Hosking S, Lightfoot P, Ruedebusch V, Sheffield WD, Snyder D, Leyde K, Himes D (2013) Prediction of seizure likelihood with a long-term, implanted seizure advisory system in patients with drug-resistant epilepsy: a first-in-man study. Lancet Neurology 12:563-571.
Brinkmann, B. H., Wagenaar, J., Abbot, D., Adkins, P., Bosshard, S. C., Chen, M., ... & Pardo, J. (2016). Crowdsourcing reproducible seizure forecasting in human and canine epilepsy. Brain, 139(6), 1713-1722.
Gadhoumi, K., Lina, J. M., Mormann, F., & Gotman, J. (2016). Seizure prediction for therapeutic devices: A review. Journal of neuroscience methods, 260, 270-282.
Andrzejak RG, Chicharro D, Elger CE, Mormann F (2009) Seizure prediction: Any better than chance? Clin Neurophysiol.
Mormann F, Andrzejak RG, Elger CE, Lehnertz K (2007) Seizure prediction: the long and winding road. Brain 130:314-333.

Friday, December 21, 2018

Homemade Machine Learning

https://github.com/trekhleb/homemade-machine-learning

The purpose of this repository is not to implement machine learning algorithms using 3rd-party library "one-liners" but rather to practice implementing these algorithms from scratch and to gain a better understanding of the mathematics behind each algorithm. That's why all the algorithm implementations are called "homemade" and are not intended to be used in production.

Saturday, November 24, 2018

Measuring the Effects of Data Parallelism on Neural Network Training

Important paper from Google on large-batch optimization. They do impressively careful experiments measuring the number of iterations needed to achieve a target validation error at various batch sizes. The main "surprise" is the lack of surprises. [thread]

https://arxiv.org/abs/1811.03600

Monday, October 29, 2018

New schemes teach the masses to build AI

https://www.economist.com/business/2018/10/27/new-schemes-teach-the-masses-to-build-ai

https://news.ycombinator.com/item?id=18320927

OVER THE past five years researchers in artificial intelligence have become the rock stars of the technology world. A branch of AI known as deep learning, which uses neural networks to churn through large volumes of data looking for patterns, has proven so useful that skilled practitioners can command high six-figure salaries to build software for Amazon, Apple, Facebook and Google. The top names can earn over $1m a year.

The standard route into these jobs has been a PhD in computer science from one of America’s elite universities. Earning one takes years and requires a disposition suited to academia, which is rare among more normal folk. Graduate students are regularly lured away from their studies by lucrative jobs.

That is changing. This month fast.ai, an education non-profit based in San Francisco, kicked off the third year of its course in deep learning. Since its inception it has attracted more than 100,000 students, scattered around the globe from India to Nigeria. The course and others like it come with a simple proposition: there is no need to spend years obtaining a PhD in order to practise deep learning. Creating software that learns can be taught as a craft, not as a high intellectual pursuit to be undertaken only in an ivory tower. Fast.ai’s course can be completed in just seven weeks.

Demystifying the subject, to make it accessible to anyone who wants to learn how to build AI software, is the aim of Jeremy Howard, who founded fast.ai with Rachel Thomas, a mathematician. He says school mathematics is sufficient. “No. Greek. Letters,” Mr Howard intones, thumping the table for punctuation.

It is working. A graduate from fast.ai’s first year, Sara Hooker, was hired into Google’s highly competitive AI residency programme after finishing the course, having never worked on deep learning before. She is now a founding member of Google’s new AI research office in Accra, Ghana, the firm’s first in Africa. In Bangalore, some 2,400 people are members of AI Saturdays, which follows the course together as a gigantic study group. Andrej Karpathy, one of deep learning’s foremost practitioners, recommends the course.

Fast.ai’s is not the only alternative AI programme. AI4ALL, another non-profit venture, works to bring AI education to schoolchildren in the United States that would otherwise not have access to it. Andrew Ng, another well-known figure in the field, has started his own online course, deeplearning.ai.

Mr Howard’s ambitions run deeper than loosening the AI labour market. His aim is to spread deep learning into many hands, so that it may be applied in as diverse a set of fields by as diverse a group of people as possible. So far, it has been controlled by a small number of mostly young white men, almost all of whom have been employed by the tech giants. The ambition, says Mr Howard, is for AI training software to become as easy to use and ubiquitous as sending an email on a smartphone.

Some experts worry that this will serve only to create a flood of dodgy AI systems which will be useless at best and dangerous at worst. An analogy may allay those concerns. In the earliest days of the internet, only a select few nerds with specific skills could build applications. Not many people used them. Then the invention of the world wide web led to an explosion of web pages, both good and bad. But it was only by opening up to all that the internet gave birth to online shopping, instant global communications and search. If Mr Howard and others have their way, making the development of AI software easier will bring forth a new crop of fruit of a different kind.

The Data Science of K-Pop: Understanding BTS through data and A.I.

https://towardsdatascience.com/the-data-science-of-k-pop-understanding-bts-through-data-and-a-i-part-1-50783b198ac2

Sunday, October 28, 2018

ESP-WHO

ESP-WHO is a face detection and recognition platform that is currently based on Espressif Systems' ESP32 chip. It is part of Espressif's AI framework, which so far focuses on face detection and recognition.

https://github.com/espressif/esp-who

Sunday, October 21, 2018

Three Principles for Successful AI Solutions

https://blog.f-secure.com/three-principles-successful-ai-solutions/

Everyone is talking about Artificial Intelligence, but what’s really going on? Is AI going to solve all your problems? And what even IS AI? What are the key things you need to take into account when building AI solutions? In practice, AI is often used as an umbrella term, and most of the solutions we refer to are actually machine learning, which has been around for some time. For example, the first artificial neural networks emerged in the mid-1900s.
Sure, methods and algorithms have evolved a lot, and perhaps even more significantly, computing power has increased drastically, but we have been using and developing machine learning models for quite some time. Even though there have been fantastic recent advances in ease of use through ready-made toolkits like TensorFlow and managed cloud services like SageMaker, many of the fundamentals learned through experience over the years still apply, yet some of them can get forgotten in all the excitement. I have been working in the field for a while now, and I want to share a few principles that I think everyone should keep in mind when designing and building – or evaluating, if you are sitting on that side of the fence – successful AI/ML solutions.

Saturday, October 20, 2018

Raster Vision: A New Open Source Framework for Deep Learning on Satellite and Aerial Imagery

An open source framework for deep learning on satellite and aerial imagery.

https://github.com/azavea/raster-vision

What is Raster Vision?

Raster Vision is an open source framework for Python developers building computer vision models on satellite, aerial, and other large imagery sets, including non-georeferenced data like oblique drone imagery. It allows engineers to quickly and repeatably configure experiments that go through core components of a machine learning workflow: analyzing training data, creating training chips, training models, creating predictions, evaluating models, and bundling the model files and configuration for easy deployment.
The input to a Raster Vision workflow is a set of images and training data, optionally with Areas of Interest (AOIs), that describe where the images are labeled. Running a workflow results in evaluation metrics and a packaged model and configuration that enables easy deployment. Raster Vision also supports running multiple experiments at once to find the best model (or models) to deploy.

Monday, October 15, 2018

Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups

https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

Travelling Salesman Challenge 2.0

https://travellingsalesman.kiwi.com/

Win a trip around the world based on an algorithm you write!
This is your chance to get your hands on something that’s normally purely theoretical and have some fun with it. The Travelling Salesman Challenge is returning for 2018, and this time you’re finding the cheapest route between whole areas.

Monday, October 8, 2018

AI and the News: An Open Challenge

https://aiethicsinitiative.org/challenge


Artificial intelligence — and its subfield of machine learning — is reshaping the landscape of news and information. From the algorithms filtering what we see on social media, to the use of machine learning to generate news stories and online content, AI has played, and will continue to play, a major role in shaping what information is distributed and consumed, and how.
As researchers and companies continue to advance the technical state of the art, we believe that it is necessary to ensure that AI serves the public good. This means not only working to address the problems presented by existing AI systems, but articulating what realistic, better alternatives might look like.
This open challenge, which will award up to $750,000 to a range of projects, is seeking fresh and experimental approaches to four specific problems at the intersection of AI and the news:
  • Governing the Platforms: Ensuring that AI serves the public good requires the public to know how the platforms are deploying these technologies and how they shape the flow of information through the web today. However, as many others have pointed out, the level of transparency and accountability around these decisions has been limited, and we’re seeking ideas that help to raise it. This might be new policies in the form of draft legislation, or technical tools that help keep an eye on the information ecosystem. 
  • Stopping Bad Actors: AI might be applied by a variety of actors to spread disinformation, from powering believable bots on social media to fabricating realistic video and audio. This exacerbates a range of existing problems in news and information. We’re seeking approaches we can take to detect and counter this next generation of propaganda. 
  • Empowering Journalism: Journalists play a major role in shaping public understanding of AI, its impact on the information ecosystem, and what we should do to ensure the technology is used ethically. But it can be hard to keep up with the latest developments in the technical research and communicate them effectively to society at large. We’re seeking ideas that will help bolster this community in this important work, and give them the tools they need to effectively communicate about AI and its impact. 
  • Reimagining AI and News: It is easy to find a lot of things to critique about the influence that automation and AI have on the news and information space. More challenging is articulating plausible alternatives for how these platforms should be designed and how they should deploy these technologies. We’re interested in ideas that paint a picture of the future: How might platforms from smartphones and social media sites to search engines and online news outlets be redesigned in part or entirely to better serve the public good? 
We believe there are a diverse range of communities that can bring their expertise to bear on these issues but are frequently left out of the conversation. This challenge is open to anyone: We’re looking for journalists, designers, technologists, activists, entrepreneurs, artists, lawyers from a variety of communities around the world — anyone who thinks they have a good idea about addressing these problems that may not have been tried before.

PacktPub Free Learning

https://www.packtpub.com/packt/offers/free-learning/

Friday, September 28, 2018

Why Deep Learning Is Suddenly Changing Your Life

http://fortune.com/ai-artificial-intelligence-deep-machine-learning

 

Decades-old discoveries are now electrifying the computing industry and will soon transform corporate America.

Over the past four years, readers have doubtlessly noticed quantum leaps in the quality of a wide range of everyday technologies.
Most obviously, the speech-recognition functions on our smartphones work much better than they used to. When we use a voice command to call our spouses, we reach them now. We aren’t connected to Amtrak or an angry ex.
In fact, we are increasingly interacting with our computers by just talking to them, whether it’s Amazon’s Alexa, Apple’s Siri, Microsoft’s Cortana, or the many voice-responsive features of Google. Chinese search giant Baidu says customers have tripled their use of its speech interfaces in the past 18 months.
Machine translation and other forms of language processing have also become far more convincing, with Google, Microsoft, Facebook, and Baidu unveiling new tricks every month. Google Translate now renders spoken sentences in one language into spoken sentences in another for 32 pairs of languages, while offering text translations for 103 tongues, including Cebuano, Igbo, and Zulu. Google’s Inbox app offers three ready-made replies for many incoming emails.

Using Confusion Matrices to Quantify the Cost of Being Wrong

https://www.datasciencecentral.com/profiles/blogs/using-confusion-matrices-to-quantify-the-cost-of-being-wrong

Tuesday, September 25, 2018

Python Machine Learning Book 2nd Edition

https://github.com/rasbt/python-machine-learning-book-2nd-edition

Python Machine Learning, 2nd Ed.
published September 20th, 2017
Paperback: 622 pages
Publisher: Packt Publishing
Language: English
ISBN-10: 1787125939
ISBN-13: 978-1787125933
Kindle ASIN: B0742K7HYF

Links

Table of Contents and Code Notebooks

Helpful installation and setup instructions can be found in the README.md file of Chapter 1
To access the code materials for a given chapter, simply click on the open dir links next to the chapter headlines to navigate to the chapter subdirectories located in the code/ subdirectory. You can also click on the ipynb links below to open and view the Jupyter notebook of each chapter directly on GitHub.
In addition, the code/ subdirectories also contain .py script files, which were created from the Jupyter Notebooks. However, I highly recommend working with the Jupyter notebook if possible in your computing environment. Not only do the Jupyter notebooks contain the images and section headings for easier navigation, but they also allow for a stepwise execution of individual code snippets, which -- in my opinion -- provide a better learning experience.
Please note that these are just the code examples accompanying the book, which I uploaded for your convenience; be aware that these notebooks may not be useful without the formulae and descriptive text.
  1. Machine Learning - Giving Computers the Ability to Learn from Data [open dir] [ipynb]
  2. Training Machine Learning Algorithms for Classification [open dir] [ipynb]
  3. A Tour of Machine Learning Classifiers Using Scikit-Learn [open dir] [ipynb]
  4. Building Good Training Sets – Data Pre-Processing [open dir] [ipynb]
  5. Compressing Data via Dimensionality Reduction [open dir] [ipynb]
  6. Learning Best Practices for Model Evaluation and Hyperparameter Optimization [open dir] [ipynb]
  7. Combining Different Models for Ensemble Learning [open dir] [ipynb]
  8. Applying Machine Learning to Sentiment Analysis [open dir] [ipynb]
  9. Embedding a Machine Learning Model into a Web Application [open dir] [ipynb]
  10. Predicting Continuous Target Variables with Regression Analysis [open dir] [ipynb]
  11. Working with Unlabeled Data – Clustering Analysis [open dir] [ipynb]
  12. Implementing a Multi-layer Artificial Neural Network from Scratch [open dir] [ipynb]
  13. Parallelizing Neural Network Training with TensorFlow [open dir] [ipynb]
  14. Going Deeper: The Mechanics of TensorFlow [open dir] [ipynb]
  15. Classifying Images with Deep Convolutional Neural Networks [open dir] [ipynb]
  16. Modeling Sequential Data Using Recurrent Neural Networks [open dir] [ipynb]

Monday, September 24, 2018

Keras callbacks guide and code

https://keunwoochoi.wordpress.com/2016/07/16/keras-callbacks/
The code is outdated for the latest Keras but still useful.
An updated version is available at https://github.com/keunwoochoi/keras_callbacks_example

Open Machine Learning Course mlcourse.ai

https://mlcourse.ai/

mlcourse.ai is an open Machine Learning course by OpenDataScience. The course is designed to perfectly balance theory and practice; therefore, each topic is followed by an assignment with a deadline in a week. You can also take part in several Kaggle Inclass competitions held during the course and work on your own projects.
Next session starts on October 1, 2018. Fill in this form to participate.
Navigation:
  • Prerequisites. Our course is not for total newbies. Though machine learning is covered from scratch, participants are still expected to know some math and be able to write code in Python.
  • Assignments. Here you’ll find demo versions of assignments. Assignments in a new session of the course will be different.
  • News. Here you can track main announcements during the course.
  • Resources. Links to other information mirrors of this course like Medium stories, Kaggle Kernels etc.
  • Contacts. Ways of reaching OpenDataScience and course team.
  • Support. Various ways in which you can help mlcourse.ai to grow.




Wednesday, August 1, 2018

Ten Techniques Learned From fast.ai

 
https://blog.floydhub.com/ten-a-from-fast-ai/

Right now, Jeremy Howard – the co-founder of fast.ai – holds the 105th highest score for the plant seedling classification contest on Kaggle, but he's dropping fast. Why? His own students are beating him. And their names can now be found across the tops of leaderboards all over Kaggle.
When I left you I was but the learner, but now I am the master.
So what are these secrets that are allowing novices to implement world-class algorithms in mere weeks, leaving behind experienced deep learning practitioners in their GPU-powered wake? Allow me to tell you in ten simple steps.
Read on if you're already practicing deep learning and want to quickly get an overview of the powerful techniques that fast.ai uses in their courses. Read on if you've already completed fast.ai and want to recap some of what you were supposed to have already learned. Read on if you're flirting with the idea of studying deep learning, and would like to see how the field is evolving and what fast.ai can offer beginners.
Now, before we begin, you should know that you'll need access to GPUs to run fast.ai content effectively. For my fast.ai projects, I've been using FloydHub. After much experimentation and research with other cloud-based solutions, I've found FloydHub is the best and easiest way to train deep learning models on cloud GPUs. I love being able to easily keep track of my experiments in Projects on FloydHub, making it especially easy to visualize and filter which models are performing best. They also have the simplest solution for managing (and automatically versioning) your datasets, which you'll learn is going to be super valuable down the road in any DL project.
Okay, let's get started.

1. Use the Fast.ai library

from fastai import *
The fast.ai library is not only a toolkit to get newbies quickly implementing deep learning, but a powerful and convenient source of current best practices. Each time the fast.ai team (and their network of AI researchers & collaborators) finds a particularly interesting paper, they test it out on a variety of datasets and work out how to tune it. If they are successful, it gets implemented in the library, and the technology can be quickly accessed by its users.
The result is a powerful toolbox, including quick access to current best practices such as stochastic gradient descent with restarts, differential learning rates, and test-time augmentation (not to mention many more).
Each of these techniques will be described below, and we will show how you can rapidly implement them using the fast.ai library. The library is built upon PyTorch, and you can use them together quite fluidly. To get going with the library on FloydHub, check out their 2-min installation.

2. Don’t use one learning rate, use many

Differential learning rates mean that the higher (later) layers change more than the deeper (earlier) layers during training. Building deep learning models on top of pre-existing architectures is a proven method to generate much better results in computer vision tasks.
Most of these architectures (i.e. ResNet, VGG, Inception…) are trained on ImageNet, and depending on the similarity of your data to the images in ImageNet, these weights will need to be altered more or less. When it comes to modifying these weights, the last layers of the model will often need the most changing, while the deeper levels that are already well trained to detect basic features (such as edges and outlines) will need less.
So firstly, to get a pre-trained model with the fast ai library use the following code:
from fastai.conv_learner import *

# import the library for creating a learner object for convolutional networks
model = vgg16

# choose a pre-trained architecture such as vgg16 or resnet34, or even your own custom model
PATH = './folder_containing_images' 
data = ImageClassifierData.from_paths(PATH)

# create fast ai data object, in this method we use from_paths where 
# inside PATH each image class is separated into different folders

learn = ConvLearner.pretrained(model, data, precompute=True)

# create a learn object to quickly utilise state of the art
# techniques from the fast ai library
With the learn object now created, we can fine-tune just the last layers by quickly freezing the earlier ones:
learn.freeze()

# freeze layers up to the last one, so weights will not be updated.

learning_rate = 0.1
learn.fit(learning_rate, 3)

# train only the last layer for a few epochs
Once the last layers are producing good results, we implement differential learning rates to alter the lower layers as well. The lower layers want to be altered less, so it is good practice to set each learning rate to be 10 times lower than the last:
learn.unfreeze()

# set requires_grads to be True for all layers, so they can be updated

learning_rate = [0.001, 0.01, 0.1]
# learning rates are set so that the earliest third of the layers train at 0.001,
# the middle third at 0.01, and the final layers at 0.1

learn.fit(learning_rate, 3)
# train the model for three epochs using differential learning rates

3. How to find the right learning rate

The learning rate is the most important hyper-parameter for training neural networks, yet until recently deciding its value has been incredibly hacky. Leslie Smith may have stumbled upon the answer in his paper on cyclical learning rates; a relatively unknown discovery until it was promoted by the fast.ai course.
In this method, we do a trial run and train the neural network using a low learning rate, but increase it exponentially with each batch. This can be done with the following code:
learn.lr_find()
# run on learn object where learning rate is increased  exponentially

learn.sched.plot_lr()
# plot graph of learning rate against iterations
The learning rate is increased exponentially with every iteration
Meanwhile, the loss is recorded for every value of the learning rate. We then plot loss against learning rate:
learn.sched.plot()
# plots the loss against the learning rate
Find where the loss is still decreasing but has not plateaued.
The optimum learning rate is determined by finding the value where the learning rate is highest yet the loss is still descending; in the case above, this value would be about 0.01.

4. Cosine annealing

With each batch of stochastic gradient descent (SGD), your network should be getting closer and closer to a global minimum of the loss. As it gets closer to this minimum, it therefore makes sense for the learning rate to get smaller so that your algorithm does not overshoot, and instead settles as close to this point as possible. Cosine annealing solves this problem by decreasing the learning rate following the cosine function, as seen in the figure below.
As we increase x, we see cosine(x) decrease following this wavy shape.
Looking at the figure above, we see that as we increase x the cosine value descends slowly at first, then more quickly and then slightly slower again. This mode of decreasing works well with the learning rate, yielding great results in a computationally efficient manner.
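For reference, here is a minimal sketch of the schedule itself (my own illustration of the standard cosine annealing formula from the SGDR paper, not fast.ai library code), assuming a maximum rate lr_max and a cycle of T iterations:

import math

def cosine_annealing(lr_max, t, T, lr_min=0.0):
    # learning rate at iteration t of a cycle lasting T iterations:
    # starts at lr_max and decays to lr_min along half a cosine wave
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# example: a rate of 0.1 decaying towards 0 over a 200-iteration epoch
schedule = [cosine_annealing(0.1, t, 200) for t in range(200)]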
learn.fit(0.1, 1)
# Calling learn fit automatically takes advantage of cosine annealing
The technique is implemented automatically by the fast ai library when using learn.fit(). The above code would have our learning rate decrease across the epoch as shown in the figure below.
Learning rate decreases across an epoch that takes 200 iterations

However, we can go one step further and introduce restarts.

5. Stochastic Gradient Descent with restarts

During training it is possible for gradient descent to get stuck at local minima rather than the global minimum.
Gradient descent can get stuck at local minima
By increasing the learning rate suddenly, gradient descent may “hop” out of the local minima and find its way toward the global minimum. Doing this is called stochastic gradient descent with restarts (SGDR), an idea shown to be highly effective in a paper by Loshchilov and Hutter.
SGDR is also handled for you automatically by the fast ai library. When calling learn.fit(learning_rate, epochs), the learning rate is reset at the start of each epoch to the original value you entered as a parameter, then decreases again over the epoch as described above in cosine annealing.
The learning rate is restored to its original value after each epoch.
Each time the learning rate drops to its minimum point (every 100 iterations in the figure above), we call this a cycle.
cycle_len = 1
# decide how many epochs it takes for the learning rate to fall to
# its minimum point. In this case, 1 epoch

cycle_mult=2
# at the end of each cycle, multiply the cycle_len value by 2

learn.fit(0.1, 3, cycle_len=1, cycle_mult=2)
# in this case there will be three restarts. The first time with
# cycle_len of 1, so it will take 1 epoch to complete the cycle.
# cycle_mult=2 means the next cycle will have a length of two epochs,
# and the next one four.
Each cycle taking twice as many epochs to complete as the prior cycle
Playing around with these parameters, along with using differential learning rates, is one of the key techniques that allow fast.ai users to perform so well on image classification problems.
Cycle_mult and cycle_len are discussed here on the fast.ai forum, while the concepts above regarding learning rate can be found explained more fully in this fast ai lesson and this blog post.

6. Anthropomorphise your activation functions

Softmax likes to pick just one thing. Sigmoid wants to know where you are between 0 and 1, and beyond these values won’t care how much you increase. Relu is a club bouncer who won’t let negative numbers through the door.
It may seem silly to treat activation functions in such a manner, but giving them a character helps ensure you don't use them for the wrong task. As Jeremy Howard points out, even academic papers often use softmax for multi-label classification, and I too have seen it used incorrectly in blogs and papers during my short time studying DL.
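To make the distinction concrete, here is a small NumPy illustration of my own (not fast.ai code) showing why sigmoid rather than softmax suits multi-label outputs: softmax forces the scores to compete and sum to one, while sigmoid scores each class independently:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logits = np.array([3.0, 2.9, -1.0])  # two classes are clearly present

print(softmax(logits))   # ~[0.52, 0.47, 0.01] -- forced to "pick one"
print(sigmoid(logits))   # ~[0.95, 0.95, 0.27] -- each class judged on its own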

7. Transfer learning is hugely effective in NLP

Just as using pre-trained models has proven immensely effective in computer vision, it is becoming increasingly clear that natural language processing (NLP) models can benefit from doing the same.
In the 4th lesson of fast.ai, Jeremy Howard builds a model to determine if IMDB reviews are positive or negative using transfer learning. The power of this technique is apparent immediately: the accuracy he achieves beats all previous efforts of the time, as presented in a paper by Bradbury et al.
Pre-existing architectures deliver state of the art NLP results.
The secret to success lies in training a model firstly to gain some understanding of the language, before using this pre-trained model as a part of a model to analyze sentiment.
To create the first model, a recurrent neural network (RNN) is trained to predict the next word in a sequence of text. This is known as language modeling. Once the network is trained to a high degree of accuracy, its encodings for each word are passed on to a new model that is used for sentiment analysis.
In the example we see this language model being integrated with a model to perform sentiment analysis, but this same method could be used for any NLP task from translation to data extraction.
And again the same principles as above in computer vision apply here, where freezing and using differential learning rates can yield better results.
The implementation of this method for NLP is too detailed for me to share the code in this post, but if you are interested watch the lesson here and access the code here.

8. Deep learning can challenge ML in tackling structured data

Fast.ai shows techniques to rapidly generate great results on structured data without having to resort to feature engineering or apply domain specific knowledge.
Their library makes the most of PyTorch’s embedding functions, allowing rapid conversion of categorical variables into embedding matrices.
The technique they show is relatively straightforward, and simply involves turning the categorical variables into numbers and then assigning each value an embedding vector:
Each day of the week is given an embedding with four values.
The advantage of doing this over the traditional approach of creating dummy variables (i.e. one-hot encoding) is that each day can be represented by four numbers instead of one, so we gain higher dimensionality and much richer relationships.
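As a minimal generic sketch of the idea (plain PyTorch, not the fast.ai API), a categorical value becomes an integer index that looks up a row in a learnable embedding matrix:

import torch
import torch.nn as nn

# 7 categories (days of the week), each mapped to a learnable 4-dimensional vector
day_embedding = nn.Embedding(num_embeddings=7, embedding_dim=4)

days = torch.tensor([0, 3, 6])   # e.g. Monday, Thursday and Sunday as integer codes
vectors = day_embedding(days)    # shape (3, 4)

# the embedding weights are trained jointly with the rest of the network,
# so related categories can end up close together in this 4-dimensional space
print(vectors.shape)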
The implementation shown in this lesson (the code here) gained third place in the Rossmann Kaggle competition, only beaten by domain experts who had their own code to create many, many extra features.
The idea that using deep learning dramatically reduces the need for feature engineering has been confirmed by Pinterest too, who have said this has been the case ever since they switched to deep learning models, gaining state-of-the-art results with a lot less work!

9. A game-winning bundle: building up sizes, dropout and TTA

On the 30th April, the fast.ai team won the DAWNBench competition (run by Stanford University) on Imagenet and CIFAR10 classification. In Jeremy’s write-up of the victory, he credits their success to the little extra touches available in the fast.ai library.
One of these is the concept of dropout, proposed by Geoffrey Hinton and his colleagues several years ago in this seminal paper. Despite its initial popularity, it seems to be somewhat ignored in recent computer vision papers. However, PyTorch has made its implementation incredibly easy, and with fast.ai on top it’s easier than ever.
Blank spaces represent activations knocked out by dropout function.
Dropout combats overfitting and so would have proved crucial in winning on a relatively small dataset such as CIFAR10. Dropout is implemented automatically by fast.ai when creating a learn object, though it can be altered using the ps variable as shown here:
learn = ConvLearner.pretrained(model, data, ps=0.5, precompute=True)
# applies dropout with probability 0.5 (i.e. half the activations are dropped) during training.
# Dropout is automatically turned off for the validation set.
For more information on dropout see this video (from 4:57).
Another incredibly simple and effective method they used for tackling overfitting and improving accuracy is training on smaller image sizes, then increasing the size and training the same model on them again.
# create a data object with images of sz * sz pixels 
def get_data(sz): 
    tfms = tfms_from_model(model, sz)
    # tells what size images should be; additional transformations such
    # as image flips and zooms can easily be added here too

    data = ImageClassifierData.from_paths(PATH, tfms=tfms)
    # creates a fastai data object of the chosen size

    return data

learn.set_data(get_data(299))
# changes the data in the learn object to be images of size 299
# without changing the model.

learn.fit(0.1, 3)
# train for a few epochs on larger versions of images, avoiding overfitting
A final technique that can raise accuracy by one percent or two is test time augmentation (TTA). This involves taking a series of different versions of the original image (for example cropping different areas, or changing the zoom) and passing them through the model. The average output is then calculated for the different versions and this is given as the final output score for the image. It can be called by running learn.TTA().
preds, target = learn.TTA()
This technique is effective as perhaps the original cropping of an image may miss out a vital feature. Providing the model with multiple versions of the picture and taking an average makes this less likely to have an effect.
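As I recall from the 0.7-era library used in the course (worth verifying against your installed version), learn.TTA() returns log-probabilities for each augmented copy, so the averaging step looks roughly like this:

import numpy as np

log_preds, target = learn.TTA()              # continuing the learn object from above
probs = np.mean(np.exp(log_preds), axis=0)   # average the augmented predictions
predicted_class = np.argmax(probs, axis=1)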

10. Creativity is key

Not only did the fast.ai team win prizes for fastest speed in the DAWNBench competition, but these same algorithms also won the prize for being cheapest to run. The lesson to be learnt here is that creating successful DL applications is not just a case of chucking huge amounts of GPU at the issue, but should instead be a question of creativity, of intuition and innovation.
Most of the breakthroughs discussed in this article (dropout, cosine annealing, SGD with restarts, the list goes on…) were in fact exactly such moments, when someone thought of approaching the problem differently. These approaches then brought increases in accuracy greater than those that would have been achieved by simply throwing another thousand images at the problem with a handful of IBM computers.
So just because there are a lot of big dogs out there with a lot of big GPUs in Silicon Valley, don’t think that you can’t challenge them, or that you can’t create something special or innovative.
In fact, perhaps sometimes you can see constraints as a blessing; after all, necessity is the mother of invention.

About Samuel Lynn-Evans

For the last 10 years, Sam has combined his passion for science and languages by teaching life sciences in foreign countries. Seeing the huge potential for ML in scientific progress, he began studying AI at school 42 in Paris, with the aim of applying NLP to biological and medical problems.
You can follow along with Sam on Twitter and Medium.

Sunday, July 29, 2018

Differentiable Image Parameterizations

https://distill.pub/2018/differentiable-parameterizations/

Neural networks trained to classify images have a remarkable — and surprising! — capacity to generate images. Techniques such as DeepDream [1], style transfer [2], and feature visualization [3] leverage this capacity as a powerful tool for exploring the inner workings of neural networks, and to fuel a small artistic movement based on neural art. All these techniques work in roughly the same way. Neural networks used in computer vision have a rich internal representation of the images they look at. We can use this representation to describe the properties we want an image to have (e.g. style), and then optimize the input image to have those properties. This kind of optimization is possible because the networks are differentiable with respect to their inputs: we can slightly tweak the image to better fit the desired properties, and then iteratively apply such tweaks in gradient descent.
Typically, we parameterize the input image as the RGB values of each pixel, but that isn’t the only way. As long as the mapping from parameters to images is differentiable, we can still optimize alternative parameterizations with gradient descent.
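A minimal sketch of this optimization loop in PyTorch (my own illustration of the general idea, not the article's code), assuming we keep the plain RGB pixel parameterization and maximize one class logit of a pretrained classifier; input normalization and the regularizers the article discusses are omitted for brevity:

import torch
import torchvision.models as models

model = models.vgg16(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)          # the network weights stay fixed

# parameterize the image directly as RGB pixel values
img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)

target_class = 130                   # hypothetical ImageNet class index

for step in range(200):
    optimizer.zero_grad()
    logits = model(img)
    loss = -logits[0, target_class]  # gradient ascent on the chosen logit
    loss.backward()
    optimizer.step()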

Friday, July 27, 2018

Thoughts On Machine Learning Accuracy

https://aws.amazon.com/blogs/aws/thoughts-on-machine-learning-accuracy/

This blog shares some brief thoughts on machine learning accuracy and bias.
Let’s start with some comments about a recent ACLU blog in which they run a facial recognition trial. Using Rekognition, the ACLU built a face database using 25,000 publicly available arrest photos and then performed facial similarity searches of that database using public photos of all current members of Congress. They found 28 incorrect matches out of 535, using an 80% confidence level; this is a 5% misidentification (sometimes called ‘false positive’) rate and a 95% accuracy rate. The ACLU has not published its data set, methodology, or results in detail, so we can only go on what they’ve publicly said. But, here are some thoughts on their claims:
The default confidence threshold for facial recognition APIs in Rekognition is 80%, which is good for a broad set of general use cases (such as identifying celebrities on social media or family members who look alike in a photos app), but it’s not the right one for public safety use cases. The 80% confidence threshold used by the ACLU is far too low to ensure the accurate identification of individuals; we would expect to see false positives at this level of confidence. We recommend 99% for use cases where highly accurate face similarity matches are important (as indicated in our public documentation).
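For concreteness, here is a small sketch of how a caller would raise that threshold with the AWS SDK for Python (boto3); the collection name and image file are placeholders of my own, not values from the blog:

import boto3

rekognition = boto3.client("rekognition")

with open("probe_photo.jpg", "rb") as f:
    image_bytes = f.read()

# FaceMatchThreshold=99 asks the service to return only matches with at least
# 99% similarity confidence, per the recommendation for public safety use cases
response = rekognition.search_faces_by_image(
    CollectionId="my-face-collection",   # placeholder collection name
    Image={"Bytes": image_bytes},
    FaceMatchThreshold=99,
    MaxFaces=5,
)

for match in response["FaceMatches"]:
    print(match["Face"]["FaceId"], match["Similarity"])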
1. To illustrate the impact of confidence threshold on false positives, we ran a test where we created a face collection using a dataset of over 850,000 faces commonly used in academia. We then used public photos of all members of US Congress (the Senate and House) to search against this collection in a similar way to the ACLU blog.
When we set the confidence threshold at 99% (as we recommend in our documentation), our misidentification rate dropped to 0% despite the fact that we are comparing against a larger corpus of faces (30x larger than ACLU’s tests). This illustrates how important it is for those using ‎technology to help with public safety issues to pick appropriate confidence levels, so they have few (if any) false positives.
2. In real-world public safety and law enforcement scenarios, Amazon Rekognition is almost exclusively used to help narrow the field and allow humans to expeditiously review and consider options using their judgment (and not to make fully autonomous decisions), where it can help find lost children, fight against human trafficking, or prevent crimes. Rekognition is generally only the first step in identifying an individual. In other use cases (such as social media), there isn’t the same need to double check so that confidence thresholds can be lower.
3. In addition to setting the confidence threshold far too low, the Rekognition results can be significantly skewed by using a facial database that is not appropriately representative and is itself skewed. In this case, the ACLU used a facial database of mugshots that may have had a material impact on the accuracy of the Rekognition findings.
4. The advantage of a cloud-based machine learning application like Rekognition is that it is constantly improving as we continue to improve the algorithm with more data. Our customers immediately get the benefit of those improvements. We continue to focus on our mission of making Rekognition the most accurate and powerful tool for identifying people, objects, and scenes – and that certainly includes ensuring that the results are free of any bias that impacts accuracy.  We’ve been able to add a lot of value for customers and the world at large already with Rekognition in the fight against human trafficking, reuniting lost children with their families, reducing fraud for mobile payments, and improving security, and we’re excited about continuing to help our customers and society at large with Rekognition in the future.
5. There is a general misconception that people can match faces to photos better than machines. In fact, the National Institute for Standards and Technology (“NIST”) recently shared a study of facial recognition technologies that are at least two years behind the state of the art used in Rekognition and concluded that even those older technologies can outperform human facial recognition abilities.
A final word about the misinterpreted ACLU results. When there are new technological advances, we all have to clearly understand what’s real and what’s not. There’s a difference between using machine learning to identify a food object and using machine learning to determine whether a face match should warrant considering any law enforcement action. The latter is serious business and requires much higher confidence levels. We continue to recommend that customers do not use less than 99% confidence levels for law enforcement matches, and then to only use the matches as one input across others that make sense for each agency. But, machine learning is a very valuable tool to help law enforcement agencies, and while being concerned it’s applied correctly, we should not throw away the oven because the temperature could be set wrong and burn the pizza. It is a very reasonable idea, however, for the government to weigh in and specify what temperature (or confidence levels) it wants law enforcement agencies to meet to assist in their public safety work.
Dr. Matt Wood

Neural Networks Are Essentially Polynomial Regression

https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/

You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a post doc in biology at Stanford; and Pete Mohanty,  a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.
A summary of the paper is:
  • We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
  • NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
  • One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
  • NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
  • We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.
  • We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.
Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NN, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.
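The paper's polyreg package is in R, but the flavour of the comparison is easy to sketch in Python; the toy example below (my own, not the authors' code) pits degree-2 polynomial regression against a small neural network on synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(1000, 2))
y = X[:, 0] ** 2 - 2 * X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pr = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)
nn = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(X_train, y_train)

print("polynomial regression R^2:", pr.score(X_test, y_test))
print("neural network R^2:       ", nn.score(X_test, y_test))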
Also available are the slides for our presentation at GRAIL on this project.

Wednesday, July 18, 2018

Foundations of Machine Learning

https://bloomberg.github.io/foml/#about

Understand the Concepts, Techniques and Mathematical Frameworks Used by Experts in Machine Learning

About This Course

Bloomberg presents "Foundations of Machine Learning," a training course that was initially delivered internally to the company's software engineers as part of its "Machine Learning EDU" initiative. This course covers a wide variety of topics in machine learning and statistical modeling. The primary goal of the class is to help participants gain a deep understanding of the concepts, techniques and mathematical frameworks used by experts in machine learning. It is designed to make valuable machine learning skills more accessible to individuals with a strong math background, including software developers, experimental scientists, engineers and financial professionals.
The 30 lectures in the course are embedded below, but may also be viewed in this YouTube playlist. The course includes a complete set of homework assignments, each containing a theoretical element and implementation challenge with support code in Python, which is rapidly becoming the prevailing programming language for data science and machine learning in both academia and industry. This course also serves as a foundation on which more specialized courses and further independent study can build.
Check back soon for how to register for our Piazza discussion board. Common questions from previous editions of the course are posted in our FAQ.
The first lecture, Black Box Machine Learning, gives a quick start introduction to practical machine learning and only requires familiarity with basic programming concepts.

Prerequisites

The quickest way to see if the mathematics level of the course is for you is to take a look at this mathematics assessment, which is a preview of some of the math concepts that show up in the first part of the course.
  • Solid mathematical background, equivalent to a 1-semester undergraduate course in each of the following: linear algebra, multivariate differential calculus, probability theory, and statistics. The content of NYU's DS-GA-1002: Statistical and Mathematical Methods would be more than sufficient, for example.
  • Python programming required for most homework assignments.
  • Recommended: At least one advanced, proof-based mathematics course
  • Recommended: Computer science background up to a "data structures and algorithms" course


Thursday, July 12, 2018

What’s New in Deep Learning Research: Understanding DeepMind’s IMPALA

https://towardsdatascience.com/whats-new-in-deep-learning-research-understanding-deepmind-s-impala-4fbfa5d0ad0c

Deep reinforcement learning has rapidly become one of the hottest research areas in the deep learning ecosystem. The fascination with reinforcement learning is related to the fact that, of all the deep learning modalities, it is the one that most closely resembles how humans learn. In the last few years, no company in the world has done more to advance the state of deep reinforcement learning than Alphabet's subsidiary DeepMind.
Since the launch of its famous AlphaGo agent, DeepMind has been at the forefront of reinforcement learning research. A few days ago, they published new research that attempts to tackle one of the most challenging aspects of reinforcement learning solutions: multi-tasking.
From the time we are infants, multi-tasking is an intrinsic element of our cognition. The ability to perform and learn similar tasks concurrently is essential to the development of the human mind. From a neuroscientific standpoint, multi-tasking remains largely a mystery, so it is not surprising that we have had a hard time implementing artificial intelligence (AI) agents that can efficiently learn multiple domains without requiring a disproportionate amount of resources. This challenge is even more evident in the case of deep reinforcement learning models, which are based on trial-and-error exercises that can easily cross the boundaries of a single domain. Biologically speaking, you can argue that all learning is a multi-tasking exercise.
Let's take a classic deep reinforcement learning scenario such as self-driving vehicles. In that scenario, AI agents need to concurrently learn different aspects such as distance, memory, or navigation while operating under rapidly changing parameters such as vision quality or speed. Most reinforcement learning methods today focus on learning a single task, and the models that tackle multi-task learning are too difficult to scale to be practical.
In their recent research, the DeepMind team proposed a new architecture for deep reinforcement multi-task learning called the Importance Weighted Actor-Learner Architecture (IMPALA). Inspired by another popular reinforcement learning architecture called A3C, IMPALA leverages a topology of different actors and learners that collaborate to build knowledge across different domains. Traditionally, deep reinforcement learning models use an architecture based on a single learner combined with multiple actors. In that model, each actor generates trajectories and sends them via a queue to the learner; before starting the next trajectory, the actor retrieves the latest policy parameters from the learner. IMPALA instead uses an architecture in which actors collect experience that is passed to a central learner, which computes the gradients, resulting in a model with completely independent actors and learners. This simple architecture enables the learner(s) to be accelerated using GPUs and the actors to be easily distributed across many machines.
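A toy sketch of that decoupling (a generic illustration in Python, not DeepMind's implementation): actors push whole trajectories onto a queue, and a single learner consumes them to compute updates:

import queue
import threading

trajectory_queue = queue.Queue(maxsize=100)

def actor(actor_id, policy_snapshot):
    # each actor rolls out its local copy of the policy and enqueues whole
    # trajectories; it only syncs parameters between trajectories
    for episode in range(10):
        trajectory = [("obs", "action", "reward")] * 20   # placeholder rollout
        trajectory_queue.put((actor_id, policy_snapshot, trajectory))

def learner():
    # the learner dequeues trajectories (possibly generated by slightly stale
    # policies, which is the lag that V-trace corrects for) and computes updates
    for _ in range(30):
        actor_id, behaviour_policy, trajectory = trajectory_queue.get()
        # ... compute gradients from the trajectory here ...

threads = [threading.Thread(target=actor, args=(i, "policy-v0")) for i in range(3)]
threads.append(threading.Thread(target=learner))
for t in threads:
    t.start()
for t in threads:
    t.join()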
In addition to the multi-actor architecture, the IMPALA research also introduces a new algorithm called V-trace that focuses on off-policy learning. The idea of V-trace is to mitigate the lag between when actions are generated by the actors and when the learner estimates the gradient.
The DeepMind team tested IMPALA on different scenarios using its famous DMLab-30 training set, and the results were impressive. IMPALA achieved better performance than A3C variants in terms of data efficiency, stability, and final performance. This might be the first deep reinforcement learning model that has been able to operate efficiently in multi-task environments.

Tuesday, July 10, 2018

Places: A 10 million image database for scene recognition


The Places dataset is designed following principles of human visual cognition. Our goal is to build a core of visual knowledge that can be used to train artificial systems for high-level visual understanding tasks, such as scene context, object recognition, action and event prediction, and theory-of-mind inference. The semantic categories of Places are defined by their function: the labels represent the entry-level of an environment. To illustrate, the dataset has different categories of bedrooms, or streets, etc., as one does not act the same way, and does not make the same predictions of what can happen next, in a home bedroom, a hotel bedroom, or a nursery.
In total, Places contains more than 10 million images comprising 400+ unique scene categories. The dataset features 5,000 to 30,000 training images per class, consistent with real-world frequencies of occurrence. Using convolutional neural networks (CNNs), the Places dataset allows learning of deep scene features for various scene recognition tasks, with the goal of establishing new state-of-the-art performance on scene-centric benchmarks. Here we provide the Places Database and the trained CNNs for academic research and education purposes.

http://places2.csail.mit.edu/

http://places2.csail.mit.edu/download.html

Monday, July 9, 2018

Big Data Small Machine

https://adamdrake.com/big-data-small-machine.html

Introduction

I was honored to be invited by DevTO to give a talk at their May meetup. The organizers were keen to have someone speak about high-performance machine learning, and I was happy to oblige.
The general thesis of the talk is that, for the purposes of machine learning, setting up large compute clusters is wholly unnecessary. Furthermore, it should generally be considered harmful as those efforts are extremely time consuming and detract from solving the actual machine learning problem at hand.
To illustrate the point, I showed an online learning approach to binary classification problems using logistic regression with adaptive learning rates. While some might dismiss this approach as too simplistic or ineffective, consider that it is not very different from what Google was (is?) using for some of their online advertising prediction systems. This was described in the wonderful paper Ad Click Prediction: a View from the Trenches.
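To give a flavour of that approach, here is a minimal sketch of my own (not the code from the talk): online logistic regression with a per-feature adaptive learning rate in the spirit of AdaGrad, processing one sparse example at a time so the full dataset never needs to sit in memory, let alone on a cluster:

import math
from collections import defaultdict

weights = defaultdict(float)
grad_sq = defaultdict(float)   # running sum of squared gradients per feature
alpha = 0.1                    # base learning rate

def predict(features):
    z = sum(weights[f] * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-max(min(z, 35.0), -35.0)))   # clipped sigmoid

def update(features, label):
    p = predict(features)
    for f, v in features.items():
        g = (p - label) * v                                # gradient of the log loss
        grad_sq[f] += g * g
        weights[f] -= alpha * g / (math.sqrt(grad_sq[f]) + 1e-8)

# stream over examples one at a time, e.g. sparse feature dicts
update({"bias": 1.0, "country=DE": 1.0, "hour=23": 1.0}, label=1)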
As in previous summaries of my lectures, I’ll reference select slides by section header and provide the explanation that went along with the slide, including some elaboration I may not have had time for in the lecture itself.

Claims

In my lecture I made a few general claims:
  • RAM in machines used to process data is growing more quickly than the data itself
  • There are many techniques for dealing with so-called Big Data, none of which involve clusters or heavy data infrastructure components like Kafka, Hadoop, Spark, and so on
  • One machine is fine for machine learning tasks, i.e., actually training ML models

How well do IBM, Microsoft, and Face++ AI services guess the gender of a face?



http://gendershades.org/index.html

Sunday, July 8, 2018

Semantic segmentation on aerial and satellite imagery

https://github.com/mapbox/robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water


RoboSat is an end-to-end pipeline written in Python 3 for feature extraction from aerial and satellite imagery. Features can be anything visually distinguishable in the imagery for example: buildings, parking lots, roads, or cars.
The tools RoboSat comes with can be categorized as follows:
  • data preparation: creating a dataset for training feature extraction models
  • training and modeling: segmentation models for feature extraction in images
  • post-processing: turning segmentation results into cleaned and simple geometries
Tools work with the Slippy Map tile format to abstract away geo-referenced imagery behind tiles of the same size.
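For context on the Slippy Map tile scheme (this is the standard OpenStreetMap lon/lat-to-tile conversion, not RoboSat's own code), mapping a coordinate and zoom level to integer tile indices looks roughly like this:

```python
# Standard Slippy Map (OSM) conversion from lon/lat + zoom to tile x/y indices;
# shown for context only.
import math

def lonlat_to_tile(lon, lat, zoom):
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

# Example: a tile over central Melbourne at zoom 18
print(lonlat_to_tile(144.9631, -37.8136, 18))
```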

Friday, July 6, 2018

A Tour of The Top 10 Algorithms for Machine Learning Newbies


In machine learning, there’s something called the “No Free Lunch” theorem. In a nutshell, it states that no one algorithm works best for every problem, and it’s especially relevant for supervised learning (i.e. predictive modeling).
For example, you can’t say that neural networks are always better than decision trees or vice-versa. There are many factors at play, such as the size and structure of your dataset.
As a result, you should try many different algorithms for your problem, while using a hold-out “test set” of data to evaluate performance and select the winner.
Of course, the algorithms you try must be appropriate for your problem, which is where picking the right machine learning task comes in. As an analogy, if you need to clean your house, you might use a vacuum, a broom, or a mop, but you wouldn’t bust out a shovel and start digging.
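A minimal scikit-learn sketch of that workflow, with a placeholder dataset and an arbitrary set of candidate models, might look like this:

```python
# Compare several candidate algorithms on a held-out test set and keep the
# winner; the dataset and the candidate models are placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))   # pick the winner on held-out data
```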

The Big Principle

However, there is a common principle that underlies all supervised machine learning algorithms for predictive modeling.
Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y): Y = f(X)
This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don’t know what the function (f) looks like or its form. If we did, we would use it directly and we would not need to learn it from data using machine learning algorithms.
The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics and our goal is to make the most accurate predictions possible.
For machine learning newbies who are eager to understand the basics of machine learning, here is a quick tour of the top 10 machine learning algorithms used by data scientists.

1 — Linear Regression

Linear regression is perhaps one of the most well-known and well-understood algorithms in statistics and machine learning.
Predictive modeling is primarily concerned with minimizing the error of a model or making the most accurate predictions possible, at the expense of explainability. We will borrow, reuse and steal algorithms from many different fields, including statistics, and use them towards these ends.
The representation of linear regression is an equation that describes a line that best fits the relationship between the input variables (x) and the output variables (y), by finding specific weightings for the input variables called coefficients (B).
For example: y = B0 + B1 * x
We will predict y given the input x, and the goal of the linear regression learning algorithm is to find the values for the coefficients B0 and B1.
Different techniques can be used to learn the linear regression model from data, such as a linear algebra solution for ordinary least squares and gradient descent optimization.
Linear regression has been around for more than 200 years and has been extensively studied. Some good rules of thumb when using this technique are to remove variables that are very similar (correlated) and to remove noise from your data, if possible. It is a fast and simple technique and a good first algorithm to try.
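As a small illustration (with toy data, not from the article), fitting y = B0 + B1 * x by ordinary least squares with NumPy takes only a few lines:

```python
# Fit y = B0 + B1 * x by ordinary least squares on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack([np.ones_like(x), x])        # add the intercept column
B0, B1 = np.linalg.lstsq(X, y, rcond=None)[0]    # closed-form least squares
print(B0, B1)
print(B0 + B1 * 6.0)                             # predict y for a new x
```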

2 — Logistic Regression

Logistic regression is another technique borrowed by machine learning from the field of statistics. It is the go-to method for binary classification problems (problems with two class values).
Logistic regression is like linear regression in that the goal is to find the values for the coefficients that weight each input variable. Unlike linear regression, the prediction for the output is transformed using a non-linear function called the logistic function.
The logistic function looks like a big S and will transform any value into the range 0 to 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 (e.g. if the output is less than 0.5, predict 0; otherwise, predict 1) and predict a class value.
Because of the way that the model is learned, the predictions made by logistic regression can also be used as the probability of a given data instance belonging to class 0 or class 1. This can be useful for problems where you need to give more rationale for a prediction.
Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It's a fast model to learn and effective on binary classification problems.
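A tiny illustration of that transformation, with made-up coefficients, shows how a fitted model turns a score into a probability and then a class:

```python
# The logistic (sigmoid) function turning a linear score into a probability
# and a 0/1 prediction; the coefficients here are made up.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

B0, B1 = -1.0, 2.0          # illustrative learned coefficients
x = 0.8
p = sigmoid(B0 + B1 * x)    # probability of class 1
prediction = 1 if p >= 0.5 else 0
print(p, prediction)
```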

3 — Linear Discriminant Analysis

Logistic Regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then the Linear Discriminant Analysis algorithm is the preferred linear classification technique.
The representation of LDA is pretty straightforward. It consists of statistical properties of your data, calculated for each class. For a single input variable this includes:
  1. The mean value for each class.
  2. The variance calculated across all classes.
Predictions are made by calculating a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from your data beforehand. It's a simple and powerful method for classification predictive modeling problems.
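A hedged, toy-data sketch of that recipe for a single input variable (one mean per class, a shared variance, and the largest discriminant value wins) might look like this:

```python
# LDA with a single input variable: per-class means, a pooled variance, class
# priors, and the Gaussian discriminant used to pick a class (toy data).
import numpy as np

x = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.9])
y = np.array([0,   0,   0,   1,   1,   1])

classes = np.unique(y)
means = {c: x[y == c].mean() for c in classes}
pooled_var = np.mean([x[y == c].var(ddof=1) for c in classes])   # shared variance
priors = {c: np.mean(y == c) for c in classes}

def discriminant(value, c):
    # linear discriminant for one variable with a shared variance
    return value * means[c] / pooled_var - means[c] ** 2 / (2 * pooled_var) + np.log(priors[c])

new_x = 2.5
print(max(classes, key=lambda c: discriminant(new_x, c)))   # predicted class -> 1
```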

4 — Classification and Regression Trees

Decision Trees are an important type of algorithm for predictive modeling in machine learning.
The representation of the decision tree model is a binary tree. This is your binary tree from algorithms and data structures, nothing too fancy. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).
The leaf nodes of the tree contain an output variable (y) which is used to make a prediction. Predictions are made by walking the splits of the tree until arriving at a leaf node and outputting the class value at that leaf node.
Trees are fast to learn and very fast for making predictions. They are also often accurate for a broad range of problems and do not require any special preparation for your data.
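As an illustration (a hand-built toy tree rather than a learned one), making a prediction really is just walking the splits until you hit a leaf:

```python
# Walking a tiny hand-built decision tree: each internal node tests one numeric
# input variable against a split point (toy example, hypothetical feature names).
tree = {
    "feature": "height_cm", "split": 170.0,
    "left":  {"leaf": "class_0"},                       # height < 170
    "right": {"feature": "weight_kg", "split": 80.0,    # height >= 170
              "left": {"leaf": "class_0"},
              "right": {"leaf": "class_1"}},
}

def predict(node, row):
    while "leaf" not in node:
        branch = "left" if row[node["feature"]] < node["split"] else "right"
        node = node[branch]
    return node["leaf"]

print(predict(tree, {"height_cm": 182, "weight_kg": 95}))   # -> class_1
```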

5 — Naive Bayes

Naive Bayes is a simple but surprisingly powerful algorithm for predictive modeling.
The model consists of two types of probabilities that can be calculated directly from your training data: 1) the probability of each class; and 2) the conditional probability of each x value given each class. Once calculated, the probability model can be used to make predictions for new data using Bayes' theorem. When your data is real-valued it is common to assume a Gaussian distribution (bell curve) so that you can easily estimate these probabilities.
Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; nevertheless, the technique is very effective on a large range of complex problems.
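A hedged from-scratch sketch with toy data and a single real-valued input shows both ingredients: a prior per class and a per-class Gaussian likelihood, combined via Bayes' theorem.

```python
# Gaussian Naive Bayes in miniature (toy data, one feature): class priors plus
# per-class Gaussian likelihoods, combined to score each class.
import numpy as np

X = np.array([[1.0], [1.2], [0.9], [3.1], [3.0], [2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(v, mean, std):
    return np.exp(-((v - mean) ** 2) / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)

stats = {}
for c in np.unique(y):
    Xc = X[y == c]
    stats[c] = (np.mean(y == c), Xc.mean(axis=0), Xc.std(axis=0, ddof=1))

def predict(row):
    scores = {c: prior * np.prod(gaussian_pdf(row, mean, std))
              for c, (prior, mean, std) in stats.items()}
    return max(scores, key=scores.get)     # class with highest P(class) * P(x | class)

print(predict(np.array([2.9])))            # -> 1
```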

6 — K-Nearest Neighbors

The KNN algorithm is very simple and very effective. The model representation for KNN is the entire training dataset. Simple right?
Predictions are made for a new data point by searching through the entire training set for the K most similar instances (the neighbors) and summarizing the output variable for those K instances. For regression problems, this might be the mean output value; for classification problems, it might be the mode (or most common) class value.
The trick is in how to determine the similarity between the data instances. The simplest technique if your attributes are all of the same scale (all in inches for example) is to use the Euclidean distance, a number you can calculate directly based on the differences between each input variable.
KNN can require a lot of memory or space to store all of the data, but it only performs a calculation (or learns) when a prediction is needed, just in time. You can also update and curate your training instances over time to keep predictions accurate.
The idea of distance or closeness can break down in very high dimensions (lots of input variables) which can negatively affect the performance of the algorithm on your problem. This is called the curse of dimensionality. It suggests you only use those input variables that are most relevant to predicting the output variable.
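A from-scratch sketch with toy data makes the whole algorithm explicit: Euclidean distance to every training point, then a majority vote among the K closest.

```python
# K-nearest neighbors from scratch on toy data: Euclidean distance to all
# training points, then a majority vote among the K nearest.
import numpy as np
from collections import Counter

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.3]])
y_train = np.array([0, 0, 1, 1])

def knn_predict(x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)    # Euclidean distance
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority class

print(knn_predict(np.array([2.5, 2.5])))                    # -> 1
```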

7 — Learning Vector Quantization

A downside of K-Nearest Neighbors is that you need to hang on to your entire training dataset. The Learning Vector Quantization algorithm (or LVQ for short) is an artificial neural network algorithm that allows you to choose how many training instances to hang onto and learns exactly what those instances should look like.
The representation for LVQ is a collection of codebook vectors. These are selected randomly in the beginning and adapted to best summarize the training dataset over a number of iterations of the learning algorithm. Once learned, the codebook vectors can be used to make predictions just like K-Nearest Neighbors. The most similar neighbor (best matching codebook vector) is found by calculating the distance between each codebook vector and the new data instance. The class value (or real value in the case of regression) for the best matching unit is then returned as the prediction. Best results are achieved if you rescale your data to have the same range, such as between 0 and 1.
If you discover that KNN gives good results on your dataset try using LVQ to reduce the memory requirements of storing the entire training dataset.
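A bare-bones LVQ1 sketch on toy data (with one codebook vector per class, randomly initialized from the training rows, which is a simplification of the general setup) illustrates the attract/repel update and the 1-nearest-neighbor prediction:

```python
# Bare-bones LVQ1 on toy data: codebook vectors are pulled toward training rows
# of the same class and pushed away from rows of a different class.
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.1], [2.9, 3.3]])
y = np.array([0, 0, 1, 1])

rng = np.random.default_rng(0)
cb_labels = np.unique(y)
codebooks = np.array([X[y == c][rng.integers((y == c).sum())] for c in cb_labels], dtype=float)

lrate = 0.3
for epoch in range(20):
    for row, label in zip(X, y):
        best = np.argmin(np.linalg.norm(codebooks - row, axis=1))   # best matching unit
        direction = 1.0 if cb_labels[best] == label else -1.0
        codebooks[best] += direction * lrate * (row - codebooks[best])

# Predict like 1-nearest-neighbor against the learned codebook vectors
new = np.array([2.8, 3.0])
print(cb_labels[np.argmin(np.linalg.norm(codebooks - new, axis=1))])   # -> 1
```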

8 — Support Vector Machines

Support Vector Machines are perhaps one of the most popular and talked about machine learning algorithms.
A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1. In two dimensions, you can visualize this as a line, and let's assume that all of our input points can be completely separated by this line. The SVM learning algorithm finds the coefficients that result in the best separation of the classes by the hyperplane.
The distance between the hyperplane and the closest data points is referred to as the margin. The best or optimal hyperplane that can separate the two classes is the line that has the largest margin. Only these closest points are relevant in defining the hyperplane and in the construction of the classifier. These points are called the support vectors; they support or define the hyperplane. In practice, an optimization algorithm is used to find the values for the coefficients that maximize the margin.
SVM might be one of the most powerful out-of-the-box classifiers and worth trying on your dataset.
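A quick scikit-learn sketch on toy data shows the fitted model exposing exactly those training points that define the maximum-margin hyperplane:

```python
# Linear SVM on toy data; the support vectors are the training points that
# define the maximum-margin hyperplane.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 0.5], [3.0, 3.0], [3.5, 2.5]])
y = np.array([0, 0, 1, 1])

model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

print(model.support_vectors_)        # the points that define the hyperplane
print(model.predict([[2.8, 2.9]]))   # -> [1]
```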

9 — Bagging and Random Forest

Random Forest is one of the most popular and most powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation or bagging.
The bootstrap is a powerful statistical method for estimating a quantity, such as a mean, from a data sample. You take lots of samples of your data, calculate the mean of each, then average all of your mean values to give you a better estimate of the true mean value.
In bagging, the same approach is used, but for estimating entire statistical models, most commonly decision trees. Multiple samples of your training data are taken, and then a model is constructed for each data sample. When you need to make a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
Random forest is a tweak on this approach where decision trees are created so that rather than selecting optimal split points, suboptimal splits are made by introducing randomness.
The models created for each sample of the data are therefore more different than they otherwise would be, but still accurate in their unique and different ways. Combining their predictions results in a better estimate of the true underlying output value.
If you get good results with an algorithm with high variance (like decision trees), you can often get better results by bagging that algorithm.
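A short sketch (placeholder dataset and settings) puts the two sides of the idea next to each other: the bootstrap estimate of a simple quantity, and a random forest, which bags many randomized decision trees:

```python
# (1) Bootstrap estimate of a mean; (2) a random forest, i.e. bagging of many
# randomized decision trees. Dataset and settings are placeholders.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print(np.mean(boot_means))                       # bootstrap estimate of the mean

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))              # averaged votes of many trees
```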

10 — Boosting and AdaBoost

Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models is reached.
AdaBoost was the first really successful boosting algorithm developed for binary classification. It is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.
AdaBoost is used with short decision trees. After the first tree is created, the performance of the tree on each training instance is used to weight how much attention the next tree should pay to each training instance. Training data that is hard to predict is given more weight, whereas easy-to-predict instances are given less weight. Models are created sequentially one after the other, each updating the weights on the training instances, which affects the learning performed by the next tree in the sequence. After all the trees are built, predictions are made for new data, and the contribution of each tree is weighted by how accurate it was on the training data.
Because so much attention is put on correcting mistakes by the algorithm, it is important that you have clean data with outliers removed.
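A quick scikit-learn sketch (placeholder dataset; AdaBoostClassifier's default weak learner is already a depth-1 decision tree, i.e. a stump) shows how little code the method needs:

```python
# AdaBoost with short decision trees; the default weak learner in scikit-learn's
# AdaBoostClassifier is a depth-1 decision tree (a stump). Dataset is a placeholder.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```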

Last Takeaway

A typical question asked by a beginner, when facing a wide variety of machine learning algorithms, is “which algorithm should I use?” The answer to the question varies depending on many factors, including: (1) The size, quality, and nature of data; (2) The available computational time; (3) The urgency of the task; and (4) What you want to do with the data.
Even an experienced data scientist cannot tell which algorithm will perform the best before trying different algorithms. Although there are many other Machine Learning algorithms, these are the most popular ones. If you’re a newbie to Machine Learning, these would be a good starting point to learn.