Saturday, April 16, 2016

Have You Tried Using a 'Nearest Neighbor Search'?


http://www.cachestocaches.com/2016/3/have-you-tried-nearest-neighbor/

Roughly a year and a half ago, I had the privilege of taking a graduate "Introduction to Machine Learning" course under the tutelage of the fantastic Professor Leslie Kaelbling. While I learned a great deal over the course of the semester, there was one minor point that she made to the class which stuck with me more than I expected it to at the time: before using a really fancy or sophisticated or "in-vogue" machine learning algorithm to solve your problem, try a simple Nearest Neighbor Search first.
Let's say I gave you a bunch of data points, each with a location in space and a value, and then asked you to predict the value of a new point in space. What would you do? Perhaps the values of your data are binary (just +s and -s) and you've heard of Support Vector Machines. Should you give that a shot? Maybe the values are continuous (anything on the real number line) and you feel like giving Linear Regression a whirl. Or you've heard of the ongoing Deep Learning Revolution and feel sufficiently bold to tackle this problem with a Neural Net.
...or you could simply find the closest point in your dataset to the one you're interested in and offer up the value of this "nearest neighbor". A Nearest Neighbor Search is perhaps the simplest procedure you might conceive of if presented with a machine-learning-type problem while under the influence of some sort of generalized "veil of ignorance". Though there exist slightly more complicated variations on the algorithm, the basic principle of all of them is effectively the same.
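To make the idea concrete, here is a minimal sketch of a one-nearest-neighbor prediction (my own illustration in Python with NumPy; the function and variable names are hypothetical, not from the course):

import numpy as np

def nearest_neighbor_predict(train_points, train_values, query_point):
    # Euclidean distance from the query to every known point
    distances = np.linalg.norm(train_points - query_point, axis=1)
    # The prediction is simply the value attached to the closest point
    return train_values[np.argmin(distances)]

# Toy usage: two clusters of 2-D points labeled '-' and '+'
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8]])
values = np.array(['-', '-', '+', '+'])
print(nearest_neighbor_predict(points, values, np.array([4.7, 5.2])))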

Why Would I Use Something So ... Simple?

Of course, I've swept some potential complications under the rug here. Perhaps the most important of these is the question "how do I compute the distance?" For simple problems, this may be easy, but there are indeed cases in which the distance must be specially defined. In addition, if you don't have very many points in your initial data set, the performance of this approach is questionable (though such a case is generally enough to give most machine learning researchers pause).
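The sketch above extends naturally to a specially defined distance: you simply swap in whatever metric suits the data. Below is a hypothetical variant (again my own, with made-up names) that takes the distance function as an argument, using a Hamming distance on binary feature vectors as an example:

import numpy as np

def hamming_distance(a, b):
    # Number of positions at which two binary vectors differ
    return int(np.sum(a != b))

def nearest_neighbor_with_metric(train_points, train_values, query, distance):
    # Score every training point with the supplied distance function
    distances = [distance(p, query) for p in train_points]
    # Return the value of the point with the smallest distance
    return train_values[int(np.argmin(distances))]

points = np.array([[1, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]])
values = np.array(['+', '-', '+'])
print(nearest_neighbor_with_metric(points, values, np.array([1, 0, 1, 0]), hamming_distance))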
Obviously, using a nearest neighbor search probably isn't going to be the answer to most of your problems, but there are compelling reasons to try it before going down another, more complicated path. It's relatively easy to implement, which means that getting preliminary readings from your data can be a much quicker process. In addition, a nearest neighbor search can often provide a reasonable baseline performance against which more sophisticated algorithms can be compared.
This last term, I was fortunate enough to serve as a Teaching Assistant for the same Machine Learning course I mentioned above. The final project in the course encourages the students to find creative ways of applying some of the more sophisticated algorithms we've taught them to data sets of their own. Many of the students would come to me during their mid-term meetings with all kinds of statistics about the performance of their algorithms. However, since I had likely never encountered their particular data set before, telling me that they were able to get "over 90% accuracy" was meaningless without something to compare it to; on some medical data sets, merely getting above 50% predictive accuracy is often stellar performance. I found myself asking most of them the same question:
Have you tried using a 'Nearest Neighbor Search'?
Machine learning is, in many ways, a science of comparison. Algorithms are compared against one another on established data sets, which themselves are adopted via some sort of popularity metric. Nearest neighbor search is so simple and easy to implement that it can serve as a reasonable baseline when other methods are more difficult to test against. Nowadays, the highest-performing machine learning algorithms are compared to human performance, and have recently even become capable of beating us.
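As a concrete illustration of treating nearest neighbors as a baseline, here is a minimal sketch using scikit-learn (a library and data set of my choosing, not mentioned in the original post), comparing a 1-nearest-neighbor classifier against a support vector machine on one of scikit-learn's built-in data sets:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Load a small benchmark data set and hold out a test split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "fancy" model and the nearest-neighbor baseline, side by side
baseline = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
fancy = SVC().fit(X_train, y_train)

print("1-NN baseline accuracy:", baseline.score(X_test, y_test))
print("SVM accuracy:", fancy.score(X_test, y_test))

If the sophisticated model can't beat the nearest-neighbor number by a meaningful margin, that "over 90% accuracy" figure deserves a second look.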

Takeaway: More Complicated ≠ Better

Certainly there are those of you who recognize that a nearest neighbor search isn't even applicable in certain application domains. However, there's a more general lesson to be learned here: in the midst of an age characterized by algorithms of famously "unreasonable effectiveness", it's important to remember that simpler techniques are still powerful enough to solve many problems. Using a neural network often isn't the best approach to take, even though they've dominated headlines for the last few years.
I was recently tasked with an object detection problem as part of a demo I had to prepare. My job was to detect a vehicle — not a specific type of vehicle, and not a make or model of car — a single, specific vehicle that we have in our garage. My first instinct was to pull up the cutting-edge work in real-time object detection (which I did), get it to work on my own machine (which I also did), and train it with a massive number of images of our particular vehicle (which I was unable to do). Neural networks notoriously require massive amounts of data; this Google Neural Network paper is capable of classifying 1,000 different types of images and was trained on over a million photos. I quickly discovered that such a system was overkill for me, and resorted to using an open source implementation of a simpler algorithm. Not only was this easier to do, since there was fairly decent documentation, but the performance, while not "cutting edge", was more than enough to satisfy the relatively lax constraints of our simple demo.
The takeaway here is that though simpler algorithms may not perform quite as well as the state of the art, the savings in both time and computational complexity often outweigh the difficulties associated with more sophisticated solutions. So the next time you're faced with an unknown machine learning problem, remember to give Nearest Neighbor Search a try.

Open Biometrics

http://openbiometrics.org/

Open Source Biometric Recognition
A communal biometrics framework supporting the development of open algorithms and reproducible evaluations.

Development

OpenBR is supported on Windows, Mac OS X, and Debian Linux. The project is licensed under Apache 2.0 and releases follow the Semantic Versioning convention. Internally the code base uses the CMake build system and requires Qt and OpenCV.

Support

Please reach out on our mailing list or IRC channel:
Need enterprise support? Hire the core development team! founders@rankone.io

Thursday, April 7, 2016

Machine Learning Introduction by Jeremy Kun

http://jeremykun.com/2012/08/04/machine-learning-introduction/

A Series on Machine Learning

These days an absolutely staggering amount of research and development work goes into the very coarsely defined field of “machine learning.” Part of the reason why it’s so coarsely defined is because it borrows techniques from so many different fields. Many problems in machine learning can be phrased in different but equivalent ways. While they are often purely optimization problems, such techniques can be expressed in terms of statistical inference, have biological interpretations, or have a distinctly geometric and topological flavor. As a result, machine learning has come to be understood as a toolbox of techniques as opposed to a unified theory.
It is unsurprising, then, that such a multitude of mathematics supports this diversified discipline. Practitioners (that is, algorithm designers) rely on statistical inference, linear algebra, convex optimization, and dabble in graph theory, functional analysis, and topology. Of course, above all else machine learning focuses on algorithms and data.
The general pattern, which we’ll see over and over again as we derive and implement various techniques, is to develop an algorithm or mathematical model, test it on datasets, and refine the model based on specific domain knowledge. The first step usually involves a leap of faith based on some mathematical intuition. The second step commonly involves a handful of established and well understood datasets (often taken from the University of California at Irvine’s machine learning database, and there is some controversy over how ubiquitous this practice is). The third step often seems to require some voodoo magic to tweak the algorithm and the dataset to complement one another.
It is this author’s personal belief that the most important part of machine learning is the mathematical foundation, followed closely by efficiency in implementation details. The thesis is that natural data has inherent structure, and that the goal of machine learning is to represent this and utilize it. To make true progress, one must represent and analyze structure abstractly. And so this blog will focus predominantly on mathematical underpinnings of the algorithms and the mathematical structure of data.

General Plans

While we do intend to cover the classical topics of machine learning, such as neural networks and decision trees, we would like to quickly approach more sophisticated modern techniques such as support vector machines and methods based on Kolmogorov complexity. And so we put forth the ambitious list of topics (in no particular order). [update: it’s been a while since this initial post, and we’ve covered some of the topics listed below, as indicated by the links]
This long and circuitous journey will inevitably require arbitrarily large but finite detours to cover the mathematical background. We’ll cover metric spaces, functional analysis, mathematical statistics and probability theory, abstract algebra, topology, and even some category theory. Note that some of the more esoteric (i.e., advanced) topics will have their own series as well (for instance, we’ve had an itch to do computational category theory but having covered none of the typical concrete applications of category theory, the jump into extreme abstraction would come off as pointlessly complicated).
Of course, as we’ve mentioned before, while the mathematics is motivated by our desire to connect ideas, programming is motivated by what we can do. And so we’re interested in using machine learning methods to perform cool tasks. Some ideas we plan to implement on this blog include social network analysis, machine vision and optical character recognition, spam classification, natural language processing, speech recognition, and content classification and recommendation.
Finally, we are interested in the theoretical boundaries on what is possible for a computer to learn. Aside from its practical use, this area of study would require us to rigorously define what it means for a machine to “learn.” This field is known as computational learning theory, and a good deal of it is devoted to the typical complexity-theory-type questions such as “can this class of classification functions be learned in polynomial time?” In addition, this includes learning theories of a more statistical flavor, such as the “Probably Approximately Correct” model. We plan to investigate each of these models in turn as they come up in our explorations.
If any readers have suggestions for additional machine learning topics (to add to this already gargantuan list), feel free to pipe in with a comment! We’ll begin with an exploration of the simplest algorithm on the above list, k nearest neighbors, and a more rigorous exploration of metric spaces.
Until then!

Deep Learning Book

http://www.deeplearningbook.org/



Deep Learning

An MIT Press book

Ian Goodfellow, Yoshua Bengio and Aaron Courville


The Deep Learning textbook is a resource intended to help students and practitioners enter the field of machine learning in general and deep learning in particular. The online version of the book is now complete and will remain available online for free. The print version will be available for sale soon.

Citing the book

To cite this book, please use this bibtex entry:
@unpublished{Goodfellow-et-al-2016-Book,
    title={Deep Learning},
    author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
    note={Book in preparation for MIT Press},
    url={http://www.deeplearningbook.org},
    year={2016}
}

FAQ

  • Can I get a PDF of this book? No, our contract with MIT Press forbids distribution of too easily copied electronic formats of the book.
  • Why are you using HTML format for the drafts? This format is a sort of weak DRM required by our contract with MIT Press. It's intended to discourage unauthorized copying/editing of the book. Unfortunately, the conversion from PDF to HTML is not perfect, and some things like subscript expressions do not render correctly. If you have a suggestion for a better way of making the book available to a wide audience while preventing unauthorized copies, please let us know.
  • What is the best way to print the HTML format? Printing seems to work best when done directly from the browser, using Chrome. Other browsers do not work as well. In particular, the Edge browser displays the "does not equal" sign as the "equals" sign in some cases.
  • When will the book come out? It's difficult to predict. MIT Press is currently preparing the book for printing. Please contact us if you are interested in using the textbook for course materials in the short term; we will put you in contact with MIT Press.
If you notice any typos, etc., do not hesitate to contact any of the authors directly by e-mail: Ian (<lastname.firstname>@gmail.com), Yoshua (<firstname>.<lastname>@umontreal.ca), Aaron (<firstname>.<lastname>@gmail.com). The book is now complete and we are not currently making revisions beyond correcting any minor errors that remain.

Monday, April 4, 2016

TFLearn: Deep learning library featuring a higher-level API for TensorFlow

https://github.com/tflearn/tflearn


TFLearn is a modular and transparent deep learning library built on top of TensorFlow. It was designed to provide a higher-level API to TensorFlow in order to facilitate and speed up experimentation, while remaining fully transparent and compatible with it.
TFLearn features include:
  • Easy-to-use and understand high-level API for implementing deep neural networks, with tutorial and examples.
  • Fast prototyping through highly modular built-in neural network layers, regularizers, optimizers, metrics...
  • Full transparency over Tensorflow. All functions are built over tensors and can be used independently of TFLearn.
  • Powerful helper functions to train any TensorFlow graph, with support of multiple inputs, outputs and optimizers.
  • Easy and beautiful graph visualization, with details about weights, gradients, activations and more...
  • Effortless device placement for using multiple CPU/GPU.
The high-level API currently supports most recent deep learning models, such as Convolutions, LSTM, BiRNN, BatchNorm, PReLU, Residual networks, Generative networks... In the future, TFLearn is also intended to stay up-to-date with the latest deep learning techniques.
Note: This is the first release of TFLearn. Contributions are more than welcome!

Overview

# Classification
import tflearn

# X, Y below stand for the training features and one-hot labels; they are
# assumed to have been loaded beforehand (e.g. MNIST images flattened to 784).
tflearn.init_graph(num_cores=8, gpu_memory_fraction=0.5)

net = tflearn.input_data(shape=[None, 784])
net = tflearn.fully_connected(net, 64)
net = tflearn.dropout(net, 0.5)
net = tflearn.fully_connected(net, 10, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.DNN(net)
model.fit(X, Y)

# Sequence Generation
# idx is assumed to be a token-to-index dictionary for the sequence data.
net = tflearn.input_data(shape=[None, 100, 5000])
net = tflearn.lstm(net, 64)
net = tflearn.dropout(net, 0.5)
net = tflearn.fully_connected(net, 5000, activation='softmax')
net = tflearn.regression(net, optimizer='adam', loss='categorical_crossentropy')

model = tflearn.SequenceGenerator(net, dictionary=idx, seq_maxlen=100)
model.fit(X, Y)
model.generate(50, temperature=1.0)
There are many more examples available here.

Installation

TensorFlow Installation
TFLearn requires TensorFlow (version >= 0.7) to be installed: TensorFlow installation instructions.
TFLearn Installation
To install TFLearn, the easiest way is to run:
pip install git+https://github.com/tflearn/tflearn.git
Otherwise, you can also install from source by running (from source folder):
python setup.py install

Getting Started

See Getting Started with TFLearn for a tutorial to learn more about TFLearn functionalities.

Examples

There are many neural network implementations available; see Examples.

Documentation

http://tflearn.org/documentation.

Model Visualization

Graph
Graph Visualization
Loss & Accuracy (multiple runs)
Loss Visualization
Layers
Layers Visualization

Contributions

This is the first release of TFLearn; if you find any bugs, please report them in the GitHub issues section.
Improvements and requests for new features are more than welcome! Do not hesitate to twist and tweak TFLearn, and send pull-requests.
For more info: Contribute to TFLearn.

License

MIT License

Saturday, April 2, 2016

12 Free (as in beer) Data Mining Books

I was doing some research on an algorithm this morning and came across a new book that I wasn’t aware of. That prompted me to look for more. The list of what I came up with is below.
Each of these is free-as-in-beer, which means you can download the complete version without being expected to give anything in return. I think most of them are available for purchase as well, if you prefer a hard copy. Some of them include code samples in R, Python or MATLAB.
Regardless of your background, skills or goals, there’s something for you in this list. Here they are, in no particular order.

  1. An Introduction to Statistical Learning with Applications in R by James, Witten, Hastie & Tibshirani – This book is fantastic and has helped me quite a bit. It provides an overview of several methods, along with the R code for how to complete them. 426 Pages.
  2. The Elements of Statistical Learning by Hastie, Tibshirani & Friedman – This is an in-depth overview of methods, complete with theory, derivations & code. I’d definitely consider this a graduate level text. I’d also consider it one of the best books available on the topic of data mining. 745 Pages.
  3. A Programmer’s Guide to Data Mining by Ron Zacharski – This one is an online book, each chapter downloadable as a PDF. It’s also still in progress, with chapters being added a few times each year.
  4. Probabilistic Programming & Bayesian Methods for Hackers by Cam Davidson-Pilon – This book is absolutely fantastic. The author explains Bayesian statistics, provides several diverse examples of how to apply them and includes Python code. Each chapter is an iPython notebook that can be downloaded.
  5. Think Bayes, Bayesian Statistics Made Simple by Allen B. Downey – Another great, easy to digest introduction to Bayesian statistics. The author’s premise is that Bayesian statistics is easier to learn & apply within the context of reusable code samples. It includes a number of examples complete with Python code. 195 Pages.
  6. Data Mining and Analysis, Fundamental Concepts and Algorithms by Zaki & Meira – This title is new to me. It’s a text book that looks to be a complete introduction with derivations & plenty of sample problems. 599 Pages.
  7. An Introduction to Data Science by Jeffrey Stanton – Overview of the skills required to succeed in data science, with a focus on the tools available within R. It has sections on interacting with the Twitter API from within R, text mining, plotting, regression as well as more complicated data mining techniques. 195 Pages.
  8. Machine Learning by Chebira, Mellouk & others – This is an introduction to more advanced machine learning methods. It includes chapters on neural networks, discriminant analysis, natural language processing, regression trees & more, complete with derivations. Each chapter is downloadable as a PDF. 422 Pages.
  9. Machine Learning – The Complete Guide – This one is new to me. It's a collection of Wikipedia articles organized into chapters & downloadable in a number of formats. I didn't realize they did this, but it's a great idea. Because it's a collection of individual articles, it covers quite a bit more material than a single author could write. This is an incredible resource.
  10. Bayesian Reasoning and Machine Learning by David Barber – This is an undergraduate textbook. It includes an overview, derivations, sample problems and MATLAB code. 648 Pages.
  11. A Course in Machine Learning by Hal Daumé III – Another complete introduction to machine learning topics. Each chapter is individually downloadable. 189 Pages.
  12. Information Theory, Inference and Learning Algorithms by David J.C. MacKay – Nice overview of machine learning topics, including an introduction and derivations. One nice feature of this book is that it has a chart that shows how various topics are related to one another. 628 Pages.
I love that so much material is available for free. All you need is time & motivation and you can become an expert on this topic. If you were to select only one book to pursue from this list, I'd recommend either of the first two.
Source:
http://christonard.com/12-free-data-mining-books/

A Few Useful Things to Know about Machine Learning

http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf