Sunday, October 30, 2016

The 10 Algorithms Machine Learning Engineers Need to Know

https://gab41.lab41.org/the-10-algorithms-machine-learning-engineers-need-to-know-f4bb63f5b2fa
 
http://www.kdnuggets.com/2016/08/10-algorithms-machine-learning-engineers.html/2

There is no doubt that machine learning / artificial intelligence has gained popularity over the past couple of years. With Big Data currently the hottest trend in the tech industry, machine learning is incredibly powerful for making predictions or calculated suggestions based on large amounts of data. Some of the most common examples of machine learning are Netflix's algorithms, which suggest movies based on ones you have watched in the past, and Amazon's algorithms, which recommend books based on ones you have bought before.
So if you want to learn more about machine learning, how do you start? For me, my first introduction was an Artificial Intelligence class I took while studying abroad in Copenhagen. My lecturer was a full-time Applied Math and CS professor at the Technical University of Denmark whose research areas are logic and artificial intelligence, focusing primarily on the use of logic to model human-like planning, reasoning, and problem solving. The class was a mix of discussion of theory/core concepts and hands-on problem solving. The textbook we used was one of the AI classics, Stuart Russell and Peter Norvig's Artificial Intelligence — A Modern Approach, and we covered major topics including intelligent agents, problem-solving by searching, adversarial search, probability theory, multi-agent systems, social AI, and the philosophy/ethics/future of AI. At the end of the class, in a team of 3, we implemented, as a programming project, simple search-based agents solving transportation tasks in a virtual environment.
I learned a tremendous amount from that class and decided to keep learning about this specialized topic. In the last few weeks, I have attended multiple tech talks in San Francisco on deep learning, neural networks, and data architecture, as well as a Machine Learning conference with many well-known professionals in the field. Most importantly, I enrolled in Udacity's Intro to Machine Learning online course at the beginning of June and just finished it a few days ago. In this post, I want to share some of the most common machine learning algorithms that I learned from the course.
Machine learning algorithms can be divided into 3 broad categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is useful in cases where a property (label) is available for a certain dataset (training set) but is missing and needs to be predicted for other instances. Unsupervised learning is useful in cases where the challenge is to discover implicit relationships in a given unlabeled dataset (items are not pre-assigned). Reinforcement learning falls between these 2 extremes: there is some form of feedback available for each predictive step or action, but no precise label or error message. Since this was an intro class, I didn't learn about reinforcement learning, but I hope these 10 algorithms on supervised and unsupervised learning will be enough to keep you interested.

Supervised Learning

1. Decision Trees: A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance-event outcomes, resource costs, and utility. Take a look at the image to get a sense of what it looks like.
Decision Tree
From a business decision point of view, a decision tree is the minimum number of yes/no questions that one has to ask to assess the probability of making a correct decision most of the time. As a method, it allows you to approach the problem in a structured and systematic way to arrive at a logical conclusion.
2. Naïve Bayes Classification: Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naïve) independence assumptions between the features. The featured image shows the equation, where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the class prior probability, and P(B) is the predictor prior probability.
Naive Bayes Classification
Some real-world examples are:
· Marking an email as spam or not spam
· Classifying a news article as being about technology, politics, or sports
· Checking whether a piece of text expresses positive or negative emotions
· Face recognition software
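As a quick illustration of the equation above, here is a minimal sketch of Bayes' theorem applied to the spam example; all of the probabilities are made up for illustration.

```python
# Hypothetical sketch of Bayes' theorem for spam filtering.
# Every probability below is invented for the example.

# P(A): class prior probabilities
p_spam, p_ham = 0.4, 0.6

# P(B|A): likelihood of seeing the word "free" in each class
p_word_given_spam, p_word_given_ham = 0.30, 0.01

# P(A|B): posterior probability via Bayes' theorem,
# with P(B) (the predictor prior) expanded in the denominator
evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
posterior_spam = (p_word_given_spam * p_spam) / evidence
print(round(posterior_spam, 3))  # 0.952
```

Even a word that is only moderately more common in spam pushes the posterior sharply toward the spam class, which is why this simple model works so well in practice.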
3. Ordinary Least Squares Regression: If you know statistics, you have probably heard of linear regression before. Least squares is a method for performing linear regression. You can think of linear regression as the task of fitting a straight line through a set of points. There are multiple possible strategies to do this, and the “ordinary least squares” strategy goes like this: you draw a line, and then for each of the data points, measure the vertical distance between the point and the line; the fitted line is the one where the sum of the squares of these distances is as small as possible.
Ordinary Least Squares Regression
Linear refers to the kind of model you are using to fit the data, while least squares refers to the kind of error metric you are minimizing.
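For a single feature, the line minimizing the sum of squared vertical distances has a well-known closed form, sketched below on made-up points.

```python
# Ordinary least squares for one feature, on made-up data points.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept that minimize the sum of squared vertical distances
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x
print(slope, intercept)  # roughly 1.99 and 0.05
```

Any other line through these points would have a strictly larger sum of squared residuals.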
4. Logistic Regression: Logistic regression is a powerful statistical way of modeling a binomial outcome with one or more explanatory variables. It measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution.
Logistic Regression
In general, regressions can be used in real-world applications such as:
· Credit Scoring
· Measuring the success rates of marketing campaigns
· Predicting the revenues of a certain product
· Predicting whether there will be an earthquake on a particular day
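To make the logistic function concrete, here is a minimal sketch of scoring one credit applicant; the intercept, coefficients, and applicant values are all assumed for illustration, not fitted from data.

```python
import math

def logistic(z):
    """Cumulative logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: intercept plus one weight per explanatory variable
intercept = -4.0
weights = {"income": 0.00005, "loan_amount": -0.0001}
applicant = {"income": 60000, "loan_amount": 20000}

# Linear score, then squash it into a probability
z = intercept + sum(weights[k] * applicant[k] for k in weights)
print(round(logistic(z), 3))  # estimated probability of the outcome
```

The linear part measures the relationship between the explanatory variables and the outcome; the logistic function turns that unbounded score into a probability.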
5. Support Vector Machines: SVM is a binary classification algorithm. Given a set of points of 2 types in an N-dimensional space, SVM generates an (N-1)-dimensional hyperplane to separate those points into 2 groups. Say you have some points of 2 types on a piece of paper that are linearly separable. SVM will find a straight line that separates those points into 2 groups and is situated as far as possible from all of them.
Support Vector Machine
In terms of scale, some of the biggest problems that have been solved using SVMs (with suitably modified implementations) are display advertising, human splice site recognition, image-based gender detection, and large-scale image classification.
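What the separating hyperplane means at prediction time can be sketched in a few lines: classify a point by which side of w·x + b = 0 it falls on. The weight vector below is assumed rather than learned; training an SVM is what picks the w and b with the largest margin.

```python
# Prediction step of a linear classifier with an assumed hyperplane w.x + b = 0.
w = [1.0, -1.0]   # normal vector of the hyperplane (assumed, not learned)
b = 0.0

def classify(point):
    """Return +1 or -1 depending on which side of the hyperplane the point is."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + b
    return 1 if score >= 0 else -1

print(classify([2.0, 1.0]))   # 1  (one side of the line x1 = x2)
print(classify([1.0, 3.0]))   # -1 (the other side)
```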
6. Ensemble Methods: Ensemble methods are learning algorithms that construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions. The original ensemble method is Bayesian averaging, but more recent algorithms include error-correcting output coding, bagging, and boosting.
Ensemble Learning Algorithms
So how do ensemble methods work and why are they superior to individual models?
· They average out biases: If you average a bunch of democratic-leaning polls and republican-leaning polls together, you will get an average that isn’t leaning either way.
· They reduce the variance: The aggregate opinion of a bunch of models is less noisy than the single opinion of one of the models. In finance, this is called diversification — a mixed portfolio of many stocks will be much less variable than just one of the stocks alone. This is why your models will be better with more data points rather than fewer.
· They are unlikely to over-fit: If you have individual models that didn’t over-fit, and you are combining the predictions from each model in a simple way (average, weighted average, logistic regression), then there’s no room for over-fitting.
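A simple (unweighted) majority vote over a set of classifiers can be sketched like this; the three models' predictions below are fabricated for illustration.

```python
from collections import Counter

# Three hypothetical classifiers, each represented here by its
# pre-computed predictions on the same four examples.
model_predictions = [
    ["spam", "ham", "spam", "spam"],   # model 1
    ["spam", "spam", "ham", "spam"],   # model 2
    ["ham",  "ham", "spam", "spam"],   # model 3
]

def majority_vote(preds_per_model):
    """Combine models by taking the most common prediction per example."""
    ensemble = []
    for votes in zip(*preds_per_model):
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble

print(majority_vote(model_predictions))
```

Note how the ensemble can be right on an example where one individual model is wrong, which is the variance-reduction effect described above.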

Unsupervised Learning

7. Clustering Algorithms: Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.
Clustering Algorithms
Every clustering algorithm is different, and here are a couple of them:
· Centroid-based algorithms
· Connectivity-based algorithms
· Density-based algorithms
· Probabilistic
· Dimensionality Reduction
· Neural networks / Deep Learning
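As a taste of the centroid-based family, here is a bare-bones k-means sketch on made-up 1-D data: assign each point to its nearest centroid, recompute each centroid as its cluster's mean, and repeat.

```python
import random

def kmeans_1d(points, k, iters=10, seed=0):
    """Minimal centroid-based clustering (k-means) for 1-D data."""
    random.seed(seed)
    centroids = random.sample(points, k)  # initialize from the data
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans_1d(data, k=2))  # two centroids, one per visible cluster
```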
8. Principal Component Analysis: PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
Principal Component Analysis
Some of the applications of PCA include compression, simplifying data for easier learning, visualization. Notice that domain knowledge is very important while choosing whether to go forward with PCA or not. It is not suitable in cases where data is noisy (all the components of PCA have quite a high variance).
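A minimal PCA sketch via the eigendecomposition of the covariance matrix, on a small made-up 2-D dataset; the eigenvectors are the orthogonal transformation, and projecting onto the top component is the dimensionality reduction.

```python
import numpy as np

# Small made-up 2-D dataset of possibly correlated variables
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

Xc = X - X.mean(axis=0)                 # center each variable
cov = np.cov(Xc, rowvar=False)          # covariance of the centered data
eigvals, eigvecs = np.linalg.eigh(cov)  # orthogonal eigendecomposition
order = np.argsort(eigvals)[::-1]       # sort components by explained variance
components = eigvecs[:, order]          # principal components (columns)

# Project onto the first principal component: 2-D -> 1-D
projected = Xc @ components[:, :1]
print(projected.shape)  # (6, 1)
```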
9. Singular Value Decomposition: In linear algebra, SVD is a factorization of a real or complex matrix. For a given m × n matrix M, there exists a decomposition M = UΣV*, where U and V are unitary matrices, V* is the conjugate transpose of V, and Σ is a diagonal matrix of singular values.
Singular Value Decomposition
PCA is actually a simple application of SVD. In computer vision, the first face recognition algorithms used PCA and SVD to represent faces as a linear combination of “eigenfaces”, perform dimensionality reduction, and then match faces to identities via simple methods; although modern methods are much more sophisticated, many still depend on similar techniques.
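The factorization itself is easy to verify numerically; the matrix below is arbitrary, chosen only to show that multiplying the factors back together recovers M.

```python
import numpy as np

# An arbitrary real matrix for illustration
M = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])

# numpy returns U, the singular values, and V-transpose
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Reconstruct M = U Sigma V* from the factors
M_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(M, M_rebuilt))  # True
```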
10. Independent Component Analysis: ICA is a statistical technique for revealing hidden factors that underlie sets of random variables, measurements, or signals. ICA defines a generative model for the observed multivariate data, which is typically given as a large database of samples. In the model, the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed non-gaussian and mutually independent, and they are called independent components of the observed data.
Independent Component Analysis
ICA is related to PCA, but it is a much more powerful technique that is capable of finding the underlying factors of sources when these classic methods fail completely. Its applications include digital images, document databases, economic indicators and psychometric measurements.
Now go forth and wield your understanding of algorithms to create machine learning applications that make better experiences for people everywhere.

Machine Learning Made Easy with Talend – Decision Trees

http://fr.talend.com/blog/2016/09/29/machine-learning-made-easy-with-talend-%E2%80%93-decision-trees

Decision trees are used extensively in machine learning because they are easy to use, easy to interpret, and easy to operationalize. KD Nuggets, one of the most respected sites for data science and machine learning, recently published an article that identified decision trees as a “top 10” algorithm for machine learning. (1)
Besides being easy to understand, use and operationalize, most decision tree algorithms contain very useful features, such as:
  •  Support for continuous and categorical inputs
  •  Clear indication of most important variables
  •  Minimal computation needed for classification
  •  Automatic handling of missing values
If you are new to machine learning, some of these concepts may be unfamiliar. The goal of this blog is to provide you with the basics of decision trees using Talend and Apache Spark.
If you want to learn more about advanced analytics, please see the references section below.(2)  Some of the concepts presented here are sourced from that material.

Machine Learning Definitions

In this post I’ll be referring to a few terms that warrant a brief explanation:
  •  Training Data – a set of data used to train a model, comprising a vector of input variables (features) and a known outcome (target).
  •  Test Data – a set of data used to test model performance after it has been trained.
  •  Features – training data input variables used to train a model. E.g. “age”, “income”, “loan amount.”
  •  Target – output from trained model. E.g. “risk of default.”

Decision Trees Algorithms

Decision trees have several implementations. The three most common are:
  • ID3:
    •  Invented by Ross Quinlan in 1979
    •  Builds the tree from the top down
    •  Information gain is used to select the most useful attribute
    •  Designed for classification trees
    •  Greedy algorithm tends to over-fit data
  • C4.5:
    •  Addresses over-fitting with a bottom-up technique known as pruning
    •  Handles incomplete data points
    •  Handles categorical or continuous data
    •  Developed by Quinlan to expand upon ID3
  • CART - Classification and Regression Trees
    •  Developed by four Berkeley and Stanford statistics professors (1974 - 1984)
    •  Focused on accurate assessment when data is noisy (outliers and missing values)
    •  Handles categorical or continuous data as a function of the dependent variables

Getting Started

Of the three implementations, CART is the most popular. The name refers to the fact that CART can be used when the target variable is expressed either as a finite list of categories, such as age group or hair color (classification), or as a continuous value, such as annual precipitation (regression). Note that it is the target variable that determines the type of tree.
Below, I’ll demonstrate a simple classification tree using data well known to the machine learning community. The kyphosis data set (3) reports on children who, having had corrective spinal surgery, exhibited kyphosis post-surgery.  Below is a sampling of the kyphosis data set which was used to train a decision tree model:

  •  Kyphosis - indicator of whether or not the condition "kyphosis" is present after surgery
  •  Age - age of patient in months
  •  Number - the number of vertebrae involved
  •  Start - the number of the first (topmost) vertebra operated on
The feature variables are 1. Age, 2. Number, and 3. Start. The target variable is Kyphosis, a simple category with two levels: “absent” and “present.”

Anatomy of a Decision Tree

Below is a visual depiction of a classification tree trained using the kyphosis data set.

Key components of a decision tree (4):
  •  Root Node – Topmost internal node
  •  Branch – Outcome of test
  •  Internal Node – Decision on variable
  •  Leaf Node – Class label
  •  Depth – Minimum number of steps required to reach the node from the root
Decision trees are built using a recursive partitioning algorithm that is easily described using the following pseudo code (2):
  1. For every input feature variable, assess the best way to split the data into two or more subgroups. Select the best split and divide the data into the subgroups defined by the split.
  2. Pick one of the subgroups and repeat Step 1. Repeat for every subgroup.
  3. Continue splitting until all records after the split belong to the same target variable or until another stop condition is met.
The concept of “best split” is pertinent to how decision trees are built.  Best split refers to node purity (P), a measure of a node's homogeneity: candidate splits are scored by how much they increase purity, using measures such as entropy and information gain.
Using the root node as an example:

Root Node: P (Kyphosis = “absent”) = 47/60 = 78% pure on the class Kyphosis = “absent.”
With this information, the decision tree algorithm then iterates through all subgroups to find the next best fit. Here, the feature “Start” where its value is greater than or equal to 12, was chosen as the best split.

Internal Node 1 Left Branch: P (Kyphosis = “absent”) = 32/34 = 94% pure on the class Kyphosis = “absent.”  The 57% shown is the percentage of all observations that reach this node (34 of 60).
Node purity is used by the decision tree algorithm as input to more involved calculations: entropy and information gain.
The goal of information gain is to compare the purity of the parent node before the split with the degree of purity after the split.  The feature with the greatest information gain is considered the most informative, and it is used for the split. (4)  In the example above, “Start >= 12” is the most informative feature for the first internal node.
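The purity arithmetic can be sketched as follows; the parent counts (47 "absent", 13 "present") match the root node described above, while the child counts of the candidate split are hypothetical.

```python
import math

def entropy(counts):
    """Entropy of a node given its per-class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# Parent node: 47 "absent" vs 13 "present" (the root node above)
parent = [47, 13]
# Hypothetical child nodes produced by a candidate split
left, right = [32, 2], [15, 11]

# Information gain: parent entropy minus the weighted child entropies
n = sum(parent)
gain = entropy(parent) - (sum(left) / n) * entropy(left) \
                       - (sum(right) / n) * entropy(right)
print(round(gain, 3))
```

The algorithm computes this gain for every candidate split and keeps the one with the largest value.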

Operationalizing a Decision Tree

As mentioned earlier, decision trees are intuitive and easy to explain. They are also very easy to implement using any programming language that supports conditional statements. Using our decision tree, let’s define a few rules for the conditional logic:
  1. if (Start >= 12) then (Kyphosis = “absent”)
  2. if (Start < 12) and (Age < 86) and (Number < 4.5) then (Kyphosis = “absent”)
  3. if (Start < 12) and (Age < 86) and (Number >= 4.5) and (Start < 7.5) then (Kyphosis = “present”)
  4. And so on
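The rules above translate directly into conditional logic; in this sketch, branches not listed in the excerpt simply return None.

```python
def predict_kyphosis(start, age, number):
    """The decision tree rules above, written as plain conditionals.
    The thresholds come from the trained tree; branches omitted
    from the excerpt return None."""
    if start >= 12:
        return "absent"
    if age < 86 and number < 4.5:
        return "absent"
    if age < 86 and number >= 4.5 and start < 7.5:
        return "present"
    return None  # remaining branches of the tree not shown in the excerpt

print(predict_kyphosis(start=14, age=50, number=3))  # absent (rule 1)
```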
Implementing a decision tree model is conceptually straightforward.  However, training, testing, and operationalizing a model is not without some challenges.
One such challenge is the maintenance and upkeep of the model.  The “learning” part of machine learning alludes to the fact that models need to evolve with the data. This necessitates model re-training, which, depending on the business use case, can occur quite frequently.
To maximize efficiency, model training, re-training, and operationalization need to be automated, and Talend makes this easy.

Enabling the Data-Driven Enterprise

Talend provides a comprehensive ecosystem of tools and technologies to facilitate the integration of machine learning into data integration workflows in a continuous and automated way.  This enables organizations to focus more on business outcomes and realize a faster time to value for their most valuable asset: their data.
But don’t just take my word for it. See for yourself! Get Started Right Now with Our Hands-on Tutorial: Machine Learning 101 – Decision Trees

Tuesday, October 4, 2016

Bubble sheet multiple choice scanner and test grader

http://www.pyimagesearch.com/2016/10/03/bubble-sheet-multiple-choice-scanner-and-test-grader-using-omr-python-and-opencv/



Over the past few months I’ve gotten quite the number of requests landing in my inbox to build a bubble sheet/Scantron-like test reader using computer vision and image processing techniques.
And while I’ve been having a lot of fun doing this series on machine learning and deep learning, I’d be lying if I said this little mini-project wasn’t a short, welcome break. One of my favorite parts of running the PyImageSearch blog is demonstrating how to build actual solutions to problems using computer vision.
In fact, what makes this project so special is that we are going to combine the techniques from many previous blog posts, including building a document scanner, contour sorting, and perspective transforms. Using the knowledge gained from these previous posts, we’ll be able to make quick work of this bubble sheet scanner and test grader.
You see, last Friday afternoon I quickly Photoshopped an example bubble test paper, printed out a few copies, and then set to work on coding up the actual implementation.
Overall, I am quite pleased with this implementation and I think you’ll absolutely be able to use this bubble sheet grader/OMR system as a starting point for your own projects.
To learn more about utilizing computer vision, image processing, and OpenCV to automatically grade bubble test sheets, keep reading.

Bubble sheet scanner and test grader using OMR, Python, and OpenCV

In the remainder of this blog post, I’ll discuss what exactly Optical Mark Recognition (OMR) is. I’ll then demonstrate how to implement a bubble sheet test scanner and grader using strictly computer vision and image processing techniques, along with the OpenCV library.
Once we have our OMR system implemented, I’ll provide sample results of our test grader on a few example exams, including ones that were filled out with nefarious intent.
Finally, I’ll discuss some of the shortcomings of this current bubble sheet scanner system and how we can improve it in future iterations.

What is Optical Mark Recognition (OMR)?

Optical Mark Recognition, or OMR for short, is the process of automatically analyzing human-marked documents and interpreting their results.
Arguably, the most famous, easily recognizable form of OMR is the bubble sheet multiple choice test, not unlike the ones you took in elementary school, middle school, or even high school.
If you’re unfamiliar with “bubble sheet tests” or the trademark/corporate name “Scantron tests”, they are simply multiple-choice tests that you take as a student. Each question on the exam is multiple choice, and you use a #2 pencil to mark the “bubble” that corresponds to the correct answer.
The most notable bubble sheet test you experienced (at least in the United States) was taking the SATs during high school, prior to filling out college admission applications.
I believe that the SAT uses software provided by Scantron to perform OMR and grade student exams, but I could easily be wrong there. I only make note of this because Scantron is used in over 98% of all US school districts.
In short, what I’m trying to say is that there is a massive market for Optical Mark Recognition and the ability to grade and interpret human-marked forms and exams.

Implementing a bubble sheet scanner and grader using OMR, Python, and OpenCV

Now that we understand the basics of OMR, let’s build a computer vision system using Python and OpenCV that can read and grade bubble sheet tests.
Of course, I’ll be providing lots of visual example images along the way so you can understand exactly what techniques I’m applying and why I’m using them.
Below I have included an example filled in bubble sheet exam that I have put together for this project:
Figure 1: The example, filled in bubble sheet we are going to use when developing our test scanner software.
We’ll be using this as our example image as we work through the steps of building our test grader. Later in this lesson, you’ll also find additional sample exams.
I have also included a blank exam template as a .PSD (Photoshop) file so you can modify it as you see fit. You can use the “Downloads” section at the bottom of this post to download the code, example images, and template file.

The 7 steps to build a bubble sheet scanner and grader

The goal of this blog post is to build a bubble sheet scanner and test grader using Python and OpenCV.
To accomplish this, our implementation will need to satisfy the following 7 steps:
  • Step #1: Detect the exam in an image.
  • Step #2: Apply a perspective transform to extract the top-down, birds-eye-view of the exam.
  • Step #3: Extract the set of bubbles (i.e., the possible answer choices) from the perspective transformed exam.
  • Step #4: Sort the questions/bubbles into rows.
  • Step #5: Determine the marked (i.e., “bubbled in”) answer for each row.
  • Step #6: Lookup the correct answer in our answer key to determine if the user was correct in their choice.
  • Step #7: Repeat for all questions in the exam.
The next section of this tutorial will cover the actual implementation of our algorithm.

The bubble sheet scanner implementation with Python and OpenCV

To get started, open up a new file, name it test_grader.py , and let’s get to work:
On Lines 2-7 we import our required Python packages.
You should already have OpenCV and NumPy installed on your system, but you might not have the most recent version of imutils, my set of convenience functions that make performing basic image processing operations easier. To install imutils (or upgrade to the latest version), just execute the following command:
Lines 10-12 parse our command line arguments. We only need a single switch here, --image , which is the path to the input bubble sheet test image that we are going to grade for correctness.
Line 17 then defines our ANSWER_KEY .
As the name of the variable suggests, the ANSWER_KEY  provides integer mappings of the question numbers to the index of the correct bubble.
In this case, a key of 0 indicates the first question, while a value of 1 signifies “B” as the correct answer (since “B” is at index 1 in the string “ABCDE”). As a second example, consider a key of 1 that maps to a value of 4; this would indicate that the answer to the second question is “E”.
As a matter of convenience, I have written the entire answer key in plain English here:
  • Question #1: B
  • Question #2: E
  • Question #3: A
  • Question #4: D
  • Question #5: B
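Given that description, the key presumably maps zero-indexed question numbers to bubble indices in "ABCDE"; here is a sketch consistent with the plain-English key above.

```python
# Answer key sketch: question number (zero-indexed) -> index of the
# correct bubble in "ABCDE", matching the plain-English list above.
ANSWER_KEY = {0: 1, 1: 4, 2: 0, 3: 3, 4: 1}

for question, index in sorted(ANSWER_KEY.items()):
    print("Question #{}: {}".format(question + 1, "ABCDE"[index]))
```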
Next, let’s preprocess our input image:
On Line 21 we load our image from disk, followed by converting it to grayscale (Line 22), and blurring it to reduce high frequency noise (Line 23).
We then apply the Canny edge detector on Line 24 to find the edges/outlines of the exam.
Below I have included a screenshot of our exam after applying edge detection:
Figure 2: Applying edge detection to our exam neatly reveals the outlines of the paper.
Notice how the edges of the document are clearly defined, with all four vertices of the exam being present in the image.
Obtaining this silhouette of the document is extremely important in our next step as we will use it as a marker to apply a perspective transform to the exam, obtaining a top-down, birds-eye-view of the document:
Now that we have the outline of our exam, we apply the cv2.findContours  function to find the lines that correspond to the exam itself.
We do this by sorting our contours by their area (from largest to smallest) on Line 37 (after making sure at least one contour was found on Line 34, of course). This implies that larger contours will be placed at the front of the list, while smaller contours will appear farther back in the list.
We make the assumption that our exam will be the main focal point of the image, and thus be larger than other objects in the image. This assumption allows us to “filter” our contours, simply by investigating their area and knowing that the contour that corresponds to the exam should be near the front of the list.
However, contour area and size is not enough — we should also check the number of vertices on the contour.
To do this, we loop over each of our (sorted) contours on Line 40. For each of them, we approximate the contour, which in essence means we simplify the number of points in the contour, making it a “more basic” geometric shape. You can read more about contour approximation in this post on building a mobile document scanner.
On Line 47 we make a check to see if our approximated contour has four points, and if it does, we assume that we have found the exam.
Below I have included an example image that demonstrates the docCnt  variable being drawn on the original image:
Figure 3: An example of drawing the contour associated with the exam on our original image, indicating that we have successfully found the exam.
Sure enough, this area corresponds to the outline of the exam.
Now that we have used contours to find the outline of the exam, we can apply a perspective transform to obtain a top-down, birds-eye-view of the document:
In this case, we’ll be using my implementation of the four_point_transform  function which:
  1. Orders the (x, y)-coordinates of our contours in a specific, reproducible manner.
  2. Applies a perspective transform to the region.
You can learn more about the perspective transform in this post as well as this updated one on coordinate ordering, but for the time being, simply understand that this function handles taking the “skewed” exam and transforms it, returning a top-down view of the document:
Figure 4: Obtaining a top-down, birds-eye view of both the original image (left) along with the grayscale version (right).
Alright, so now we’re getting somewhere.
We found our exam in the original image.
We applied a perspective transform to obtain a 90 degree viewing angle of the document.
But how do we go about actually grading the document?
This step starts with binarization, or the process of thresholding/segmenting the foreground from the background of the image:
After applying Otsu’s thresholding method, our exam is now a binary image:
Figure 5: Using Otsu’s thresholding allows us to segment the foreground from the background of the image.
Notice how the background of the image is black, while the foreground is white.
This binarization will allow us to once again apply contour extraction techniques to find each of the bubbles in the exam:
Lines 64-67 handle finding contours on our thresh  binary image, followed by initializing questionCnts , a list of contours that correspond to the questions/bubbles on the exam.
To determine which regions of the image are bubbles, we first loop over each of the individual contours (Line 70).
For each of these contours, we compute the bounding box (Line 73), which also allows us to compute the aspect ratio, or more simply, the ratio of the width to the height (Line 74).
In order for a contour area to be considered a bubble, the region should:
  1. Be sufficiently wide and tall (in this case, at least 20 pixels in both dimensions).
  2. Have an aspect ratio that is approximately equal to 1.
As long as these checks hold, we can update our questionCnts  list and mark the region as a bubble.
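The two checks can be sketched as a small predicate; the (w, h) values here stand in for what cv2.boundingRect would return, and the exact aspect-ratio tolerance is an assumption for illustration.

```python
def is_bubble(w, h, min_size=20, ar_low=0.9, ar_high=1.1):
    """Bubble filter: bounding box at least min_size pixels in both
    dimensions, with a width/height ratio approximately equal to 1.
    The 0.9-1.1 tolerance band is an assumed value."""
    aspect_ratio = w / float(h)
    return w >= min_size and h >= min_size and ar_low <= aspect_ratio <= ar_high

print(is_bubble(25, 24))   # True: roughly square and large enough
print(is_bubble(40, 12))   # False: too short and too flat to be a bubble
```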
Below I have included a screenshot that has drawn the output of questionCnts  on our image:
Figure 6: Using contour filtering allows us to find all the question bubbles in our bubble sheet exam recognition software.
Notice how only the question regions of the exam are highlighted and nothing else.
We can now move on to the “grading” portion of our OMR system:
First, we must sort our questionCnts  from top-to-bottom. This will ensure that rows of questions that are closer to the top of the exam will appear first in the sorted list.
We also initialize a bookkeeper variable to keep track of the number of correct  answers.
On Line 90 we start looping over our questions. Since each question has 5 possible answers, we’ll apply NumPy array slicing and contour sorting to sort the current set of contours from left to right.
This methodology works because we have already sorted our contours from top-to-bottom. We know that the 5 bubbles for each question will appear sequentially in our list, but we do not know whether these bubbles will be sorted from left-to-right. The sort contour call on Line 94 takes care of this issue and ensures each row of contours is sorted from left-to-right.
To visualize this concept, I have included a screenshot below that depicts each row of questions as a separate color:
Figure 7: By sorting our contours from top-to-bottom, followed by left-to-right, we can extract each row of bubbles. Therefore, each row is equal to the bubbles for one question.
Given a row of bubbles, the next step is to determine which bubble is filled in.
We can accomplish this by using our thresh  image and counting the number of non-zero pixels (i.e., foreground pixels) in each bubble region:
Line 98 handles looping over each of the sorted bubbles in the row.
We then construct a mask for the current bubble on Line 101 and then count the number of non-zero pixels in the masked region (Lines 107 and 108). The more non-zero pixels we count, the more foreground pixels there are; therefore, the bubble with the maximum non-zero count is the bubble that the test taker has filled in (Lines 113 and 114).
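The mask-and-count idea can be simulated without the rest of the OpenCV pipeline; the tiny binary "image" and bubble column spans below are fabricated for illustration.

```python
import numpy as np

# A tiny stand-in for the thresholded exam: one row of five bubble regions.
thresh = np.zeros((10, 50), dtype=np.uint8)
thresh[2:8, 12:18] = 255            # pretend the second bubble is filled in

# (x-start, x-end) column spans acting as the mask for each bubble region
bubble_spans = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]

# Count non-zero (foreground) pixels in each masked region;
# the region with the most foreground pixels is the marked bubble.
counts = [np.count_nonzero(thresh[:, x0:x1]) for x0, x1 in bubble_spans]
marked = int(np.argmax(counts))
print(marked)  # 1, i.e. the second bubble ("B")
```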
Below I have included an example of creating and applying a mask to each bubble associated with a question:
Figure 8: An example of constructing a mask for each bubble in a row.
Clearly, the bubble associated with “B” has the most thresholded pixels, and is therefore the bubble that the user has marked on their exam.
This next code block handles looking up the correct answer in the ANSWER_KEY, updating any relevant bookkeeper variables, and finally drawing the marked bubble on our image:
Whether the test taker was correct or incorrect determines which color is drawn on the exam. If the test taker is correct, we’ll highlight their answer in green. However, if the test taker made a mistake and marked an incorrect answer, we’ll let them know by highlighting the correct answer in red:
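A hedged sketch of this color-selection logic follows. ANSWER_KEY maps a question index to the correct bubble index, and the `marked` dictionary stands in for the per-row detections from the previous step (the data here is illustrative, not taken from the tutorial’s listing):

```python
# Sketch of the grading/coloring logic. ANSWER_KEY maps question
# index -> correct bubble index; `marked` is hypothetical detection
# output standing in for the bubbled-in indices found earlier.
ANSWER_KEY = {0: 1, 1: 4, 2: 0, 3: 3, 4: 1}
marked = {0: 1, 1: 4, 2: 0, 3: 2, 4: 1}  # question 4 answered wrong

correct = 0
colors = {}
for q, bubbled in marked.items():
    k = ANSWER_KEY[q]
    if k == bubbled:
        colors[q] = (0, 255, 0)  # green: the answer was correct
        correct += 1
    else:
        colors[q] = (0, 0, 255)  # red: highlight the *correct* answer
    # In the real code the outline is drawn with something like
    # cv2.drawContours(paper, [cnts[k]], -1, colors[q], 3).
```

Note that in the incorrect case it is the correct bubble, not the marked one, that gets outlined.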
Figure 9: Drawing a “green” circle to mark “correct” or a “red” circle to mark “incorrect”.
Finally, our last code block handles scoring the exam and displaying the results to our screen:
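The scoring arithmetic itself is a one-liner; a short sketch, assuming 5 questions and the `correct` counter from the grading loop (the value here is hypothetical):

```python
# Compute the exam score as a percentage of correct answers,
# assuming 5 questions; `correct` is a hypothetical value here.
correct = 4
score = (correct / 5.0) * 100
print("[INFO] score: {:.2f}%".format(score))
# The real code also draws the score onto the paper with cv2.putText
# and displays the original and graded images with cv2.imshow.
```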
Below you can see the output of our fully graded example image:
Figure 10: Finishing our OMR system for grading human-taken exams.
In this case, the reader obtained an 80% on the exam. The only question they missed was #4 where they incorrectly marked “C” as the correct answer (“D” was the correct choice).

Why not use circle detection?

After going through this tutorial, you might be wondering:
“Hey Adrian, an answer bubble is a circle. So why did you extract contours instead of applying Hough circles to find the circles in the image?”
Great question.
To start, tuning the parameters to Hough circles on an image-to-image basis can be a real pain. But that’s only a minor reason.
The real reason is:
User error.
How many times, whether purposely or not, have you filled in outside the lines on your bubble sheet? I’m no expert, but I’d have to guess that at least 1 in every 20 marks a test taker fills in is “slightly” outside the lines.
And guess what?
Hough circles don’t handle deformations in their outlines very well — your circle detection would totally fail in that case.
Because of this, I instead recommend using contours and contour properties to help you filter the bubbles and answers. The cv2.findContours function doesn’t care if the bubble is “round”, “perfectly round”, or “oh my god, what the hell is that?”.
Instead, the cv2.findContours function will return a set of blobs to you, which will be the foreground regions in your image. You can then take these regions, process and filter them to find your questions (as we did in this tutorial), and go about your way.

Our bubble sheet test scanner and grader results

To see our bubble sheet test grader in action, be sure to download the source code and example images to this post using the “Downloads” section at the bottom of the tutorial.
We’ve already seen test_01.png as our example earlier in this post, so let’s try test_02.png:
Here we can see that a particularly nefarious user took our exam. They were not happy with the test, writing “#yourtestsux” across the front of it along with an anarchy inspiring “#breakthesystem”. They also marked “A” for all answers.
Perhaps it comes as no surprise that the user scored a pitiful 20% on the exam, based entirely on luck:
Figure 11: By using contour filtering, we are able to ignore the regions of the exam that would have otherwise compromised its integrity.
Let’s try another image:
This time the reader did a little better, scoring a 60%:
Figure 12: Building a bubble sheet scanner and test grader using Python and OpenCV.
In this particular example, the reader simply marked all answers along a diagonal:
Figure 13: Optical Mark Recognition for test scoring using Python and OpenCV.
Unfortunately for the test taker, this strategy didn’t pay off very well.
Let’s look at one final example:
Figure 14: Recognizing bubble sheet exams using computer vision.
This student clearly studied ahead of time, earning a perfect 100% on the exam.

Extending the OMR and test scanner

Admittedly, this past summer/early autumn has been one of the busiest periods of my life, so I needed to timebox the development of the OMR and test scanner software into a single, shortened afternoon last Friday.
While I was able to get the barebones of a working bubble sheet test scanner implemented, there are certainly a few areas that need improvement. The most obvious area for improvement is the logic to handle non-filled in bubbles.
In the current implementation, we (naively) assume that a reader has filled in one and only one bubble per question row.
However, since we determine if a particular bubble is “filled in” simply by counting the number of thresholded pixels in a row and then sorting in descending order, this can lead to two problems:
  1. What happens if a user does not bubble in an answer for a particular question?
  2. What if the user is nefarious and marks multiple bubbles as “correct” in the same row?
Luckily, detecting and handling these issues isn’t terribly challenging; we just need to insert a bit of logic.
For issue #1, if a reader chooses not to bubble in an answer for a particular row, then we can place a minimum threshold on Line 108 where we compute cv2.countNonZero:
Figure 15: Detecting if a user has marked zero bubbles on the exam.
If this value is sufficiently large, then we can mark the bubble as “filled in”. Conversely, if total is too small, then we can skip that particular bubble. If at the end of the row there are no bubbles with sufficiently large threshold counts, we can mark the question as “skipped” by the test taker.
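The skipped-question check can be sketched like this, where MIN_PIXELS is a hypothetical threshold you would tune to your bubble size and scan resolution, and `totals` stands in for the per-bubble non-zero counts:

```python
# Sketch of the issue-#1 fix: require a minimum non-zero pixel count
# before a bubble counts as "filled in". MIN_PIXELS is a hypothetical
# threshold tuned to your bubble size and scan resolution.
MIN_PIXELS = 200
totals = [12, 8, 15, 10, 9]  # counts for one row: every bubble is faint

bubbled = None
for j, total in enumerate(totals):
    if total < MIN_PIXELS:
        continue  # too few pixels: treat this bubble as empty
    if bubbled is None or total > bubbled[0]:
        bubbled = (total, j)

# If no bubble passed the threshold, the question was skipped.
skipped = bubbled is None
```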
A similar set of steps can be applied to issue #2, where a user marks multiple bubbles as correct for a single question:
Figure 16: Detecting if a user has marked multiple bubbles for a given question.
Again, all we need to do is apply our thresholding and count step, this time keeping track of whether multiple bubbles have a total that exceeds some pre-defined value. If so, we can invalidate the question and mark it as incorrect.
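The multiple-mark check can be sketched in the same style; MIN_PIXELS is a hypothetical “filled in” threshold and `totals` stands in for the per-bubble non-zero counts:

```python
# Sketch of the issue-#2 fix: if more than one bubble's count exceeds
# the threshold, invalidate the question. MIN_PIXELS is a hypothetical
# "filled in" threshold; `totals` is illustrative data.
MIN_PIXELS = 200
totals = [450, 20, 430, 15, 10]  # two bubbles are heavily marked

filled = [j for j, t in enumerate(totals) if t >= MIN_PIXELS]
if len(filled) > 1:
    result = "invalid"   # multiple marks: count the question as wrong
elif len(filled) == 1:
    result = filled[0]   # exactly one mark: this is the answer index
else:
    result = "skipped"   # no marks: the question was skipped
```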

Summary

In this blog post, I demonstrated how to build a bubble sheet scanner and test grader using computer vision and image processing techniques.
Specifically, we implemented Optical Mark Recognition (OMR) methods that allowed us to capture human-marked documents and automatically analyze the results.
Finally, I provided a Python and OpenCV implementation that you can use for building your own bubble sheet test grading systems.
If you have any questions, please feel free to leave a comment in the comments section!
But before you go, be sure to enter your email address in the form below to be notified when future tutorials are published on the PyImageSearch blog!