TL;DR let’s train a network on a rare visual language together—join us!
Weights
& Biases makes collaborative deep learning easy for teams by
organizing experiments and notes in a shared workspace, tracking all the
code, standardizing all other inputs and outputs, and handling the
boring parts (like plotting results) so you can focus on the most
exciting part—solving an interesting and meaningful puzzle in
conversation with others.
Kmnist Benchmark
With
public benchmarks, we want to explore collaboration at the broader
community level: how can individual effort and ideas be maximally
accessible to and useful for the field? We add wandb to the Kuzushiji-MNIST dataset
(kmnist): images of 10 different characters of a classical Japanese
cursive script. In three commands (at the bottom of this post), you can
set up and run your own version, play with hyperparameters, and
visualize model performance in your browser.
Dataset
We
chose this dataset because it is a fresh reimagining of the well-known
baseline of handwritten digits (mnist). It preserves the technical
simplicity of mnist and offers more creative headroom, since the
solution space is less explored and visual intuition is unreliable (only
a few experts can read Kuzushiji). Mnist generalization ends at 10
digits; kmnist extends to Kuzushiji-49 (270,912 images, 49 characters)
and the heavily-imbalanced Kuzushiji-Kanji (140,426 images, 3832
characters, some with 12 distinct variants). While mnist is basically
solved, kmnist can help us understand the structure of a disappearing
language and digitize ~300,000 old Japanese books (see this paper for more details).
Small incentive
To
incentivize initial work on the kmnist benchmark, we’re offering $1000
in compute credits to the contributor who achieves the highest validation accuracy
within six weeks (by July 8th). We hope you will use those to make
something awesome!
The larger incentive
We are
developing benchmarks to encourage clear documentation, synthesis of
background and new ideas, and compression of research effort. We can
build better and faster by starting with a team at the top of the
collectively-reinforced foundation instead of alone at the bottom of a
pile of papers and blog posts. We hope you’ll help us nudge the machine
learning world in this direction by collaborating on the benchmark here.
How to start
We wanted to make it ridiculously easy to participate. You can go to https://app.wandb.ai/wandb/kmnist/benchmark or follow these commands:
1. Get the code and training data:
The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these everyday objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU working memory constraint.
Code
isn’t just meant to be executed. Code is also a means of communication
across a team, a way to describe to others the solution to a problem.
Readable code is not a nice-to-have, it is a fundamental part of what
writing code is about. This involves factoring code clearly, picking
self-explanatory variable names, and inserting comments to describe
anything that’s implicit.
Ask
not what your pull request can do for your next promotion, ask what
your pull request can do for your users and your community. Avoid
“conspicuous contribution” at all cost. Let no feature be added if it
isn’t clearly helping with the purpose of your product.
Taste
applies to code, too. Taste is a constraint-satisfaction process
regularized by a desire for simplicity. Keep a bias toward simplicity.
It’s
okay to say no — just because someone asks for a feature doesn’t mean
you should do it. Every feature has a cost that goes beyond the initial
implementation: maintenance cost, documentation cost, and cognitive cost
for your users. Always ask: Should we really do this? Often, the answer
is simply no.
When
you say yes to a request for supporting a new use case, remember that
literally adding what the user requested is often not the optimal
choice. Users are focused on their own specific use case, and you must
counter this with a holistic and principled vision of the whole project.
Often, the right answer is to extend an existing feature.
Invest
in continuous integration and aim for full unit test coverage. Make
sure you are in an environment where you can code with confidence; if
that isn’t the case, start by focusing on building the right
infrastructure.
It’s
okay not to plan everything in advance. Try things and see how they
turn out. Revert incorrect choices early. Make sure you create an
environment where that is possible.
Good
software makes hard things easy. Just because a problem looks difficult
at first doesn’t mean the solution will have to be complex or hard to
use. Too often, engineers go with reflex solutions that introduce
undesirable complexity (Let’s use ML! Let’s build an app! Let’s add blockchain!)
in situations where a far easier, though maybe less obvious,
alternative is available. Before you write any code, make sure your
solution of choice cannot be made any simpler. Approach everything from
first principles.
Avoid
implicit rules. Implicit rules that you find yourself developing should
always be made explicit and shared with others or automated. Whenever
you find yourself coming up with a recurring, quasi-algorithmic
workflow, you should seek to formalize it into a documented process, so
that other team members will benefit from the experience. In addition,
you should seek to automate in software any part of such a workflow that
can be automated (e.g., correctness checks).
The
total impact of your choices should be taken into account in the design
process, not just the bits you want to focus on — such as revenue or
growth. Beyond the metrics you are monitoring, what total impact does
your software have on its users, on the world? Are there undesirable
side effects that outweigh the value proposition? What can you do to
address them while preserving the software’s usefulness?
Design for ethics. Bake your values into your creations.
On API Design
Your
API has users, thus it has a user experience. In every decision you
make, always keep the user in mind. Have empathy for your users, whether
they are beginners or experienced developers.
Always
seek to minimize the cognitive load imposed on your users in the course
of using your API. Automate what can be automated, minimize the actions
and choices needed from the user, don’t expose options that are
unimportant, design simple and consistent workflows that reflect simple
and consistent mental models.
Simple
things should be simple, complex things should be possible. Don’t
increase the cognitive load of common use cases for the sake of niche
use cases, even minimally.
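As a hypothetical illustration (the function and its arguments below are invented for this note, not taken from any real library), an API can keep the common case trivial while leaving the niche options reachable:

# Hypothetical image-loading API: the common case is one call with defaults,
# while less common needs stay possible via keyword arguments.
def load_image(path, target_size=(224, 224), grayscale=False):
  """Load an image from disk; placeholder body for illustration."""
  ...

# Simple thing, simple call:
image = load_image("cat.jpg")

# Complex thing, still possible, without burdening the common case:
thumb = load_image("cat.jpg", target_size=(64, 64), grayscale=True)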
If
the cognitive load of a workflow is sufficiently low, it should be
possible for a user to go through it from memory (without looking up a
tutorial or documentation) after having done it once or twice.
Seek
to have an API that matches the mental models of domain experts and
practitioners. Someone who has domain experience, but no experience with
your API, should be able to intuitively understand your API using
minimal documentation, mostly just by looking at a couple of code
examples and seeing what objects are available and what their signatures
are.
The
meaning of an argument should be understandable without having any
context about the underlying implementation. Arguments that have to be
specified by users should relate to the mental models that the users
have about the problem, not to implementation details in your code. An
API is all about the problem it solves, not about how the software works
in the background.
The
most powerful mental models are modular and hierarchical: simple at a
high level, yet precise as you need to go into details. In the same way,
a good API is modular and hierarchical: easy to approach, yet
expressive. There is a balance to strike between having complex
signatures on fewer objects, and having more objects with simpler
signatures. A good API has a reasonable number of objects, with
reasonably simple signatures.
Your
API is inevitably a reflection of your implementation choices, in
particular your choice of data structures. To achieve an intuitive API,
you must choose data structures that naturally fit the domain at
hand — that match the mental models of domain experts.
Deliberately design end-to-end workflows, not a set of atomic features. Most developers approach API design by asking: What capabilities should be available? Let’s have configuration options for them. Instead, ask: What
are the use cases for this tool? For each use case, what is the optimal
sequence of user actions? What’s the easiest API that could support
this workflow? Atomic options in your API should answer a clear
need that arises in a high-level workflow — they should not be added
“because someone might need it.”
Error
messages, and in general any feedback provided to a user in the course
of interacting with your API, is part of the API. Interactivity and
feedback are integral to the user experience. Design your API’s error
messages deliberately.
Because
code is communication, naming matters — whether naming a project or a
variable. Names reflect how you think about a problem. Avoid overly
generic names (x, variable, parameter), avoid OverlyLongAndSpecificNamingPatterns, avoid terms that can create unnecessary friction (master, slave),
and make sure you are consistent in your naming choices. Naming
consistency means both internal naming consistency (don’t call “dim”
what is called “axis” in other places) and consistency with established
conventions for the problem domain. Before settling on a name, make sure
to look up existing names used by domain experts (or other APIs).
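A small hypothetical sketch of the kind of inconsistency to avoid (the function names and signatures are invented for illustration):

# Inconsistent: the same concept goes by two names within one API.
def reduce_mean(x, axis=None): ...
def reduce_max(x, dim=None): ...   # "dim" here, "axis" there

# Consistent: pick the domain-standard term and use it everywhere.
def reduce_mean(x, axis=None): ...
def reduce_max(x, axis=None): ...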
Documentation
is central to the user experience of your API. It is not an add-on.
Invest in high-quality documentation; you will see higher returns than
investing in more features.
Show,
don’t tell: Your documentation should not talk about how the software
works, it should show how to use it. Show code examples for end-to-end
workflows; show code examples for each and every common use case and key
feature of your API.
On Software Careers
Career
progress is not how many people you manage, it is how much of an impact
you make: the differential between a world with and without your work.
Software
development is teamwork; it is about relationships as much as it is
about technical ability. Be a good teammate. As you go on your way, stay
in touch with people.
Technology
is never neutral. If your work has any impact on the world, then this
impact has a moral direction. The seemingly innocuous technical choices
we make in software products modulate the terms of access to technology,
its usage incentives, who will benefit, and who will suffer. Technical
choices are also ethical choices. Thus, always be deliberate and
explicit about the values you want your choices to support. Design for
ethics. Bake your values into your creations. Never think, “I’m just building the capability; that in itself is neutral.” It is not, because the way you build it determines how it will get used.
Self-direction — agency
over your work and your circumstances — is the key to life
satisfaction. Make sure you grant sufficient self-direction to the
people around you, and make sure your career choices result in greater
agency for yourself.
Build
what the world needs — not just what you wish you had. Too often,
technologists live rarefied lives and focus on products catering to
their own specific needs. Seek opportunities to broaden your life
experience, which will give you better visibility into what the world
needs.
When
making any choice with long-term repercussions, place your values above
short-term self-interest and passing emotions — such as greed or fear.
Know what your values are, and let them guide you.
When
we find ourselves in a conflict, it’s a good idea to pause to
acknowledge our shared values and our shared goals, and remind ourselves
that we are, almost certainly, on the same side.
Productivity boils down to high-velocity decision-making and a bias for action. This requires a) good intuition, which comes from experience, so as to make generally correct decisions given partial information, b)
a keen awareness of when to move more carefully and wait for more
information, because the cost of an incorrect decision would be greater
than the cost of the delay. The optimal velocity/quality decision-making
tradeoff can vary greatly in different environments.
Making
decisions faster means you make more decisions over the course of your
career, which will give you stronger intuition about the correctness of
available options. Experience is key to productivity, and greater
productivity will provide you with more experience: a virtuous cycle.
In
situations where you are aware that your intuition is lacking, adhere
to abstract principles. Build up lists of tried-and-true principles
throughout your career. Principles are formalized intuition that
generalize to a broader range of situations than raw pattern recognition
(which requires direct and extensive experience of similar situations).
Posted by Mingxing Tan, Staff Software Engineer, and Quoc V. Le, Principal Scientist, Google AI
Convolutional neural networks
(CNNs) are commonly developed at a fixed resource cost, and then scaled
up in order to achieve better accuracy when more resources are made
available. For example, ResNet can be scaled up from ResNet-18 to ResNet-200 by increasing the number of layers, and recently, GPipe achieved 84.3% ImageNet
top-1 accuracy by scaling up a baseline CNN by a factor of four. The
conventional practice for model scaling is to arbitrarily increase the
CNN depth or width, or to use larger input image resolution for training
and evaluation. While these methods do improve accuracy, they usually
require tedious manual tuning, and still often yield suboptimal
performance. What if, instead, we could find a more principled method to
scale up a CNN to obtain better accuracy and efficiency?
In our ICML 2019 paper, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, we propose a novel model scaling method that uses a simple yet highly effective compound coefficient
to scale up CNNs in a more structured manner. Unlike conventional
approaches that arbitrarily scale network dimensions, such as width,
depth and resolution, our method uniformly scales each dimension with a
fixed set of scaling coefficients. Powered by this novel scaling method
and recent progress on AutoML,
we have developed a family of models, called EfficientNets, which
surpass state-of-the-art accuracy with up to 10x better efficiency
(smaller and faster).
Compound Model Scaling: A Better Way to Scale Up CNNs
In order to understand the effect of scaling the network, we
systematically studied the impact of scaling different dimensions of the
model. While scaling individual dimensions improves model performance,
we observed that balancing all dimensions of the network—width, depth,
and image resolution—against the available resources would best improve
overall performance.
The first step in the compound scaling method is to perform a grid search
to find the relationship between different scaling dimensions of the
baseline network under a fixed resource constraint (e.g., 2x more FLOPS). This
determines the appropriate scaling coefficient for each of the
dimensions mentioned above. We then apply those coefficients to scale up
the baseline network to the desired target model size or computational
budget.
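To make the procedure concrete, here is a rough Python sketch of compound scaling. The helper function and the coefficient values below are illustrative assumptions for this note, not code from the paper; the point is only that a single compound coefficient phi drives depth, width, and resolution together via per-dimension constants found once on the baseline.

# Rough sketch (names and values illustrative): scale all three dimensions
# with one compound coefficient phi.
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
  depth = base_depth * (alpha ** phi)            # layer-count multiplier
  width = base_width * (beta ** phi)             # channel-count multiplier
  resolution = base_resolution * (gamma ** phi)  # input image size
  return depth, width, resolution

# Scaling a hypothetical baseline up by phi = 2:
print(compound_scale(1.0, 1.0, 224, phi=2))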
Comparison of
different scaling methods. Unlike conventional scaling methods (b)-(d)
that arbitrarily scale a single dimension of the network, our compound
scaling method uniformly scales up all dimensions in a principled way.
This compound scaling method consistently improves model accuracy and efficiency for scaling up existing models such as MobileNet (+1.4% ImageNet accuracy) and ResNet (+0.7%), compared to conventional scaling methods.
EfficientNet Architecture
The effectiveness of model scaling also relies heavily on the baseline
network. So, to further improve performance, we have also developed a
new baseline network by performing a neural architecture search using the AutoML MNAS framework,
which optimizes both accuracy and efficiency (FLOPS). The resulting
architecture uses mobile inverted bottleneck convolution (MBConv),
similar to MobileNetV2 and MnasNet,
but is slightly larger due to an increased FLOP budget. We then scale
up the baseline network to obtain a family of models, called EfficientNets.
The architecture for our baseline network EfficientNet-B0 is simple and clean, making it easier to scale and generalize.
EfficientNet Performance
We have compared our EfficientNets with other existing CNNs on ImageNet.
In general, the EfficientNet models achieve both higher accuracy and
better efficiency over existing CNNs, reducing parameter size and FLOPS
by an order of magnitude. For example, in the high-accuracy regime, our
EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5
accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on CPU
inference than the previous GPipe. Compared with the widely used ResNet-50, our EfficientNet-B4 uses similar FLOPS, while improving the top-1 accuracy from 76.3% of ResNet-50 to 82.6% (+6.3%).
Model Size vs. Accuracy Comparison. EfficientNet-B0 is the baseline network developed by AutoML MNAS,
while EfficientNet-B1 to B7 are obtained by scaling up the baseline
network. In particular, our EfficientNet-B7 achieves new
state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy, while being 8.4x
smaller than the best existing CNN.
Though EfficientNets perform well on ImageNet, to be
most useful, they should also transfer to other datasets. To evaluate
this, we tested EfficientNets on eight widely used transfer learning
datasets. EfficientNets achieved state-of-the-art accuracy in 5 out of
the 8 datasets, such as CIFAR-100 (91.7%) and Flowers
(98.8%), with an order of magnitude fewer parameters (up to 21x
parameter reduction), suggesting that our EfficientNets also transfer
well.
By providing significant improvements to model efficiency, we expect
EfficientNets could potentially serve as a new foundation for future
computer vision tasks. Therefore, we have open-sourced all EfficientNet
models, which we hope can benefit the larger machine learning community.
You can find the EfficientNet source code and TPU training scripts here.
Acknowledgements: Special thanks to Hongkun Yu, Ruoming Pang, Vijay Vasudevan, Alok
Aggarwal, Barret Zoph, Xianzhi Du, Xiaodan Song, Samy Bengio, Jeff Dean,
and the Google Brain team.
Pythia is a modular framework for vision and language multimodal research. Built on top
of PyTorch, it features:
Model Zoo: Reference implementations for state-of-the-art vision and language models, including
LoRRA (SoTA on VQA and TextVQA),
Pythia model (VQA 2018 challenge winner) and BAN.
Multi-Tasking: Support for multi-tasking, which allows training on multiple datasets together.
Datasets: Built-in support for various datasets, including VQA, VizWiz, TextVQA and VisualDialog.
Modules: Provides implementations of many layers commonly used in the vision and language domain.
Distributed: Support for distributed training based on DataParallel as well as DistributedDataParallel.
Unopinionated: Unopinionated about the dataset and model implementations built on top of it.
Customization: Custom losses, metrics, scheduling, optimizers, tensorboard; suits all your custom needs.
You can use Pythia to bootstrap your next vision and language multimodal research project.
Pythia can also act as a starter codebase for challenges around vision and
language datasets (TextVQA challenge, VQA challenge).
A simple guide to what CNNs are, how they work, and how to build one from scratch in Python.
There’s
been a lot of buzz about Convolutional Neural Networks (CNNs) in the past
few years, especially because of how they’ve revolutionized the field
of Computer Vision. In this post, we’ll explore what CNNs are, understand how they work, and build a real one from scratch (using only numpy) in Python. This post assumes only a basic knowledge of neural networks. My introduction to Neural Networks covers everything you’ll need to know, so you might want to read that first.
Ready? Let’s jump in.
1. Motivation
A classic use case of CNNs is to perform image classification, e.g.
looking at an image of a pet and deciding whether it’s a cat or a dog.
It’s a seemingly simple task - why not just use a normal Neural Network?
Good question.
Reason 1: Images are Big
Images used for Computer Vision problems nowadays are often 224x224
or larger. Imagine building a neural network to process 224x224 color
images: including the 3 color channels (RGB) in the image, that comes
out to 224 x 224 x 3 = 150,528 input features! A typical hidden layer in such a network might have 1024 nodes, so we’d have to train 150,528 x 1024 = 150+ million weights for the first layer alone. Our network would be huge and nearly impossible to train.
It’s not like we need that many weights, either. The nice thing about images is that we know pixels are most useful in the context of their neighbors. Objects in images are made up of small, localized features, like the circular iris of an eye or the square corner of a piece of paper. Doesn’t it seem wasteful for every node in the first hidden layer to look at every pixel?
Reason 2: Positions can change
If you trained a network to detect dogs, you’d want it to be able to detect a dog regardless of where it appears in the image.
Imagine training a network that works well on a certain dog image, but
then feeding it a slightly shifted version of the same image. The dog
would not activate the same neurons, so the network would react completely differently!
We’ll see soon how a CNN can help us mitigate these problems.
2. Dataset
In this post, we’ll tackle the “Hello, World!” of Computer Vision: the MNIST handwritten digit classification problem. It’s simple: given an image, classify it as a digit.
Each image in the MNIST dataset is 28x28 and contains a centered, grayscale digit.
Truth be told, a normal neural network would actually work just fine
for this problem. You could treat each image as a 28 x 28 =
784-dimensional vector, feed that to a 784-dim input layer, stack a few
hidden layers, and finish with an output layer of 10 nodes, 1 for each
digit.
This would only work because the MNIST dataset contains small images that are centered,
so we wouldn’t run into the aforementioned issues of size or shifting.
Keep in mind throughout the course of this post, however, that most real-world image classification problems aren’t this easy.
Enough buildup. Let’s get into CNNs!
3. Convolutions
What are Convolutional Neural Networks?
They’re basically just neural networks that use Convolutional layers, a.k.a. Conv layers, which are based on the mathematical operation of convolution. Conv layers consist of a set of filters, which you can think of as just 2d matrices of numbers. Here’s an example 3x3 filter:
We can use an input image and a filter to produce an output image by convolving the filter with the input image. This consists of
Overlaying the filter on top of the image at some location.
Performing element-wise multiplication between the values in the filter and their corresponding values in the image.
Summing up all the element-wise products. This sum is the output value for the destination pixel in the output image.
Repeating for all locations.
Side Note: We (along with many CNN implementations) are technically actually using cross-correlation
instead of convolution here, but they do almost the same thing. I won’t
go into the difference in this post because it’s not that important,
but feel free to look this up if you’re curious.
That 4-step description was a little abstract, so let’s do an
example. Consider this tiny 4x4 grayscale image and this 3x3 filter:
The numbers in the image represent pixel intensities, where 0 is
black and 255 is white. We’ll convolve the input image and the filter to
produce a 2x2 output image:
To start, let’s overlay our filter in the top left corner of the image:
Next, we perform element-wise multiplication between the overlapping
image values and filter values. Here are the results, starting from the
top left corner and going right, then down:
Image Value | Filter Value | Result
0           | -1           | 0
50          | 0            | 0
0           | 1            | 0
0           | -2           | 0
80          | 0            | 0
31          | 2            | 62
33          | -1           | -33
90          | 0            | 0
0           | 1            | 0

Step 2: Performing element-wise multiplication.
Next, we sum up all the results. That’s easy enough: 62−33=29
Finally, we place our result in the destination pixel of our output
image. Since our filter is overlayed in the top left corner of the input
image, our destination pixel is the top left pixel of the output image:
We do the same thing to generate the rest of the output image:
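As a quick sanity check of the top-left output value we just computed by hand, here is the same element-wise multiply-and-sum in numpy. The values come straight from the table above; the variable names are just for this check.

import numpy as np

# Top-left 3x3 region of the example image, and the 3x3 filter from above.
im_region = np.array([
  [ 0, 50,  0],
  [ 0, 80, 31],
  [33, 90,  0],
])
filt = np.array([
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1],
])

print(np.sum(im_region * filt))  # 29, matching the hand calculation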
3.1 How is this useful?
Let’s zoom out for a second and see this at a higher level. What does
convolving an image with a filter do? We can start by using the example
3x3 filter we’ve been using, which is commonly known as the vertical Sobel filter:
Here’s an example of what the vertical Sobel filter does:
Similarly, there’s also a horizontal Sobel filter:
See what’s happening? Sobel filters are edge-detectors.
The vertical Sobel filter detects vertical edges, and the horizontal
Sobel filter detects horizontal edges. The output images are now easily
interpreted: a bright pixel (one that has a high value) in the output
image indicates that there’s a strong edge around there in the original
image.
Can you see why an edge-detected image might be more useful than the
raw image? Think back to our MNIST handwritten digit classification
problem for a second. A CNN trained on MNIST might look for the digit 1,
for example, by using an edge-detection filter and checking for two
prominent vertical edges near the center of the image. In general, convolution helps us look for specific localized image features (like edges) that we can use later in the network.
3.2 Padding
Remember convolving a 4x4 input image with a 3x3 filter earlier to
produce a 2x2 output image? Often times, we’d prefer to have the output
image be the same size as the input image. To do this, we add zeros
around the image so we can overlay the filter in more places. A 3x3
filter requires 1 pixel of padding:
This is called “same” padding, since the input and
output have the same dimensions. Not using any padding, which is what
we’ve been doing and will continue to do for this post, is sometimes
referred to as “valid” padding.
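Here's a minimal sketch of what that padding looks like in numpy (the array here is just a random placeholder): a 1-pixel border of zeros turns a 4x4 image into a 6x6 one, so a 3x3 filter can produce a 4x4 ("same") output instead of a 2x2 ("valid") one.

import numpy as np

image = np.random.randn(4, 4)

# Add a 1-pixel border of zeros around the image.
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(image.shape, padded.shape)  # (4, 4) (6, 6)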
3.3 Conv Layers
Now that we know how image convolution works and why it’s useful,
let’s see how it’s actually used in CNNs. As mentioned before, CNNs
include conv layers that use a set of filters to turn input images into output images. A conv layer’s primary parameter is the number of filters it has.
For our MNIST CNN, we’ll use a small conv layer with 8 filters as the
initial layer in our network. This means it’ll turn the 28x28 input
image into a 26x26x8 output volume:
Reminder: The output is 26x26x8 and not 28x28x8 because we’re using valid padding, which decreases the input’s width and height by 2.
Each of the 8 filters in the conv layer produces a 26x26 output, so
stacked together they make up a 26x26x8 volume. All of this happens
because of 3 × 3 (filter size) × 8 (number of filters) = only 72 weights!
3.4 Implementing Convolution
Time to put what we’ve learned into code! We’ll implement a conv
layer’s feedforward portion, which takes care of convolving filters with
an input image to produce an output volume. For simplicity, we’ll
assume filters are always 3x3 (which is not true - 5x5 and 7x7 filters
are also very common).
Let’s start implementing a conv layer class:
conv.py
import numpy as np
class Conv3x3:
  # A Convolution layer using 3x3 filters.

  def __init__(self, num_filters):
    self.num_filters = num_filters

    # filters is a 3d array with dimensions (num_filters, 3, 3)
    # We divide by 9 to reduce the variance of our initial values
    self.filters = np.random.randn(num_filters, 3, 3) / 9
The Conv3x3 class takes only one
argument: the number of filters. In the constructor, we store the number
of filters and initialize a random filters array using NumPy’s randn() method.
Note: Dividing by 9 during the initialization is more important than
you may think. If the initial values are too large or too small,
training the network will be ineffective. To learn more, read about Xavier Initialization.
Next, the actual convolution:
conv.py
class Conv3x3:
  # ...

  def iterate_regions(self, image):
    '''
    Generates all possible 3x3 image regions using valid padding.
    - image is a 2d numpy array
    '''
    h, w = image.shape

    for i in range(h - 2):
      for j in range(w - 2):
        im_region = image[i:(i + 3), j:(j + 3)]
        yield im_region, i, j

  def forward(self, input):
    '''
    Performs a forward pass of the conv layer using the given input.
    Returns a 3d numpy array with dimensions (h, w, num_filters).
    - input is a 2d numpy array
    '''
    h, w = input.shape
    output = np.zeros((h - 2, w - 2, self.num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.sum(im_region * self.filters, axis=(1, 2))

    return output
iterate_regions() is a helper generator
method that yields all valid 3x3 image regions for us. This will be
useful for implementing the backwards portion of this class later on.
The line that actually performs the convolutions is the output[i, j] = np.sum(im_region * self.filters, axis=(1, 2)) assignment in forward(). Let’s break it down (a quick shape check follows this list):
We have im_region, a 3x3 array containing the relevant image region.
We have self.filters, a 3d array.
We do im_region * self.filters, which uses numpy’s broadcasting feature to element-wise multiply the two arrays. The result is a 3d array with the same dimension as self.filters.
We np.sum() the result of the previous step using axis=(1,2), which produces a 1d array of length num_filters where each element contains the convolution result for the corresponding filter.
We assign the result to output[i, j], which contains convolution results for pixel (i, j) in the output.
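Here is the quick shape check mentioned above, using random arrays with the same shapes as im_region and self.filters (the names are reused purely for clarity):

import numpy as np

im_region = np.random.randn(3, 3)   # one 3x3 image region
filters = np.random.randn(8, 3, 3)  # 8 filters, as in Conv3x3(8)

products = im_region * filters                # broadcasts to shape (8, 3, 3)
pixel_output = np.sum(products, axis=(1, 2))  # shape (8,): one value per filter
print(products.shape, pixel_output.shape)     # (8, 3, 3) (8,)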
The sequence above is performed for each pixel in the output until we
obtain our final output volume! Let’s give our code a test run:
cnn.py
import mnist
from conv import Conv3x3
# The mnist package handles the MNIST dataset for us!
# Learn more at https://github.com/datapythonista/mnist
train_images = mnist.train_images()
train_labels = mnist.train_labels()
conv = Conv3x3(8)
output = conv.forward(train_images[0])
print(output.shape)  # (26, 26, 8)
Looks good so far.
Note: in our Conv3x3 implementation, we assume the input is a 2d
numpy array for simplicity, because that’s how our MNIST images are
stored. This works for us because we use it as the first layer in our
network, but most CNNs have many more Conv layers. If we were building a
bigger network that needed to use Conv3x3 multiple times, we’d have to make the input be a 3d numpy array.
4. Pooling
Neighboring pixels in images tend to have similar values, so conv
layers will typically also produce similar values for neighboring pixels
in outputs. As a result, much of the information contained in a conv layer’s output is redundant.
For example, if we use an edge-detecting filter and find a strong edge
at a certain location, chances are that we’ll also find relatively
strong edges at locations 1 pixel shifted from the original one.
However, these are all the same edge! We’re not finding anything new.
Pooling layers solve this problem. All they do is reduce the size of the input they’re given by (you guessed it) pooling values together in the input. The pooling is usually done by a simple operation like max, min, or average. Here’s an example of a Max Pooling layer with a pooling size of 2:
To perform max pooling, we traverse the input image in 2x2 blocks (because pool size = 2) and put the max value into the output image at the corresponding pixel. That’s it! Pooling divides the input’s width and height by the pool size.
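As a quick illustration (separate from the MaxPool2 class we'll implement next), max pooling a 4x4 array with a pool size of 2 can be written with a reshape trick in numpy; the example values are arbitrary, and the trick only works when the height and width are divisible by the pool size:

import numpy as np

x = np.array([
  [1, 2, 5, 6],
  [3, 4, 7, 8],
  [9, 7, 1, 0],
  [8, 6, 2, 3],
])

# Split the 4x4 array into 2x2 blocks and take each block's max.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 8]
#  [9 3]]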
For our MNIST CNN, we’ll place a Max Pooling layer with a pool size of 2
right after our initial conv layer. The pooling layer will transform a
26x26x8 input into a 13x13x8 output:
4.1 Implementing Pooling
We’ll implement a MaxPool2 class with the same methods as our conv class from the previous section:
maxpool.py
import numpy as np
class MaxPool2:
  # A Max Pooling layer using a pool size of 2.

  def iterate_regions(self, image):
    '''
    Generates non-overlapping 2x2 image regions to pool over.
    - image is a 2d numpy array
    '''
    h, w, _ = image.shape
    new_h = h // 2
    new_w = w // 2

    for i in range(new_h):
      for j in range(new_w):
        im_region = image[(i * 2):(i * 2 + 2), (j * 2):(j * 2 + 2)]
        yield im_region, i, j

  def forward(self, input):
    '''
    Performs a forward pass of the maxpool layer using the given input.
    Returns a 3d numpy array with dimensions (h / 2, w / 2, num_filters).
    - input is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    h, w, num_filters = input.shape
    output = np.zeros((h // 2, w // 2, num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.amax(im_region, axis=(0, 1))

    return output
This class works similarly to the Conv3x3 class we implemented previously. The critical line is again the output[i, j] = np.amax(im_region, axis=(0, 1)) assignment: to find the max of a given image region, we use np.amax(), numpy’s array max method. We set axis=(0, 1) because we only want to maximize over the first two dimensions, height and width, and not the third, num_filters.
Let’s test it!
cnn.py
import mnist
from conv import Conv3x3
from maxpool import MaxPool2
# The mnist package handles the MNIST dataset for us!
# Learn more at https://github.com/datapythonista/mnist
train_images = mnist.train_images()
train_labels = mnist.train_labels()
conv = Conv3x3(8)
pool = MaxPool2()
output = conv.forward(train_images[0])
output = pool.forward(output)
print(output.shape)  # (13, 13, 8)
Our MNIST CNN is starting to come together!
5. Softmax
To complete our CNN, we need to give it the ability to actually make
predictions. We’ll do that by using the standard final layer for a
multiclass classification problem: the Softmax layer, a standard fully-connected (dense) layer that uses the softmax activation function.
Reminder: fully-connected layers have every node connected to every
output from the previous layer. We used fully-connected layers in my intro to Neural Networks if you need a refresher.
Softmax turns arbitrary real values into probabilities. The math behind it is pretty simple: given some numbers,
Raise e (the mathematical constant) to the power of each of those numbers.
Sum up all the exponentials (powers of e). This result is the denominator.
Use each number’s exponential as its numerator.
Probability = Numerator / Denominator.
Written more fancily, Softmax performs the following transform on n numbers x_1, …, x_n: s(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum runs over j = 1, …, n.
The outputs of the Softmax transform are always in the range [0,1] and add up to 1. Hence, they’re probabilities.
Here’s a simple example using the numbers -1, 0, 3, and 5:
Denominator = e^(-1) + e^0 + e^3 + e^5 = 169.87

x   | e^x    | Probability (e^x / 169.87)
-1  | 0.368  | 0.002
0   | 1      | 0.006
3   | 20.09  | 0.118
5   | 148.41 | 0.874
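You can reproduce this table with a few lines of numpy; this snippet is just a check of the arithmetic above:

import numpy as np

xs = np.array([-1, 0, 3, 5])
exps = np.exp(xs)            # [0.368, 1, 20.09, 148.41]
denominator = np.sum(exps)   # about 169.87
probs = exps / denominator   # [0.002, 0.006, 0.118, 0.874]
print(denominator, probs, probs.sum())  # the probabilities sum to 1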
5.1 Usage
We’ll use a softmax layer with 10 nodes, one representing each digit,
as the final layer in our CNN. Each node in the layer will be connected
to every input. After the softmax transformation is applied, the digit represented by the node with the highest probability will be the output of the CNN!
5.2 Cross-Entropy Loss
You might have just thought to yourself, why bother transforming the outputs into probabilities? Won’t the highest output value always have the highest probability? If you did, you’re absolutely right. We don’t actually need to use softmax to predict a digit - we could just pick the digit with the highest output from the network!
What softmax really does is help us quantify how sure we are of our prediction, which is useful when training and evaluating our CNN. More specifically, using softmax lets us use cross-entropy loss, which takes into account how sure we are of each prediction. Here’s how we calculate cross-entropy loss: L = −ln(p_c)
where c is the correct class (in our case, the correct digit), p_c is the predicted probability for class c, and ln is the natural log. As always, a lower loss is better. For example, in the best case, we’d have p_c = 1, L = −ln(1) = 0.
In a more realistic case, we might have p_c = 0.8, L = −ln(0.8) = 0.223.
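As a quick check of those numbers (and of the random-guessing loss we'll see at the end of this post), here's the same formula in numpy:

import numpy as np

# Cross-entropy loss L = -ln(p_c) for a few values of p_c
for p_c in [1.0, 0.8, 0.1]:
  print(p_c, -np.log(p_c))  # 0.0, then about 0.223, then about 2.302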
We’ll be seeing cross-entropy loss again later on in this post, so keep it in mind!
5.3 Implementing Softmax
You know the drill by now - let’s implement a Softmax layer class:
softmax.py
import numpy as np
class Softmax:
  # A standard fully-connected layer with softmax activation.

  def __init__(self, input_len, nodes):
    # We divide by input_len to reduce the variance of our initial values
    self.weights = np.random.randn(input_len, nodes) / input_len
    self.biases = np.zeros(nodes)

  def forward(self, input):
    '''
    Performs a forward pass of the softmax layer using the given input.
    Returns a 1d numpy array containing the respective probability values.
    - input can be any array with any dimensions.
    '''
    input = input.flatten()

    input_len, nodes = self.weights.shape

    totals = np.dot(input, self.weights) + self.biases
    exp = np.exp(totals)
    return exp / np.sum(exp, axis=0)
There’s nothing too complicated here. A few highlights:
We flatten() the input to make it easier to work with, since we no longer need its shape.
np.dot() multiplies input by self.weights as a matrix product: for each output node, it multiplies the input values by that node’s weights element-wise and sums the results.
np.exp() calculates the exponentials used for Softmax.
We’ve now completed the entire forward pass of our CNN! Putting it together:
cnn.py
import mnist
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax
# We only use the first 1k testing examples (out of 10k total)
# in the interest of time. Feel free to change this if you want.
test_images = mnist.test_images()[:1000]
test_labels = mnist.test_labels()[:1000]

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

print('MNIST CNN initialized!')

loss = 0
num_correct = 0
for i, (im, label) in enumerate(zip(test_images, test_labels)):
  # Do a forward pass.
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc

  # Print stats every 100 steps.
  if i % 100 == 99:
    print(
      '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
      (i + 1, loss / 100, num_correct)
    )
    loss = 0
    num_correct = 0
Running cnn.py gives us output similar to this:
MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.302 | Accuracy: 11%
[Step 200] Past 100 steps: Average Loss 2.302 | Accuracy: 8%
[Step 300] Past 100 steps: Average Loss 2.302 | Accuracy: 3%
[Step 400] Past 100 steps: Average Loss 2.302 | Accuracy: 12%
This makes sense: with random weight initialization, you’d expect the
CNN to be only as good as random guessing. Random guessing would yield
10% accuracy (since there are 10 classes) and a cross-entropy loss of −ln(0.1)=2.302, which is what we get! Want to try or tinker with this code yourself? Run this CNN in your browser. It’s also available on Github.
6. Conclusion
That’s the end of this introduction to CNNs! In this post, we
Motivated why CNNs might be more useful for certain problems, like image classification.
Introduced the MNIST handwritten digit dataset.
Learned about Conv layers, which convolve filters with images to produce more useful outputs.
Talked about Pooling layers, which can help prune everything but the most useful features.
Implemented a Softmax layer so we could use cross-entropy loss.
There’s still much more that we haven’t covered yet, such as how to actually train a CNN. Part 2 of this CNN series will do a deep-dive on training a CNN, including deriving gradients and implementing backprop. Subscribe to my newsletter if you want to get an email when Part 2 comes out (soon)!
If you’re eager to see a trained CNN in action: this example Keras CNN trained on MNIST achieves 99.25% accuracy. CNNs are powerful!