TL;DR let’s train a network on a rare visual language together—join us!
Weights
& Biases makes collaborative deep learning easy for teams by
organizing experiments and notes in a shared workspace, tracking all the
code, standardizing all other inputs and outputs, and handling the
boring parts (like plotting results) so you can focus on the most
exciting part—solving an interesting and meaningful puzzle in
conversation with others.
Kmnist Benchmark
With
public benchmarks, we want to explore collaboration at the broader
community level: how can individual effort and ideas be maximally
accessible to and useful for the field? We add wandb to the Kuzushiji-MNIST dataset
(kmnist): images of 10 different characters of a classical Japanese
cursive script. In three commands (at the bottom of this post), you can
set up and run your own version, play with hyperparameters, and
visualize model performance in your browser.
Dataset
We
chose this dataset because it is a fresh reimagining of the well-known
baseline of handwritten digits (mnist). It preserves the technical
simplicity of mnist and offers more creative headroom, since the
solution space is less explored and visual intuition is unreliable (only
a few experts can read Kuzushiji). Mnist generalization ends at 10
digits; kmnist extends to Kuzushiji-49 (270,912 images, 49 characters)
and the heavily-imbalanced Kuzushiji-Kanji (140,426 images, 3832
characters, some with 12 distinct variants). While mnist is basically
solved, kmnist can help us understand the structure of a disappearing
language and digitize ~300,000 old Japanese books (see this paper for more details).
Small incentive
To
incentivize initial work on the kmnist benchmark, we’re offering $1000
in compute credits to the contributor who achieves the highest validation accuracy
within six weeks (by July 8th). We hope you will use those to make
something awesome!
The larger incentive
We are
developing benchmarks to encourage clear documentation, synthesis of
background and new ideas, and compression of research effort. We can
build better and faster by starting with a team at the top of the
collectively-reinforced foundation instead of alone at the bottom of a
pile of papers and blog posts. We hope you’ll help us nudge the machine
learning world in this direction by collaborating on the benchmark here.
How to start
We wanted to make it ridiculously easy to participate. You can go to https://app.wandb.ai/wandb/kmnist/benchmark or follow these commands:
1. Get the code and training data:
The vast majority of processors in the world are actually microcontroller units (MCUs), which find widespread use performing simple control tasks in applications ranging from automobiles to medical devices and office equipment. The Internet of Things (IoT) promises to inject machine learning into many of these everyday objects via tiny, cheap MCUs. However, these resource-impoverished hardware platforms severely limit the complexity of machine learning models that can be deployed. For example, although convolutional neural networks (CNNs) achieve state-of-the-art results on many visual recognition tasks, CNN inference on MCUs is challenging due to severe memory limitations. To circumvent the memory challenge associated with CNNs, various alternatives have been proposed that do fit within the memory budget of an MCU, albeit at the cost of prediction accuracy. This paper challenges the idea that CNNs are not suitable for deployment on MCUs. We demonstrate that it is possible to automatically design CNNs which generalize well, while also being small enough to fit onto memory-limited MCUs. Our Sparse Architecture Search method combines neural architecture search with pruning in a single, unified approach, which learns superior models on four popular IoT datasets. The CNNs we find are more accurate and up to 4.35× smaller than previous approaches, while meeting the strict MCU working memory constraint.
Code
isn’t just meant to be executed. Code is also a means of communication
across a team, a way to describe to others the solution to a problem.
Readable code is not a nice-to-have, it is a fundamental part of what
writing code is about. This involves factoring code clearly, picking
self-explanatory variable names, and inserting comments to describe
anything that’s implicit.
Ask
not what your pull request can do for your next promotion, ask what
your pull request can do for your users and your community. Avoid
“conspicuous contribution” at all cost. Let no feature be added if it
isn’t clearly helping with the purpose of your product.
Taste
applies to code, too. Taste is a constraint-satisfaction process
regularized by a desire for simplicity. Keep a bias toward simplicity.
It’s
okay to say no — just because someone asks for a feature doesn’t mean
you should do it. Every feature has a cost that goes beyond the initial
implementation: maintenance cost, documentation cost, and cognitive cost
for your users. Always ask: Should we really do this? Often, the answer
is simply no.
When
you say yes to a request for supporting a new use case, remember that
literally adding what the user requested is often not the optimal
choice. Users are focused on their own specific use case, and you must
counter this with a holistic and principled vision of the whole project.
Often, the right answer is to extend an existing feature.
Invest
in continuous integration and aim for full unit test coverage. Make
sure you are in an environment where you can code with confidence; if
that isn’t the case, start by focusing on building the right
infrastructure.
It’s
okay not to plan everything in advance. Try things and see how they
turn out. Revert incorrect choices early. Make sure you create an
environment where that is possible.
Good
software makes hard things easy. Just because a problem looks difficult
at first doesn’t mean the solution will have to be complex or hard to
use. Too often, engineers go with reflex solutions that introduce
undesirable complexity (Let’s use ML! Let’s build an app! Let’s add blockchain!)
in situations where a far easier, though maybe less obvious,
alternative is available. Before you write any code, make sure your
solution of choice cannot be made any simpler. Approach everything from
first principles.
Avoid
implicit rules. Implicit rules that you find yourself developing should
always be made explicit and shared with others or automated. Whenever
you find yourself coming up with a recurring, quasi-algorithmic
workflow, you should seek to formalize it into a documented process, so
that other team members will benefit from the experience. In addition,
you should seek to automate in software any part of such a workflow that
can be automated (e.g., correctness checks).
The
total impact of your choices should be taken into account in the design
process, not just the bits you want to focus on — such as revenue or
growth. Beyond the metrics you are monitoring, what total impact does
your software have on its users, on the world? Are there undesirable
side effects that outweigh the value proposition? What can you do to
address them while preserving the software’s usefulness?
Design for ethics. Bake your values into your creations.
On API Design
Your
API has users, thus it has a user experience. In every decision you
make, always keep the user in mind. Have empathy for your users, whether
they are beginners or experienced developers.
Always
seek to minimize the cognitive load imposed on your users in the course
of using your API. Automate what can be automated, minimize the actions
and choices needed from the user, don’t expose options that are
unimportant, design simple and consistent workflows that reflect simple
and consistent mental models.
Simple
things should be simple, complex things should be possible. Don’t
increase the cognitive load of common use cases for the sake of niche
use cases, even minimally.
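As a hypothetical illustration (the function and its arguments below are invented for this note, not taken from any real library), an API can keep the common case trivial while leaving the niche options reachable:

# Hypothetical image-loading API: the common case is one call with defaults,
# while less common needs stay possible via keyword arguments.
def load_image(path, target_size=(224, 224), grayscale=False):
  """Load an image from disk; placeholder body for illustration."""
  ...

# Simple thing, simple call:
image = load_image("cat.jpg")

# Complex thing, still possible, without burdening the common case:
thumb = load_image("cat.jpg", target_size=(64, 64), grayscale=True)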
If
the cognitive load of a workflow is sufficiently low, it should be
possible for a user to go through it from memory (without looking up a
tutorial or documentation) after having done it once or twice.
Seek
to have an API that matches the mental models of domain experts and
practitioners. Someone who has domain experience, but no experience with
your API, should be able to intuitively understand your API using
minimal documentation, mostly just by looking at a couple of code
examples and seeing what objects are available and what their signatures
are.
The
meaning of an argument should be understandable without having any
context about the underlying implementation. Arguments that have to be
specified by users should relate to the mental models that the users
have about the problem, not to implementation details in your code. An
API is all about the problem it solves, not about how the software works
in the background.
The
most powerful mental models are modular and hierarchical: simple at a
high level, yet precise as you need to go into details. In the same way,
a good API is modular and hierarchical: easy to approach, yet
expressive. There is a balance to strike between having complex
signatures on fewer objects, and having more objects with simpler
signatures. A good API has a reasonable number of objects, with
reasonably simple signatures.
Your
API is inevitably a reflection of your implementation choices, in
particular your choice of data structures. To achieve an intuitive API,
you must choose data structures that naturally fit the domain at
hand — that match the mental models of domain experts.
Deliberately design end-to-end workflows, not a set of atomic features. Most developers approach API design by asking: What capabilities should be available? Let’s have configuration options for them. Instead, ask: What
are the use cases for this tool? For each use case, what is the optimal
sequence of user actions? What’s the easiest API that could support
this workflow? Atomic options in your API should answer a clear
need that arises in a high-level workflow — they should not be added
“because someone might need it.”
Error
messages, and in general any feedback provided to a user in the course
of interacting with your API, is part of the API. Interactivity and
feedback are integral to the user experience. Design your API’s error
messages deliberately.
Because
code is communication, naming matters — whether naming a project or a
variable. Names reflect how you think about a problem. Avoid overly
generic names (x, variable, parameter), avoid OverlyLongAndSpecificNamingPatterns, avoid terms that can create unnecessary friction (master, slave),
and make sure you are consistent in your naming choices. Naming
consistency means both internal naming consistency (don’t call “dim”
what is called “axis” in other places) and consistency with established
conventions for the problem domain. Before settling on a name, make sure
to look up existing names used by domain experts (or other APIs).
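A small hypothetical sketch of the kind of inconsistency to avoid (the function names and signatures are invented for illustration):

# Inconsistent: the same concept goes by two names within one API.
def reduce_mean(x, axis=None): ...
def reduce_max(x, dim=None): ...   # "dim" here, "axis" there

# Consistent: pick the domain-standard term and use it everywhere.
def reduce_mean(x, axis=None): ...
def reduce_max(x, axis=None): ...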
Documentation
is central to the user experience of your API. It is not an add-on.
Invest in high-quality documentation; you will see higher returns than
investing in more features.
Show,
don’t tell: Your documentation should not talk about how the software
works, it should show how to use it. Show code examples for end-to-end
workflows; show code examples for each and every common use case and key
feature of your API.
On Software Careers
Career
progress is not how many people you manage, it is how much of an impact
you make: the differential between a world with and without your work.
Software
development is teamwork; it is about relationships as much as it is
about technical ability. Be a good teammate. As you go on your way, stay
in touch with people.
Technology
is never neutral. If your work has any impact on the world, then this
impact has a moral direction. The seemingly innocuous technical choices
we make in software products modulate the terms of access to technology,
its usage incentives, who will benefit, and who will suffer. Technical
choices are also ethical choices. Thus, always be deliberate and
explicit about the values you want your choices to support. Design for
ethics. Bake your values into your creations. Never think, “I’m just building the capability; that in itself is neutral.” It is not, because the way you build it determines how it will get used.
Self-direction — agency
over your work and your circumstances — is the key to life
satisfaction. Make sure you grant sufficient self-direction to the
people around you, and make sure your career choices result in greater
agency for yourself.
Build
what the world needs — not just what you wish you had. Too often,
technologists live rarefied lives and focus on products catering to
their own specific needs. Seek opportunities to broaden your life
experience, which will give you better visibility into what the world
needs.
When
making any choice with long-term repercussions, place your values above
short-term self-interest and passing emotions — such as greed or fear.
Know what your values are, and let them guide you.
When
we find ourselves in a conflict, it’s a good idea to pause to
acknowledge our shared values and our shared goals, and remind ourselves
that we are, almost certainly, on the same side.
Productivity boils down to high-velocity decision-making and a bias for action. This requires a) good intuition, which comes from experience, so as to make generally correct decisions given partial information, b)
a keen awareness of when to move more carefully and wait for more
information, because the cost of an incorrect decision would be greater
than the cost of the delay. The optimal velocity/quality decision-making
tradeoff can vary greatly in different environments.
Making
decisions faster means you make more decisions over the course of your
career, which will give you stronger intuition about the correctness of
available options. Experience is key to productivity, and greater
productivity will provide you with more experience: a virtuous cycle.
In
situations where you are aware that your intuition is lacking, adhere
to abstract principles. Build up lists of tried-and-true principles
throughout your career. Principles are formalized intuition that
generalize to a broader range of situations than raw pattern recognition
(which requires direct and extensive experience of similar situations).
Posted by Mingxing Tan, Staff Software Engineer, and Quoc V. Le, Principal Scientist, Google AI
Convolutional neural networks
(CNNs) are commonly developed at a fixed resource cost, and then scaled
up in order to achieve better accuracy when more resources are made
available. For example, ResNet can be scaled up from ResNet-18 to ResNet-200 by increasing the number of layers, and recently, GPipe achieved 84.3% ImageNet
top-1 accuracy by scaling up a baseline CNN by a factor of four. The
conventional practice for model scaling is to arbitrarily increase the
CNN depth or width, or to use larger input image resolution for training
and evaluation. While these methods do improve accuracy, they usually
require tedious manual tuning, and still often yield suboptimal
performance. What if, instead, we could find a more principled method to
scale up a CNN to obtain better accuracy and efficiency?
In our ICML 2019 paper, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”, we propose a novel model scaling method that uses a simple yet highly effective compound coefficient
to scale up CNNs in a more structured manner. Unlike conventional
approaches that arbitrarily scale network dimensions, such as width,
depth and resolution, our method uniformly scales each dimension with a
fixed set of scaling coefficients. Powered by this novel scaling method
and recent progress on AutoML,
we have developed a family of models, called EfficientNets, which
surpass state-of-the-art accuracy with up to 10x better efficiency
(smaller and faster).
Compound Model Scaling: A Better Way to Scale Up CNNs
In order to understand the effect of scaling the network, we
systematically studied the impact of scaling different dimensions of the
model. While scaling individual dimensions improves model performance,
we observed that balancing all dimensions of the network—width, depth,
and image resolution—against the available resources would best improve
overall performance.
The first step in the compound scaling method is to perform a grid search
to find the relationship between different scaling dimensions of the
baseline network under a fixed resource constraint (e.g., 2x more FLOPS). This
determines the appropriate scaling coefficient for each of the
dimensions mentioned above. We then apply those coefficients to scale up
the baseline network to the desired target model size or computational
budget.
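To make the procedure concrete, here is a rough Python sketch of compound scaling. The helper function and the coefficient values below are illustrative assumptions for this note, not code from the paper; the point is only that a single compound coefficient phi drives depth, width, and resolution together via per-dimension constants found once on the baseline.

# Rough sketch (names and values illustrative): scale all three dimensions
# with one compound coefficient phi.
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
  depth = base_depth * (alpha ** phi)            # layer-count multiplier
  width = base_width * (beta ** phi)             # channel-count multiplier
  resolution = base_resolution * (gamma ** phi)  # input image size
  return depth, width, resolution

# Scaling a hypothetical baseline up by phi = 2:
print(compound_scale(1.0, 1.0, 224, phi=2))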
Comparison of
different scaling methods. Unlike conventional scaling methods (b)-(d)
that arbitrarily scale a single dimension of the network, our compound
scaling method uniformly scales up all dimensions in a principled way.
This compound scaling method consistently improves model accuracy and efficiency for scaling up existing models such as MobileNet (+1.4% ImageNet accuracy) and ResNet (+0.7%), compared to conventional scaling methods.
EfficientNet Architecture
The effectiveness of model scaling also relies heavily on the baseline
network. So, to further improve performance, we have also developed a
new baseline network by performing a neural architecture search using the AutoML MNAS framework,
which optimizes both accuracy and efficiency (FLOPS). The resulting
architecture uses mobile inverted bottleneck convolution (MBConv),
similar to MobileNetV2 and MnasNet,
but is slightly larger due to an increased FLOP budget. We then scale
up the baseline network to obtain a family of models, called EfficientNets.
The architecture for our baseline network EfficientNet-B0 is simple and clean, making it easier to scale and generalize.
EfficientNet Performance
We have compared our EfficientNets with other existing CNNs on ImageNet.
In general, the EfficientNet models achieve both higher accuracy and
better efficiency over existing CNNs, reducing parameter size and FLOPS
by an order of magnitude. For example, in the high-accuracy regime, our
EfficientNet-B7 reaches state-of-the-art 84.4% top-1 / 97.1% top-5
accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on CPU
inference than the previous GPipe. Compared with the widely used ResNet-50, our EfficientNet-B4 uses similar FLOPS, while improving the top-1 accuracy from 76.3% of ResNet-50 to 82.6% (+6.3%).
Model Size vs. Accuracy Comparison. EfficientNet-B0 is the baseline network developed by AutoML MNAS,
while EfficientNet-B1 to B7 are obtained by scaling up the baseline
network. In particular, our EfficientNet-B7 achieves new
state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy, while being 8.4x
smaller than the best existing CNN.
Though EfficientNets perform well on ImageNet, to be
most useful, they should also transfer to other datasets. To evaluate
this, we tested EfficientNets on eight widely used transfer learning
datasets. EfficientNets achieved state-of-the-art accuracy in 5 out of
the 8 datasets, such as CIFAR-100 (91.7%) and Flowers
(98.8%), with an order of magnitude fewer parameters (up to 21x
parameter reduction), suggesting that our EfficientNets also transfer
well.
By providing significant improvements to model efficiency, we expect
EfficientNets could potentially serve as a new foundation for future
computer vision tasks. Therefore, we have open-sourced all EfficientNet
models, which we hope can benefit the larger machine learning community.
You can find the EfficientNet source code and TPU training scripts here.
Acknowledgements: Special thanks to Hongkun Yu, Ruoming Pang, Vijay Vasudevan, Alok
Aggarwal, Barret Zoph, Xianzhi Du, Xiaodan Song, Samy Bengio, Jeff Dean,
and the Google Brain team.
Pythia is a modular framework for vision and language multimodal research. Built on top
of PyTorch, it features:
Model Zoo: Reference implementations for state-of-the-art vision and language models, including
LoRRA (SoTA on VQA and TextVQA),
Pythia model (VQA 2018 challenge winner) and BAN.
Multi-Tasking: Support for multi-tasking, which allows training on multiple datasets together.
Datasets: Built-in support for various datasets, including VQA, VizWiz, TextVQA and VisualDialog.
Modules: Provides implementations of many layers commonly used in the vision and language domain.
Distributed: Support for distributed training based on DataParallel as well as DistributedDataParallel.
Unopinionated: Unopinionated about the dataset and model implementations built on top of it.
Customization: Custom losses, metrics, scheduling, optimizers, tensorboard; suits all your custom needs.
You can use Pythia to bootstrap your next vision and language multimodal research project.
Pythia can also act as a starter codebase for challenges around vision and
language datasets (TextVQA challenge, VQA challenge).
A simple guide to what CNNs are, how they work, and how to build one from scratch in Python.
There’s
been a lot of buzz about Convolutional Neural Networks (CNNs) in the past
few years, especially because of how they’ve revolutionized the field
of Computer Vision. In this post, we’ll explore what CNNs are, understand how they work, and build a real one from scratch (using only numpy) in Python. This post assumes only a basic knowledge of neural networks. My introduction to Neural Networks covers everything you’ll need to know, so you might want to read that first.
Ready? Let’s jump in.
1. Motivation
A classic use case of CNNs is to perform image classification, e.g.
looking at an image of a pet and deciding whether it’s a cat or a dog.
It’s a seemingly simple task - why not just use a normal Neural Network?
Good question.
Reason 1: Images are Big
Images used for Computer Vision problems nowadays are often 224x224
or larger. Imagine building a neural network to process 224x224 color
images: including the 3 color channels (RGB) in the image, that comes
out to 224 x 224 x 3 = 150,528 input features! A typical hidden layer in such a network might have 1024 nodes, so we’d have to train 150,528 x 1024 = 150+ million weights for the first layer alone. Our network would be huge and nearly impossible to train.
It’s not like we need that many weights, either. The nice thing about images is that we know pixels are most useful in the context of their neighbors. Objects in images are made up of small, localized features, like the circular iris of an eye or the square corner of a piece of paper. Doesn’t it seem wasteful for every node in the first hidden layer to look at every pixel?
Reason 2: Positions can change
If you trained a network to detect dogs, you’d want it to be able to detect a dog regardless of where it appears in the image.
Imagine training a network that works well on a certain dog image, but
then feeding it a slightly shifted version of the same image. The dog
would not activate the same neurons, so the network would react completely differently!
We’ll see soon how a CNN can help us mitigate these problems.
2. Dataset
In this post, we’ll tackle the “Hello, World!” of Computer Vision: the MNIST handwritten digit classification problem. It’s simple: given an image, classify it as a digit.
Each image in the MNIST dataset is 28x28 and contains a centered, grayscale digit.
Truth be told, a normal neural network would actually work just fine
for this problem. You could treat each image as a 28 x 28 =
784-dimensional vector, feed that to a 784-dim input layer, stack a few
hidden layers, and finish with an output layer of 10 nodes, 1 for each
digit.
This would only work because the MNIST dataset contains small images that are centered,
so we wouldn’t run into the aforementioned issues of size or shifting.
Keep in mind throughout the course of this post, however, that most real-world image classification problems aren’t this easy.
Enough buildup. Let’s get into CNNs!
3. Convolutions
What are Convolutional Neural Networks?
They’re basically just neural networks that use Convolutional layers, a.k.a. Conv layers, which are based on the mathematical operation of convolution. Conv layers consist of a set of filters, which you can think of as just 2d matrices of numbers. Here’s an example 3x3 filter:
We can use an input image and a filter to produce an output image by convolving the filter with the input image. This consists of
Overlaying the filter on top of the image at some location.
Performing element-wise multiplication between the values in the filter and their corresponding values in the image.
Summing up all the element-wise products. This sum is the output value for the destination pixel in the output image.
Repeating for all locations.
Side Note: We (along with many CNN implementations) are technically actually using cross-correlation
instead of convolution here, but they do almost the same thing. I won’t
go into the difference in this post because it’s not that important,
but feel free to look this up if you’re curious.
That 4-step description was a little abstract, so let’s do an
example. Consider this tiny 4x4 grayscale image and this 3x3 filter:
The numbers in the image represent pixel intensities, where 0 is
black and 255 is white. We’ll convolve the input image and the filter to
produce a 2x2 output image:
To start, let’s overlay our filter in the top left corner of the image:
Next, we perform element-wise multiplication between the overlapping
image values and filter values. Here are the results, starting from the
top left corner and going right, then down:
Image Value | Filter Value | Result
0           | -1           | 0
50          | 0            | 0
0           | 1            | 0
0           | -2           | 0
80          | 0            | 0
31          | 2            | 62
33          | -1           | -33
90          | 0            | 0
0           | 1            | 0

Step 2: Performing element-wise multiplication.
Next, we sum up all the results. That’s easy enough: 62−33=29
Finally, we place our result in the destination pixel of our output
image. Since our filter is overlayed in the top left corner of the input
image, our destination pixel is the top left pixel of the output image:
We do the same thing to generate the rest of the output image:
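As a quick sanity check of the top-left output value we just computed by hand, here is the same element-wise multiply-and-sum in numpy. The values come straight from the table above; the variable names are just for this check.

import numpy as np

# Top-left 3x3 region of the example image, and the 3x3 filter from above.
im_region = np.array([
  [ 0, 50,  0],
  [ 0, 80, 31],
  [33, 90,  0],
])
filt = np.array([
  [-1, 0, 1],
  [-2, 0, 2],
  [-1, 0, 1],
])

print(np.sum(im_region * filt))  # 29, matching the hand calculation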
3.1 How is this useful?
Let’s zoom out for a second and see this at a higher level. What does
convolving an image with a filter do? We can start by using the example
3x3 filter we’ve been using, which is commonly known as the vertical Sobel filter:
Here’s an example of what the vertical Sobel filter does:
Similarly, there’s also a horizontal Sobel filter:
See what’s happening? Sobel filters are edge-detectors.
The vertical Sobel filter detects vertical edges, and the horizontal
Sobel filter detects horizontal edges. The output images are now easily
interpreted: a bright pixel (one that has a high value) in the output
image indicates that there’s a strong edge around there in the original
image.
Can you see why an edge-detected image might be more useful than the
raw image? Think back to our MNIST handwritten digit classification
problem for a second. A CNN trained on MNIST might look for the digit 1,
for example, by using an edge-detection filter and checking for two
prominent vertical edges near the center of the image. In general, convolution helps us look for specific localized image features (like edges) that we can use later in the network.
3.2 Padding
Remember convolving a 4x4 input image with a 3x3 filter earlier to
produce a 2x2 output image? Often times, we’d prefer to have the output
image be the same size as the input image. To do this, we add zeros
around the image so we can overlay the filter in more places. A 3x3
filter requires 1 pixel of padding:
This is called “same” padding, since the input and
output have the same dimensions. Not using any padding, which is what
we’ve been doing and will continue to do for this post, is sometimes
referred to as “valid” padding.
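Here's a minimal sketch of what that padding looks like in numpy (the array here is just a random placeholder): a 1-pixel border of zeros turns a 4x4 image into a 6x6 one, so a 3x3 filter can produce a 4x4 ("same") output instead of a 2x2 ("valid") one.

import numpy as np

image = np.random.randn(4, 4)

# Add a 1-pixel border of zeros around the image.
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(image.shape, padded.shape)  # (4, 4) (6, 6)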
3.3 Conv Layers
Now that we know how image convolution works and why it’s useful,
let’s see how it’s actually used in CNNs. As mentioned before, CNNs
include conv layers that use a set of filters to turn input images into output images. A conv layer’s primary parameter is the number of filters it has.
For our MNIST CNN, we’ll use a small conv layer with 8 filters as the
initial layer in our network. This means it’ll turn the 28x28 input
image into a 26x26x8 output volume:
Reminder: The output is 26x26x8 and not 28x28x8 because we’re using valid padding, which decreases the input’s width and height by 2.
Each of the 8 filters in the conv layer produces a 26x26 output, so
stacked together they make up a 26x26x8 volume. All of this happens
because of 3 × 3 (filter size) × 8 (number of filters) = only 72 weights!
3.4 Implementing Convolution
Time to put what we’ve learned into code! We’ll implement a conv
layer’s feedforward portion, which takes care of convolving filters with
an input image to produce an output volume. For simplicity, we’ll
assume filters are always 3x3 (which is not true - 5x5 and 7x7 filters
are also very common).
Let’s start implementing a conv layer class:
conv.py
import numpy as np
class Conv3x3:
  # A Convolution layer using 3x3 filters.

  def __init__(self, num_filters):
    self.num_filters = num_filters

    # filters is a 3d array with dimensions (num_filters, 3, 3)
    # We divide by 9 to reduce the variance of our initial values
    self.filters = np.random.randn(num_filters, 3, 3) / 9
The Conv3x3 class takes only one
argument: the number of filters. In the constructor, we store the number
of filters and initialize a random filters array using NumPy’s randn() method.
Note: Dividing by 9 during the initialization is more important than
you may think. If the initial values are too large or too small,
training the network will be ineffective. To learn more, read about Xavier Initialization.
Next, the actual convolution:
conv.py
class Conv3x3:
  # ...

  def iterate_regions(self, image):
    '''
    Generates all possible 3x3 image regions using valid padding.
    - image is a 2d numpy array
    '''
    h, w = image.shape

    for i in range(h - 2):
      for j in range(w - 2):
        im_region = image[i:(i + 3), j:(j + 3)]
        yield im_region, i, j

  def forward(self, input):
    '''
    Performs a forward pass of the conv layer using the given input.
    Returns a 3d numpy array with dimensions (h, w, num_filters).
    - input is a 2d numpy array
    '''
    h, w = input.shape
    output = np.zeros((h - 2, w - 2, self.num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.sum(im_region * self.filters, axis=(1, 2))

    return output
iterate_regions() is a helper generator
method that yields all valid 3x3 image regions for us. This will be
useful for implementing the backwards portion of this class later on.
The line that actually performs the convolutions is the output[i, j] = np.sum(im_region * self.filters, axis=(1, 2)) assignment in forward(). Let’s break it down (a quick shape check follows this list):
We have im_region, a 3x3 array containing the relevant image region.
We have self.filters, a 3d array.
We do im_region * self.filters, which uses numpy’s broadcasting feature to element-wise multiply the two arrays. The result is a 3d array with the same dimension as self.filters.
We np.sum() the result of the previous step using axis=(1,2), which produces a 1d array of length num_filters where each element contains the convolution result for the corresponding filter.
We assign the result to output[i, j], which contains convolution results for pixel (i, j) in the output.
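Here is the quick shape check mentioned above, using random arrays with the same shapes as im_region and self.filters (the names are reused purely for clarity):

import numpy as np

im_region = np.random.randn(3, 3)   # one 3x3 image region
filters = np.random.randn(8, 3, 3)  # 8 filters, as in Conv3x3(8)

products = im_region * filters                # broadcasts to shape (8, 3, 3)
pixel_output = np.sum(products, axis=(1, 2))  # shape (8,): one value per filter
print(products.shape, pixel_output.shape)     # (8, 3, 3) (8,)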
The sequence above is performed for each pixel in the output until we
obtain our final output volume! Let’s give our code a test run:
cnn.py
import mnist
from conv import Conv3x3
# The mnist package handles the MNIST dataset for us!
# Learn more at https://github.com/datapythonista/mnist
train_images = mnist.train_images()
train_labels = mnist.train_labels()
conv = Conv3x3(8)
output = conv.forward(train_images[0])
print(output.shape)  # (26, 26, 8)
Looks good so far.
Note: in our Conv3x3 implementation, we assume the input is a 2d
numpy array for simplicity, because that’s how our MNIST images are
stored. This works for us because we use it as the first layer in our
network, but most CNNs have many more Conv layers. If we were building a
bigger network that needed to use Conv3x3 multiple times, we’d have to make the input be a 3d numpy array.
4. Pooling
Neighboring pixels in images tend to have similar values, so conv
layers will typically also produce similar values for neighboring pixels
in outputs. As a result, much of the information contained in a conv layer’s output is redundant.
For example, if we use an edge-detecting filter and find a strong edge
at a certain location, chances are that we’ll also find relatively
strong edges at locations 1 pixel shifted from the original one.
However, these are all the same edge! We’re not finding anything new.
Pooling layers solve this problem. All they do is reduce the size of the input they’re given by (you guessed it) pooling values together in the input. The pooling is usually done by a simple operation like max, min, or average. Here’s an example of a Max Pooling layer with a pooling size of 2:
To perform max pooling, we traverse the input image in 2x2 blocks (because pool size = 2) and put the max value into the output image at the corresponding pixel. That’s it! Pooling divides the input’s width and height by the pool size.
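As a quick illustration (separate from the MaxPool2 class we'll implement next), max pooling a 4x4 array with a pool size of 2 can be written with a reshape trick in numpy; the example values are arbitrary, and the trick only works when the height and width are divisible by the pool size:

import numpy as np

x = np.array([
  [1, 2, 5, 6],
  [3, 4, 7, 8],
  [9, 7, 1, 0],
  [8, 6, 2, 3],
])

# Split the 4x4 array into 2x2 blocks and take each block's max.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 8]
#  [9 3]]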
For our MNIST CNN, we’ll place a Max Pooling layer with a pool size of 2
right after our initial conv layer. The pooling layer will transform a
26x26x8 input into a 13x13x8 output:
4.1 Implementing Pooling
We’ll implement a MaxPool2 class with the same methods as our conv class from the previous section:
maxpool.py
import numpy as np
class MaxPool2:
  # A Max Pooling layer using a pool size of 2.

  def iterate_regions(self, image):
    '''
    Generates non-overlapping 2x2 image regions to pool over.
    - image is a 2d numpy array
    '''
    h, w, _ = image.shape
    new_h = h // 2
    new_w = w // 2

    for i in range(new_h):
      for j in range(new_w):
        im_region = image[(i * 2):(i * 2 + 2), (j * 2):(j * 2 + 2)]
        yield im_region, i, j

  def forward(self, input):
    '''
    Performs a forward pass of the maxpool layer using the given input.
    Returns a 3d numpy array with dimensions (h / 2, w / 2, num_filters).
    - input is a 3d numpy array with dimensions (h, w, num_filters)
    '''
    h, w, num_filters = input.shape
    output = np.zeros((h // 2, w // 2, num_filters))

    for im_region, i, j in self.iterate_regions(input):
      output[i, j] = np.amax(im_region, axis=(0, 1))

    return output
This class works similarly to the Conv3x3 class we implemented previously. The critical line is again the output[i, j] = np.amax(im_region, axis=(0, 1)) assignment: to find the max of a given image region, we use np.amax(), numpy’s array max method. We set axis=(0, 1) because we only want to maximize over the first two dimensions, height and width, and not the third, num_filters.
Let’s test it!
cnn.py
import mnist
from conv import Conv3x3
from maxpool import MaxPool2
# The mnist package handles the MNIST dataset for us!
# Learn more at https://github.com/datapythonista/mnist
train_images = mnist.train_images()
train_labels = mnist.train_labels()
conv = Conv3x3(8)
pool = MaxPool2()
output = conv.forward(train_images[0])
output = pool.forward(output)
print(output.shape)  # (13, 13, 8)
Our MNIST CNN is starting to come together!
5. Softmax
To complete our CNN, we need to give it the ability to actually make
predictions. We’ll do that by using the standard final layer for a
multiclass classification problem: the Softmax layer, a standard fully-connected (dense) layer that uses the softmax activation function.
Reminder: fully-connected layers have every node connected to every
output from the previous layer. We used fully-connected layers in my intro to Neural Networks if you need a refresher.
Softmax turns arbitrary real values into probabilities. The math behind it is pretty simple: given some numbers,
Raise e (the mathematical constant) to the power of each of those numbers.
Sum up all the exponentials (powers of e). This result is the denominator.
Use each number’s exponential as its numerator.
Probability = Numerator / Denominator.
Written more fancily, Softmax performs the following transform on n numbers x_1, …, x_n: s(x_i) = e^(x_i) / Σ_j e^(x_j), where the sum runs over j = 1, …, n.
The outputs of the Softmax transform are always in the range [0,1] and add up to 1. Hence, they’re probabilities.
Here’s a simple example using the numbers -1, 0, 3, and 5:
Denominator = e^(-1) + e^0 + e^3 + e^5 = 169.87

x   | e^x    | Probability (e^x / 169.87)
-1  | 0.368  | 0.002
0   | 1      | 0.006
3   | 20.09  | 0.118
5   | 148.41 | 0.874
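You can reproduce this table with a few lines of numpy; this snippet is just a check of the arithmetic above:

import numpy as np

xs = np.array([-1, 0, 3, 5])
exps = np.exp(xs)            # [0.368, 1, 20.09, 148.41]
denominator = np.sum(exps)   # about 169.87
probs = exps / denominator   # [0.002, 0.006, 0.118, 0.874]
print(denominator, probs, probs.sum())  # the probabilities sum to 1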
5.1 Usage
We’ll use a softmax layer with 10 nodes, one representing each digit,
as the final layer in our CNN. Each node in the layer will be connected
to every input. After the softmax transformation is applied, the digit represented by the node with the highest probability will be the output of the CNN!
5.2 Cross-Entropy Loss
You might have just thought to yourself, why bother transforming the outputs into probabilities? Won’t the highest output value always have the highest probability? If you did, you’re absolutely right. We don’t actually need to use softmax to predict a digit - we could just pick the digit with the highest output from the network!
What softmax really does is help us quantify how sure we are of our prediction, which is useful when training and evaluating our CNN. More specifically, using softmax lets us use cross-entropy loss, which takes into account how sure we are of each prediction. Here’s how we calculate cross-entropy loss: L = −ln(p_c)
where c is the correct class (in our case, the correct digit), p_c is the predicted probability for class c, and ln is the natural log. As always, a lower loss is better. For example, in the best case, we’d have p_c = 1, L = −ln(1) = 0.
In a more realistic case, we might have p_c = 0.8, L = −ln(0.8) = 0.223.
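As a quick check of those numbers (and of the random-guessing loss we'll see at the end of this post), here's the same formula in numpy:

import numpy as np

# Cross-entropy loss L = -ln(p_c) for a few values of p_c
for p_c in [1.0, 0.8, 0.1]:
  print(p_c, -np.log(p_c))  # 0.0, then about 0.223, then about 2.302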
We’ll be seeing cross-entropy loss again later on in this post, so keep it in mind!
5.3 Implementing Softmax
You know the drill by now - let’s implement a Softmax layer class:
softmax.py
import numpy as np
class Softmax:
  # A standard fully-connected layer with softmax activation.

  def __init__(self, input_len, nodes):
    # We divide by input_len to reduce the variance of our initial values
    self.weights = np.random.randn(input_len, nodes) / input_len
    self.biases = np.zeros(nodes)

  def forward(self, input):
    '''
    Performs a forward pass of the softmax layer using the given input.
    Returns a 1d numpy array containing the respective probability values.
    - input can be any array with any dimensions.
    '''
    input = input.flatten()

    input_len, nodes = self.weights.shape

    totals = np.dot(input, self.weights) + self.biases
    exp = np.exp(totals)
    return exp / np.sum(exp, axis=0)
There’s nothing too complicated here. A few highlights:
We flatten() the input to make it easier to work with, since we no longer need its shape.
np.dot() multiplies input by self.weights as a matrix product: for each output node, it multiplies the input values by that node’s weights element-wise and sums the results.
np.exp() calculates the exponentials used for Softmax.
We’ve now completed the entire forward pass of our CNN! Putting it together:
cnn.py
import mnist
import numpy as np
from conv import Conv3x3
from maxpool import MaxPool2
from softmax import Softmax
# We only use the first 1k testing examples (out of 10k total)
# in the interest of time. Feel free to change this if you want.
test_images = mnist.test_images()[:1000]
test_labels = mnist.test_labels()[:1000]

conv = Conv3x3(8)                  # 28x28x1 -> 26x26x8
pool = MaxPool2()                  # 26x26x8 -> 13x13x8
softmax = Softmax(13 * 13 * 8, 10) # 13x13x8 -> 10

def forward(image, label):
  '''
  Completes a forward pass of the CNN and calculates the accuracy and
  cross-entropy loss.
  - image is a 2d numpy array
  - label is a digit
  '''
  # We transform the image from [0, 255] to [-0.5, 0.5] to make it easier
  # to work with. This is standard practice.
  out = conv.forward((image / 255) - 0.5)
  out = pool.forward(out)
  out = softmax.forward(out)

  # Calculate cross-entropy loss and accuracy. np.log() is the natural log.
  loss = -np.log(out[label])
  acc = 1 if np.argmax(out) == label else 0

  return out, loss, acc

print('MNIST CNN initialized!')

loss = 0
num_correct = 0
for i, (im, label) in enumerate(zip(test_images, test_labels)):
  # Do a forward pass.
  _, l, acc = forward(im, label)
  loss += l
  num_correct += acc

  # Print stats every 100 steps.
  if i % 100 == 99:
    print(
      '[Step %d] Past 100 steps: Average Loss %.3f | Accuracy: %d%%' %
      (i + 1, loss / 100, num_correct)
    )
    loss = 0
    num_correct = 0
Running cnn.py gives us output similar to this:
MNIST CNN initialized!
[Step 100] Past 100 steps: Average Loss 2.302 | Accuracy: 11%
[Step 200] Past 100 steps: Average Loss 2.302 | Accuracy: 8%
[Step 300] Past 100 steps: Average Loss 2.302 | Accuracy: 3%
[Step 400] Past 100 steps: Average Loss 2.302 | Accuracy: 12%
This makes sense: with random weight initialization, you’d expect the
CNN to be only as good as random guessing. Random guessing would yield
10% accuracy (since there are 10 classes) and a cross-entropy loss of −ln(0.1)=2.302, which is what we get! Want to try or tinker with this code yourself? Run this CNN in your browser. It’s also available on Github.
6. Conclusion
That’s the end of this introduction to CNNs! In this post, we
Motivated why CNNs might be more useful for certain problems, like image classification.
Introduced the MNIST handwritten digit dataset.
Learned about Conv layers, which convolve filters with images to produce more useful outputs.
Talked about Pooling layers, which can help prune everything but the most useful features.
Implemented a Softmax layer so we could use cross-entropy loss.
There’s still much more that we haven’t covered yet, such as how to actually train a CNN. Part 2 of this CNN series will do a deep-dive on training a CNN, including deriving gradients and implementing backprop. Subscribe to my newsletter if you want to get an email when Part 2 comes out (soon)!
If you’re eager to see a trained CNN in action: this example Keras CNN trained on MNIST achieves 99.25% accuracy. CNNs are powerful!