The ILSVRC has seen a steep decline in the top-5 error rate of neural network architectures for Image Classification over the past few years.
Deep Learning models for Image Classification have achieved a dramatic drop in error rate over the last few years, and Deep Learning has since become the prime focus area for AI research. However, Deep Learning has been around for a few decades: Yann LeCun presented a paper pioneering Convolutional Neural Networks (CNNs) back in 1998. It wasn't until the start of the current decade, though, that Deep Learning really took off. The recent disruption can be attributed to increased processing power (GPUs), the availability of abundant data (notably the ImageNet dataset), and new algorithms and techniques. It all started in 2012 with
AlexNet, a large, deep Convolutional Neural Network which won the annual
ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). ILSVRC is a competition where research
teams evaluate their algorithms on the given data set and compete to
achieve higher accuracy on several visual recognition tasks.
Since then, variants of CNNs have dominated the ILSVRC and have
surpassed the level of human accuracy, which is considered to lie in the
5-10% error range.
For us as humans, it is very easy to understand the contents of an image.
For example, while watching a movie (like The Lord of the Rings) I need
to see just one example of a Dwarf to be able to identify other
dwarves without any effort. For a machine, however, the task is
extremely challenging because all it can see in an image is an array of
numbers. If the task is to identify a cat in an image, you can
appreciate the difficulty of finding a cat in this vast array of
numbers. Cats also come in all shapes, sizes, colors and poses, making
the task even more challenging.
How we see objects vs how a machine sees them
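To make this concrete, here is a minimal Python sketch (the file name "cat.jpg" and the use of the Pillow and NumPy libraries are illustrative assumptions, not something from the papers discussed here) showing that, to a machine, an image is nothing but a grid of numbers:

```python
from PIL import Image
import numpy as np

# "cat.jpg" is a hypothetical file path used only for illustration.
img = Image.open("cat.jpg").convert("RGB")

# To the machine the picture is just an array of integers:
# shape (height, width, 3), one 0-255 value per colour channel.
pixels = np.asarray(img)
print(pixels.shape)   # e.g. (480, 640, 3)
print(pixels[0, 0])   # the top-left pixel, e.g. [123  87  54]
```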
Based on our experience with Deep Learning for more than four years
now, we list some path-breaking research papers that are a
must-read for anyone working in computer vision. In this blog post
we focus specifically on image classification; following posts will
cover other areas such as object detection and localization.
We have also added our two cents about some upcoming algorithms which
have the potential to shape the future of computer vision research.
Path-breaking Research Papers on Image Classification
In ILSVRC 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton
presented AlexNet, a deep CNN. AlexNet clocked a 15.4% top-5 error rate,
bettering the second-best entry (26.2%) by more than 10 percentage points. This
impressive feat took the whole Computer Vision community by
storm and made Deep Learning and CNNs the disruptions they are today.
An illustration of the AlexNet architecture, explicitly showing the
delineation of responsibilities between the two GPUs: one GPU runs the
layer-parts at the top of the figure while the other runs the
layer-parts at the bottom.
This was the first time a model performed so well on the historically
difficult ImageNet dataset. AlexNet laid the foundation for modern Deep
Learning and is still one of the most highly cited Deep Learning papers,
with roughly 7,000 citations.
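For readers who want to experiment, here is a minimal sketch of running a pretrained AlexNet. It assumes PyTorch and torchvision (which the original paper obviously did not use) and a hypothetical image file "dog.jpg"; the normalization values are the standard ImageNet statistics.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing: resize, centre-crop, tensorize, normalise.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.alexnet(pretrained=True).eval()          # ImageNet-pretrained weights
img = preprocess(Image.open("dog.jpg").convert("RGB"))  # "dog.jpg" is a placeholder
with torch.no_grad():
    logits = model(img.unsqueeze(0))                    # add a batch dimension
print(logits.argmax(dim=1))                             # index of the predicted ImageNet class
```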
Matthew D. Zeiler (founder of Clarifai)
and Rob Fergus won the ILSVRC in 2013 with ZFNet, outperforming AlexNet by
reducing the error rate to 11.2%. ZFNet introduced a novel visualization
technique that gives insight into the function of intermediate feature
layers and the operation of the classifier, both of which were missing
in AlexNet.
Network architecture of ZFNet
ZFNet opened up the possibility of examining different feature
activations and their relation to the input space using a technique
called a Deconvolutional Network (DeconvNet).
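The actual DeconvNet requires a dedicated reconstruction path, but the underlying idea of peeking at intermediate feature activations can be sketched today with a simple forward hook in PyTorch. This is not ZFNet's method, just an illustrative way to inspect activations; the choice of model and layer below is arbitrary.

```python
import torch
from torchvision import models

model = models.vgg16(pretrained=True).eval()  # any pretrained CNN works for this sketch
activations = {}

def save_activation(name):
    # Forward hook: stores the output feature maps of the hooked layer.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Hook the first convolutional layer (an arbitrary, illustrative choice).
model.features[0].register_forward_hook(save_activation("conv1"))

x = torch.randn(1, 3, 224, 224)    # a random stand-in for a real image
with torch.no_grad():
    model(x)
print(activations["conv1"].shape)  # torch.Size([1, 64, 224, 224])
```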
Karen Simonyan and Andrew Zisserman of the University of Oxford
created a deep CNN that secured second place in the Image
Classification task of ILSVRC 2014. VGG Net showed that a significant
improvement over prior-art configurations can be achieved by
increasing the depth to 16-19 weight layers, substantially
deeper than anything used before.
Macro-architecture of VGG Net. Credits: Davi Fossard
The architecture was praised because it was much simpler to understand
than GoogLeNet (the winner of ILSVRC 2014) while still achieving
excellent accuracy. Its feature maps are now widely used in transfer
learning and in other algorithms that require pre-trained networks, such as
most GANs.
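A common transfer-learning pattern, sketched below with torchvision's pretrained VGG-16 (the framework and the 10-class head are illustrative assumptions, not part of the original paper): freeze the convolutional feature extractor and train only a new classifier on top.

```python
import torch.nn as nn
from torchvision import models

# Load VGG-16 pretrained on ImageNet and freeze its convolutional feature extractor.
vgg = models.vgg16(pretrained=True)
for param in vgg.features.parameters():
    param.requires_grad = False

# Replace the final classification layer with one for a new task
# (10 classes is an arbitrary, illustrative choice).
vgg.classifier[6] = nn.Linear(4096, 10)
# Only the new head (and any other unfrozen layers) would now be trained.
```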
The winners of ILSVRC 2014, Christian Szegedy et al., presented a
22-layer neural network called GoogLeNet. It is the original Inception model
and solidified Google's position in the Computer Vision space. GoogLeNet
clocked an error rate of 6.7%. The main hallmark of this architecture
is the improved utilization of the computing resources inside the
network, achieved by a carefully crafted design that allows
increasing the depth and width of the network while keeping the
computational budget constant. GoogLeNet introduced the concept of the
Inception module, in which not everything happens sequentially, as
in previous architectures; certain pieces of the
network run in parallel.
A schematic representation of GoogLeNet architecture with the highlighted box being the inception module.
Noticeably, GoogLeNet's error rate approached human performance (which lies in the 5-10% range). GoogLeNet
was one of the first models to show that CNN layers don't
always have to be stacked up sequentially. The Inception module demonstrated
that a creative and careful structuring of layers can improve both
performance and computational efficiency.
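A minimal sketch of such a module in PyTorch (an illustrative assumption; the post does not prescribe a framework): parallel 1x1, 3x3 and 5x5 convolutions plus a pooling branch, whose outputs are concatenated along the channel dimension. The channel counts below are chosen for illustration, not an exact GoogLeNet configuration.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Parallel branches whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(                      # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                      # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # All branches see the same input and run "in parallel";
        # their feature maps are stacked along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionBlock(192)(torch.randn(1, 192, 28, 28))
print(out.shape)  # torch.Size([1, 256, 28, 28]) -> 64 + 128 + 32 + 32 channels
```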
Microsoft’s ResNet, developed by Kaiming He, Xiangyu Zhang, Shaoqing
Ren and Jian Sun, is a residual learning framework to ease the training
of networks that are substantially deeper than those used previously.
The authors provided comprehensive empirical evidence showing that these
residual networks are easier to optimize, and can gain accuracy from
considerably increased depth.
A residual block in ResNet architecture.
ResNet surpassed human-level performance with an error rate of 3.57%,
using a new 152-layer network architecture that set records in
classification, detection, and localization with a single
architecture.
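The core idea fits in a few lines of code: a block learns a residual F(x) and adds the input back through a skip connection, so the output is F(x) + x. The sketch below (again assuming PyTorch, and simplified to an identity shortcut without the projection used when dimensions change) illustrates this.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified basic block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)  # skip connection: gradients flow through "+ x"
```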
Sergey Zagoruyko and Nikos Komodakis presented this paper in 2016
with a detailed experimental study of the architecture of ResNet blocks,
based on which they propose a novel architecture that decreases the
depth and increases the width of residual networks.
Increasing the width means using more feature maps in the residual layers. Although
common wisdom says that this might cause the network to overfit, it actually
works.
Various residual blocks used by the authors
The authors named the resulting network structures Wide Residual
Networks (WRNs) and showed that these are far superior to their
commonly used thin and very deep counterparts. A Wide ResNet can have
2-12x more feature maps than a ResNet in its convolutional
layers.
ResNeXt secured second place in ILSVRC 2016. It is a simple, highly
modularized network architecture for image classification. The ResNeXt
design results in a homogeneous, multi-branch architecture that has
only a few hyper-parameters to set.
A block of ResNeXt (right) compared to a block of ResNet (left)
This strategy exposes a new dimension, which the authors named
“cardinality” (the size of the set of transformations), as an essential
factor in addition to the dimensions of depth and width. Increasing
cardinality is more effective than going deeper or wider when the
capacity is increased. Thus, it fared better than both ResNets and Wide
ResNets in accuracy.
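In practice the aggregated parallel branches are implemented efficiently as a grouped convolution, where the `groups` argument plays the role of cardinality. Below is a hedged PyTorch sketch of a ResNeXt-style bottleneck; the channel sizes mirror the common 32x4d configuration but are chosen here for illustration.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Bottleneck with a grouped 3x3 conv; `cardinality` = number of parallel paths."""
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            # groups=cardinality splits the 3x3 conv into 32 parallel paths.
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.block(x) + x)  # residual connection as in ResNet
```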
Dense Convolutional Networks (DenseNets), developed by Gao Huang, Zhuang Liu,
Kilian Q. Weinberger and Laurens van der Maaten in 2016, connect each
layer to every other layer in a feed-forward fashion. For each layer,
the feature-maps of all preceding layers are used as inputs, and its own
feature-maps are used as inputs to all subsequent layers.
A 5-layer dense block. Each layer takes all preceding feature-maps as input.
DenseNets have several compelling advantages: they alleviate the
vanishing-gradient problem, strengthen feature propagation,
encourage feature reuse, and substantially reduce the number of
parameters. DenseNets outperformed ResNets whilst requiring less memory
and computation to achieve high performance.
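A minimal sketch of a dense block (again in PyTorch, with sizes chosen for illustration): each layer receives the concatenation of all preceding feature maps and contributes `growth_rate` new ones.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes all preceding feature maps (concatenated) as input."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Layer i sees in_channels + i * growth_rate input channels.
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(in_channels + i * growth_rate),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1, bias=False)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse every earlier feature map
            features.append(out)
        return torch.cat(features, dim=1)

out = DenseBlock(16)(torch.randn(1, 16, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32]) -> 16 + 4 * 12 channels
```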
New architectures with promising future potential
Variants of CNNs are likely to keep dominating Image Classification
architecture design. Attention modules and SENets are likely to become
more important in due course.
The winning entry of ILSVRC 2017, the Squeeze-and-Excitation Network (SENet),
is built around Squeeze, Excitation and Scaling operations. Rather than
introducing a new spatial dimension for the integration of feature channels,
SENet relies on a new "feature re-calibration" strategy.
A schematic representation of SENet model: Squeeze, Excitation and Scaling Operations
The authors explicitly model the interdependence between feature
channels. An SENet is trained to automatically learn the importance of
each feature channel and uses this importance to enhance useful features.
In the ILSVRC 2017 contest, the SENet model obtained an incredible 2.251%
top-5 error rate on the test set.
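A hedged PyTorch sketch of a Squeeze-and-Excitation block: global average pooling squeezes each channel to a single number, two small fully connected layers produce a per-channel weight, and the input feature maps are rescaled by those weights (the reduction ratio of 16 follows the paper's default; everything else is illustrative).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze (global pooling), Excitation (two FC layers), Scale (channel reweighting)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # squeeze: one value per channel
        self.excite = nn.Sequential(                 # excitation: learn channel importance
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # scale: re-calibrate the feature maps
```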
The Residual Attention Network is a convolutional neural network that uses an
attention mechanism and can be combined with state-of-the-art feed-forward
network architectures in end-to-end training. Attention
residual learning is used to train very deep Residual Attention Networks
that can be easily scaled up to hundreds of layers; a small sketch follows the figure below.
Residual Attention Network classification illustration: selected images
showing that different features have different corresponding
attention masks. The sky mask diminishes
low-level background blue-color features, while the balloon instance mask
highlights high-level features of the balloon's bottom part.
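The central formula of attention residual learning is simple: a soft mask M(x) in [0, 1] modulates the trunk features F(x), and the block outputs (1 + M(x)) * F(x), so attention re-weights features without ever erasing the trunk signal. The toy PyTorch sketch below uses deliberately minimal placeholder trunk and mask branches, not the paper's actual multi-scale mask branch.

```python
import torch
import torch.nn as nn

class AttentionResidual(nn.Module):
    """Toy attention-residual module: output = (1 + M(x)) * F(x)."""
    def __init__(self, channels):
        super().__init__()
        # Trunk branch: ordinary feature processing (placeholder depth).
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Mask branch: produces a soft attention map in [0, 1].
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        features = self.trunk(x)
        attention = self.mask(x)
        return (1 + attention) * features  # attention residual learning
```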
The Path Forward
Credits: Waitbutwhy
Today, the processing power of a computer you can buy for $1000 is
roughly 1/1000th of the capacity of the human brain. If Moore's law holds, we will
reach the computing power of a single human brain by 2025 and of all of humanity by
2050. AI's effectiveness will only accelerate with time. With the
availability of data and processing power no longer holding
researchers back, we can expect the accuracy of Deep Learning
models used for Image Classification to keep improving in due
course. As a premier applied AI research group, we are here to be a part
of this revolution.