Monday, September 25, 2017

An Intuitive Explanation of Convolutional Neural Networks

https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

What are Convolutional Neural Networks and why are they important?

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, apart from powering vision in robots and self-driving cars.
Figure 1: Source [1]
In Figure 1 above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while Figure 2 shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.
Figure 2: Source [2]
ConvNets, therefore, are an important tool for most machine learning practitioners today. However, understanding ConvNets and learning to use them for the first time can sometimes be an intimidating experience. The primary purpose of this blog post is to develop an understanding of how Convolutional Neural Networks work on images.
If you are new to neural networks in general, I would recommend reading this short tutorial on Multi Layer Perceptrons to get an idea about how they work, before proceeding. Multi Layer Perceptrons are referred to as “Fully Connected Layers” in this post.

The LeNet Architecture (1990s)

LeNet was one of the very first convolutional neural networks which helped propel the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988 [3]. At that time the LeNet architecture was used mainly for character recognition tasks such as reading zip codes, digits, etc.
Below, we will develop an intuition of how the LeNet architecture learns to recognize images. Several new architectures have been proposed in recent years that improve upon LeNet, but they all use its main concepts and are easier to understand if you have a clear understanding of the original.
Figure 3: A simple ConvNet. Source [5]
The Convolutional Neural Network in Figure 3 is similar in architecture to the original LeNet and classifies an input image into four categories: dog, cat, boat or bird (the original LeNet was used mainly for character recognition tasks). As evident from the figure above, on receiving a boat image as input, the network correctly assigns the highest probability for boat (0.94) among all four categories. The sum of all probabilities in the output layer should be one (explained later in this post).
There are four main operations in the ConvNet shown in Figure 3 above:
  1. Convolution
  2. Non Linearity (ReLU)
  3. Pooling or Sub Sampling
  4. Classification (Fully Connected Layer)
These operations are the basic building blocks of every Convolutional Neural Network, so understanding how these work is an important step to developing a sound understanding of ConvNets. We will try to understand the intuition behind each of these operations below.

An Image is a matrix of pixel values

Essentially, every image can be represented as a matrix of pixel values.
Figure 4: Every image is a matrix of pixel values. Source [6]
Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels – red, green and blue – you can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255.
A grayscale image, on the other hand, has just one channel. For the purpose of this post, we will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 – zero indicating black and 255 indicating white.
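As a small illustration (a minimal sketch; the specific pixel values and the use of NumPy here are my own additions, not part of the original post), a grayscale image is just a 2d array of values between 0 and 255:

```python
import numpy as np

# A tiny 4 x 4 "grayscale image": a single 2d matrix of 8-bit values,
# where 0 is black and 255 is white.
image = np.array([
    [  0,  50, 120, 255],
    [ 30,  80, 200, 255],
    [ 60, 100, 220, 255],
    [ 90, 140, 240, 255],
], dtype=np.uint8)

print(image.shape)               # (4, 4) -> height x width, one channel
print(image[0, 0], image[0, 3])  # 0 (black) and 255 (white)
```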

The Convolution Step

ConvNets derive their name from the “convolution” operator. The primary purpose of Convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.
As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255, the green matrix below is a special case where pixel values are only 0 and 1):
Also, consider another 3 x 3 matrix as shown below:
Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation in Figure 5 below:
Figure 5: The Convolution operation. The output matrix is called Convolved Feature or Feature Map. Source [7]
Take a moment to understand how the computation above is being done. We slide the orange matrix over our original image (green) by 1 pixel (also called ‘stride’) and for every position, we compute element wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink). Note that the 3×3 matrix “sees” only a part of the input image in each stride.
In CNN terminology, the 3×3 matrix is called a ‘filter’ or ‘kernel’ or ‘feature detector’, and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’ or ‘Activation Map’ or the ‘Feature Map’. It is important to note that the filter acts as a feature detector on the original input image.
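To make the sliding-window computation concrete, here is a minimal NumPy sketch of the operation described above (strictly speaking a cross-correlation, which is what CNN libraries compute). The 5 x 5 binary image and 3 x 3 filter values are illustrative stand-ins for the matrices in Figure 5, not copied from it:

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; at each position, multiply element-wise
    and sum, producing one element of the Feature Map (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow), dtype=int)
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = int(np.sum(patch * kernel))
    return out

image = np.array([[1, 1, 1, 0, 0],      # 5 x 5 image with 0/1 pixel values
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],           # 3 x 3 filter / feature detector
                   [0, 1, 0],
                   [1, 0, 1]])

print(convolve2d(image, kernel))        # a 3 x 3 Convolved Feature (Feature Map)
```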
It is evident from the animation above that different values of the filter matrix will produce different Feature Maps for the same input image. As an example, consider the following input image:
In the table below, we can see the effects of convolution of the above image with different filters. As shown, we can perform operations such as Edge Detection, Sharpen and Blur just by changing the numeric values of our filter matrix before the convolution operation [8] – this means that different filters can detect different features from an image, for example edges, curves etc. More such examples are available in Section 8.2.4 here.
Table: the image above convolved with different filters (Edge Detection, Sharpen, Blur)
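For reference, these are the standard 3 x 3 kernels commonly quoted for such effects (as in [8]); they can be applied with the convolve2d sketch above. The exact values used in the original table may differ:

```python
import numpy as np

edge_detect = np.array([[-1, -1, -1],   # strong response where intensity changes sharply
                        [-1,  8, -1],
                        [-1, -1, -1]])

sharpen = np.array([[ 0, -1,  0],       # boosts the center pixel relative to its neighbors
                    [-1,  5, -1],
                    [ 0, -1,  0]])

box_blur = np.ones((3, 3)) / 9.0        # averages each pixel with its neighbors

# e.g. edges = convolve2d(gray_image, edge_detect)
```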
Another good way to understand the Convolution operation is by looking at the animation in Figure 6 below:
Figure 6: The Convolution Operation. Source [9]
A filter (with red outline) slides over the input image (convolution operation) to produce a feature map. The convolution of another filter (with the green outline), over the same image gives a different feature map as shown. It is important to note that the Convolution operation captures the local dependencies in the original image. Also notice how these two different filters generate different feature maps from the same original image. Remember that the image and the two filters above are just numeric matrices as we have discussed above.
In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as the number of filters, filter size, architecture of the network etc. before training). The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
The size of the Feature Map (Convolved Feature) is controlled by three parameters [4] that we need to decide before the convolution step is performed:
  • Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown in Figure 7, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. You can think of these three feature maps as stacked 2d matrices, so, the ‘depth’ of the feature map would be three.
Figure 7
  • Stride: Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.
  • Zero-padding: Sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of our input image matrix. A nice feature of zero padding is that it allows us to control the size of the feature maps. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution. This has been explained clearly in [14]. (A small sketch for computing the resulting feature map size follows this list.)
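The three parameters above determine the size of the Feature Map via the standard formula (input − filter + 2 × padding) / stride + 1 (see [4]). A minimal sketch (the function name is my own):

```python
def feature_map_size(input_size, filter_size, padding=0, stride=1):
    """Width/height of the Feature Map for a given input, filter, padding and stride."""
    span = input_size - filter_size + 2 * padding
    assert span % stride == 0, "filter does not tile the (padded) input evenly"
    return span // stride + 1

print(feature_map_size(5, 3))                       # 3  -> the 3 x 3 map in Figure 5
print(feature_map_size(32, 5))                      # 28 -> used in the MNIST example later
print(feature_map_size(5, 3, padding=1, stride=1))  # 5  -> zero-padding keeps the size
```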

Introducing Non Linearity (ReLU)

An additional operation called ReLU has been used after every Convolution operation in Figure 3 above. ReLU stands for Rectified Linear Unit and is a non-linear operation. Its output is given by:
Output = max(0, Input)
Figure 8: the ReLU operation
ReLU is an element wise operation (applied per pixel) and replaces all negative pixel values in the feature map by zero. The purpose of ReLU is to introduce non-linearity in our ConvNet, since most of the real-world data we would want our ConvNet to learn would be non-linear (Convolution is a linear operation – element wise matrix multiplication and addition, so we account for non-linearity by introducing a non-linear function like ReLU).
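A minimal sketch of the element-wise ReLU described above (the feature-map values below are made up for illustration):

```python
import numpy as np

def relu(feature_map):
    """Element-wise ReLU: every negative value becomes zero, positives pass through."""
    return np.maximum(feature_map, 0)

fm = np.array([[ 15, -10,  20],
               [ -3,   8,  -1],
               [  5,  -7,  12]])
print(relu(fm))
# [[15  0 20]
#  [ 0  8  0]
#  [ 5  0 12]]
```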
The ReLU operation can be understood clearly from Figure 9 below. It shows the ReLU operation applied to one of the feature maps obtained in Figure 6 above. The output feature map here is also referred to as the ‘Rectified’ feature map.
Figure 9: ReLU operation. Source [10]
Other non linear functions such as tanh or sigmoid can also be used instead of ReLU, but ReLU has been found to perform better in most situations.

The Pooling Step

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.
In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.
Figure 10 shows an example of Max Pooling operation on a Rectified Feature map (obtained after convolution + ReLU operation) by using a 2×2 window.
Figure 10: Max Pooling. Source [4]
We slide our 2 x 2 window by 2 cells (also called ‘stride’) and take the maximum value in each region. As shown in Figure 10, this reduces the dimensionality of our feature map.
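A minimal sketch of 2 x 2 max pooling with stride 2, as in Figure 10 (the input values here are illustrative, not taken from the figure):

```python
import numpy as np

def max_pool(feature_map, window=2, stride=2):
    """Keep only the largest value inside each window as it slides with the given stride."""
    h, w = feature_map.shape
    oh, ow = (h - window) // stride + 1, (w - window) // stride + 1
    out = np.zeros((oh, ow), dtype=feature_map.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = feature_map[i * stride:i * stride + window,
                                    j * stride:j * stride + window].max()
    return out

rectified = np.array([[1, 1, 2, 4],
                      [5, 6, 7, 8],
                      [3, 2, 1, 0],
                      [1, 2, 3, 4]])
print(max_pool(rectified))  # [[6 8]
                            #  [3 4]] -> dimensionality halved in each direction
```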
In the network shown in Figure 11, pooling operation is applied separately to each feature map (notice that, due to this, we get three output maps from three input maps).
Figure 11: Pooling applied to Rectified Feature Maps
Figure 12 shows the effect of Pooling on the Rectified Feature Map we received after the ReLU operation in Figure 9 above.
Figure 12: Pooling. Source [10]
The function of Pooling is to progressively reduce the spatial size of the input representation [4]. In particular, pooling
  • makes the input representations (feature dimension) smaller and more manageable
  • reduces the number of parameters and computations in the network, therefore, controlling overfitting [4]
  • makes the network invariant to small transformations, distortions and translations in the input image (a small distortion in input will not change the output of Pooling – since we take the maximum / average value in a local neighborhood).
  • helps us arrive at an almost scale invariant representation of our image (the exact term is “equivariant”). This is very powerful since we can detect objects in an image no matter where they are located (read [18] and [19] for details).

Story so far

Figure 13
So far we have seen how Convolution, ReLU and Pooling work. It is important to understand that these layers are the basic building blocks of any CNN. As shown in Figure 13, we have two sets of Convolution, ReLU & Pooling layers – the 2nd Convolution layer performs convolution on the output of the first Pooling Layer using six filters to produce a total of six feature maps. ReLU is then applied individually on all of these six feature maps. We then perform Max Pooling operation separately on each of the six rectified feature maps.
Together these layers extract the useful features from the images, introduce non-linearity in our network and reduce feature dimension while aiming to make the features somewhat equivariant to scale and translation [18].
The output of the 2nd Pooling Layer acts as an input to the Fully Connected Layer, which we will discuss in the next section.

Fully Connected Layer

The Fully Connected layer is a traditional Multi Layer Perceptron that uses a softmax activation function in the output layer (other classifiers like SVM can also be used, but we will stick to softmax in this post). The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. I recommend reading this post if you are unfamiliar with Multi Layer Perceptrons.
The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset. For example, the image classification task we set out to perform has four possible outputs as shown in Figure 14 below (note that Figure 14 does not show connections between the nodes in the fully connected layer).
Figure 14: Fully Connected Layer (each node is connected to every node in the adjacent layer)
Apart from classification, adding a fully-connected layer is also a (usually) cheap way of learning non-linear combinations of these features. Most of the features from convolutional and pooling layers may be good for the classification task, but combinations of those features might be even better [11].
The sum of output probabilities from the Fully Connected Layer is 1. This is ensured by using the Softmax as the activation function in the output layer of the Fully Connected Layer. The Softmax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
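A minimal sketch of the Softmax function described above (the class scores are made-up numbers for the dog / cat / boat / bird example):

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities between 0 and 1 that sum to 1."""
    exps = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exps / np.sum(exps)

scores = np.array([1.0, 0.5, 4.0, 0.2])   # dog, cat, boat, bird (illustrative)
probs = softmax(scores)
print(probs.round(2))   # boat gets by far the highest probability
print(probs.sum())      # 1.0
```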

Putting it all together – Training using Backpropagation

As discussed above, the Convolution + Pooling layers act as Feature Extractors from the input image while the Fully Connected layer acts as a classifier.
Note that in Figure 15 below, since the input image is a boat, the target probability is 1 for the Boat class and 0 for the other three classes, i.e.
  • Input Image = Boat
  • Target Vector = [0, 0, 1, 0]
Figure 15: Training the ConvNet
The overall training process of the Convolution Network may be summarized as below:
  • Step 1: We initialize all filters and parameters / weights with random values.
  • Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
    • Let's say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3].
    • Since weights are randomly assigned for the first training example, output probabilities are also random.
  • Step 3: Calculate the total error at the output layer (summation over all 4 classes; a worked example of this computation follows the list).
    • Total Error = ∑ ½ (target probability – output probability)²
  • Step 4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
    • The weights are adjusted in proportion to their contribution to the total error.
    • When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
    • This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.
    • Parameters like the number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during the training process – only the values of the filter matrix and connection weights get updated.
  • Step 5: Repeat steps 2-4 with all images in the training set.
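As a worked example of Step 3 (and of how the error shrinks after Step 4), using the probabilities quoted in the list above:

```python
import numpy as np

target = np.array([0.0, 0.0, 1.0, 0.0])       # Boat
first_pass = np.array([0.2, 0.4, 0.1, 0.3])   # random-ish output from Step 2
updated = np.array([0.1, 0.1, 0.7, 0.1])      # output after the weight updates in Step 4

def total_error(target, output):
    """Total Error = sum over classes of 1/2 * (target probability - output probability)^2."""
    return np.sum(0.5 * (target - output) ** 2)

print(total_error(target, first_pass))  # 0.55
print(total_error(target, updated))     # 0.06 -> the error has dropped after training
```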
The above steps train the ConvNet – this essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.
When a new (unseen) image is input into the ConvNet, the network would go through the forward propagation step and output a probability for each class (for a new image, the output probabilities are calculated using the weights which have been optimized to correctly classify all the previous training examples). If our training set is large enough, the network will (hopefully) generalize well to new images and classify them into correct categories.
Note 1: The steps above have been oversimplified and mathematical details have been avoided to provide intuition into the training process. See [4] and [12] for a mathematical formulation and thorough understanding.
Note 2: In the example above we used two sets of alternating Convolution and Pooling layers. Please note, however, that these operations can be repeated any number of times in a single ConvNet. In fact, some of the best performing ConvNets today have tens of Convolution and Pooling layers! Also, it is not necessary to have a Pooling layer after every Convolutional Layer. As can be seen in Figure 16 below, we can have multiple Convolution + ReLU operations in succession before having a Pooling operation. Also notice how each layer of the ConvNet is visualized in Figure 16.
Figure 16: Source [4]

Visualizing Convolutional Neural Networks

In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize. For example, in Image Classification a ConvNet may learn to detect edges from raw pixels in the first layer, then use the edges to detect simple shapes in the second layer, and then use these shapes to detect higher-level features, such as facial shapes, in higher layers [14]. This is demonstrated in Figure 17 below – these features were learnt using a Convolutional Deep Belief Network and the figure is included here just to demonstrate the idea (this is only an example: real-life convolution filters may detect objects that have no meaning to humans).
Figure 17: Learned features from a Convolutional Deep Belief Network. Source [21]
Adam Harley created amazing visualizations of a Convolutional Neural Network trained on the MNIST Database of handwritten digits [13]. I highly recommend playing around with it to understand details of how a CNN works.
We will see below how the network works for an input ‘8’. Note that the visualization in Figure 18 does not show the ReLU operation separately.
Figure 18: Visualizing a ConvNet trained on handwritten digits. Source [13]
The input image contains 1024 pixels (32 x 32 image) and the first Convolution layer (Convolution Layer 1) is formed by convolution of six unique 5 × 5 (stride 1) filters with the input image. As seen, using six different filters produces a feature map of depth six.
Convolutional Layer 1 is followed by Pooling Layer 1 that does 2 × 2 max pooling (with stride 2) separately over the six feature maps in Convolution Layer 1. You can move your mouse pointer over any pixel in the Pooling Layer and observe the 2 x 2 grid it forms in the previous Convolution Layer (demonstrated in Figure 19). You’ll notice that the pixel having the maximum value (the brightest one) in the 2 x 2 grid makes it to the Pooling layer.
Figure 19: Visualizing the Pooling Operation. Source [13]
Pooling Layer 1 is followed by sixteen 5 × 5 (stride 1) convolutional filters that perform the convolution operation. This is followed by Pooling Layer 2 that does 2 × 2 max pooling (with stride 2). These two layers use the same concepts as described above.
We then have three fully-connected (FC) layers. There are:
  • 120 neurons in the first FC layer
  • 100 neurons in the second FC layer
  • 10 neurons in the third FC layer corresponding to the 10 digits – also called the Output layer
Notice how in Figure 20, each of the 10 nodes in the output layer is connected to all 100 nodes in the 2nd Fully Connected layer (hence the name Fully Connected).
Also, note how the only bright node in the Output Layer corresponds to ‘8’ – this means that the network correctly classifies our handwritten digit (brighter node denotes that the output from it is higher, i.e. 8 has the highest probability among all other digits).
Figure 20: Visualizing the Fully Connected Layers. Source [13]
The 3d version of the same visualization is available here.
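The whole network described above (six and sixteen 5 x 5 filters, 2 x 2 max pooling, and fully connected layers of 120, 100 and 10 neurons) can be written down compactly. Below is a minimal tf.keras sketch of that architecture; the ReLU activations and other training details are assumptions for illustration, not taken from Harley's demo:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, (5, 5), strides=1, activation="relu",
                           input_shape=(32, 32, 1)),                   # Convolution Layer 1
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),                   # Pooling Layer 1
    tf.keras.layers.Conv2D(16, (5, 5), strides=1, activation="relu"),  # Convolution Layer 2
    tf.keras.layers.MaxPooling2D((2, 2), strides=2),                   # Pooling Layer 2
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),                     # 1st FC layer
    tf.keras.layers.Dense(100, activation="relu"),                     # 2nd FC layer
    tf.keras.layers.Dense(10, activation="softmax"),                   # Output layer (10 digits)
])
model.summary()   # feature maps shrink 32x32 -> 28x28 -> 14x14 -> 10x10 -> 5x5
```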

Other ConvNet Architectures

Convolutional Neural Networks have been around since the early 1990s. We discussed LeNet above, which was one of the very first convolutional neural networks. Some other influential architectures are listed below [3] [4].
  • LeNet (1990s): Already covered in this article.
  • 1990s to 2012: From the late 1990s to the early 2010s, convolutional neural networks were in incubation. As more and more data and computing power became available, the tasks that convolutional neural networks could tackle became more and more interesting.
  • AlexNet (2012) – In 2012, Alex Krizhevsky (and others) released AlexNet, which was a deeper and much wider version of the LeNet and won the difficult ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a large margin. It was a significant breakthrough with respect to previous approaches, and the current widespread application of CNNs can be attributed to this work.
  • ZF Net (2013) – The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters.
  • GoogLeNet (2014) – The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M).
  • VGGNet (2014) – The runner-up in ILSVRC 2014 was the network that became known as the VGGNet. Its main contribution was in showing that the depth of the network (number of layers) is a critical component for good performance.
  • ResNets (2015) – The Residual Network developed by Kaiming He (and others) was the winner of ILSVRC 2015. ResNets are currently by far the state-of-the-art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 2016).
  • DenseNet (August 2016) – Recently published by Gao Huang (and others), the Densely Connected Convolutional Network has each layer directly connected to every other layer in a feed-forward fashion. The DenseNet has been shown to obtain significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. Check out the Torch implementation here.

Conclusion

In this post, I have tried to explain the main concepts behind Convolutional Neural Networks in simple terms. There are several details I have oversimplified / skipped, but hopefully this post gave you some intuition around how they work.
This post was originally inspired by Understanding Convolutional Neural Networks for NLP by Denny Britz (which I would recommend reading), and a number of explanations here are based on that post. For a more thorough understanding of some of these concepts, I would encourage you to go through the notes from Stanford’s course on ConvNets as well as other excellent resources mentioned under References below. If you face any issues understanding any of the above concepts or have questions / suggestions, feel free to leave a comment below.
All images and animations used in this post belong to their respective authors as listed in References section below.

References

  1. karpathy/neuraltalk2: Efficient Image Captioning code in Torch, Examples
  2. Shaoqing Ren, et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, 2015, arXiv:1506.01497 
  3. Neural Network Architectures, Eugenio Culurciello’s blog
  4. CS231n Convolutional Neural Networks for Visual Recognition, Stanford
  5. Clarifai / Technology
  6. Machine Learning is Fun! Part 3: Deep Learning and Convolutional Neural Networks
  7. Feature extraction using convolution, Stanford
  8. Wikipedia article on Kernel (image processing) 
  9. Deep Learning Methods for Vision, CVPR 2012 Tutorial 
  10. Neural Networks by Rob Fergus, Machine Learning Summer School 2015
  11. What do the fully connected layers do in CNNs? 
  12. Convolutional Neural Networks, Andrew Gibiansky 
  13. A. W. Harley, “An Interactive Node-Link Visualization of Convolutional Neural Networks,” in ISVC, pages 867-877, 2015 (link). Demo
  14. Understanding Convolutional Neural Networks for NLP
  15. Backpropagation in Convolutional Neural Networks
  16. A Beginner’s Guide To Understanding Convolutional Neural Networks
  17. Vincent Dumoulin, et al, “A guide to convolution arithmetic for deep learning”, 2015, arXiv:1603.07285
  18. What is the difference between deep learning and usual machine learning?
  19. How is a convolutional neural network able to learn invariant features?
  20. A Taxonomy of Deep Convolutional Neural Nets for Computer Vision
  21. Honglak Lee, et al, “Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations” (link)


Wednesday, September 20, 2017

Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data

https://becominghuman.ai/cheat-sheets-for-ai-neural-networks-machine-learning-deep-learning-big-data-678c51b4b463

Cheat Sheets for AI, Neural Networks, Machine Learning, Deep Learning & Big Data

The Most Complete List of Best AI Cheat Sheets

Over the past few months, I have been collecting AI cheat sheets. From time to time I share them with friends and colleagues and recently I have been getting asked a lot, so I decided to organize and share the entire collection. To make things more interesting and give context, I added descriptions and/or excerpts for each major topic.
This is the most complete list and the Big-O is at the very end, enjoy…

Wednesday, September 6, 2017

Tutorial on Hardware Architectures for Deep Neural Networks

Hardware like this seems like a good fit for PME: "Tutorial on Hardware Architectures for Deep Neural Networks http://eyeriss.mit.edu/tutorial.html"

Sunday, September 3, 2017

Ex-Baidu Scientist Blazes AI Shortcut

http://www.eetimes.com/document.asp?doc_id=1332226


Native support for 3D tensor operation
8/31/2017 05:31 PM EDT
MADISON, Wis. — Ren Wu, formerly a distinguished scientist at Baidu, has pulled a new AI chip company out of his sleeve, called NovuMind, based in Santa Clara, Calif.
In an exclusive interview with EE Times, Wu discussed the startup’s developments and what he hopes to accomplish.
Established two years ago, with 50 people, including 35 engineers working in the U.S. and 15 in Beijing, NovuMind is testing what Wu describes as a minimalist approach to deep learning.
Rather than designing general-purpose deep-learning chips like those based on Nvidia GPUs or Cadence DSPs, NovuMind has focused exclusively on developing a deep learning accelerator chip that “will do inference very efficiently,” Wu told us.
NovuMind has designed an AI chip that uses only very small (3x3) convolution filters.
This approach might seem counterintuitive at a time when the pace of artificial intelligence has accelerated almost dizzyingly. Indeed, many competitors concerned with yet-to-emerge AI algorithms have set their sights on chips that are as programmable and powerful as possible.
In contrast, NovuMind is concentrating on “only the core of the neural network that is not likely to change,” said Wu. He explained that 5x5 convolution can be done by stacking two 3x3 filters with less computation, and 7x7 is possible by stacking three. “So, why bother with those other filters?”
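A quick back-of-the-envelope check of Wu's point (counting weights per input-output channel pair and ignoring biases; my own arithmetic, not NovuMind's):

```python
# Stacked 3x3 convolutions cover the same receptive field as one larger filter
# but need fewer weights (per input/output channel pair, biases ignored).
single_5x5 = 5 * 5          # 25 weights, 5x5 receptive field
two_3x3    = 2 * 3 * 3      # 18 weights, same 5x5 receptive field
single_7x7 = 7 * 7          # 49 weights, 7x7 receptive field
three_3x3  = 3 * 3 * 3      # 27 weights, same 7x7 receptive field
print(two_3x3 / single_5x5, three_3x3 / single_7x7)  # ~0.72 and ~0.55 of the weights
```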
The biggest problem with architectures like DSP and GPU in deep-learning accelerators on edge devices is “the very low utilization” of their processors, Wu said. NovuMind solves “this efficiency issue by using unique tensor processing architecture.”
Wu calls NovuMind’s idea — focused on the minimum set of convolutions in a neural network — “aggressive thinking.”  He said the mission of his new chip is to embed power-efficient AI everywhere.
The company’s first AI chip — designed for prototyping — is expected to be taped out before Christmas. Wu said that by February next year he expects applications to be up and running on a chip delivering 15 tera operations per second (ToPS) at under 5 watts.
A second chip, designed to run under a watt, is due in mid-2018, he added.
NovuMind's new chip will support TensorFlow, Caffe and Torch models natively.
The endgame of Wu’s AI chip is to enable a tiny Internet-connected “edge” device to not only “see” but “think” (and recognize what it sees), without hogging the bandwidth going back to the data center. Wu calls it the Intelligent Internet of Things (I²oT).
Ren Wu
For Wu, who hasn’t sought much publicity in the last few years, NovuMind presents, in a way, an opportunity for redemption.
Two years ago, Wu was let go by Baidu, after the Chinese search giant was disqualified from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. Wu subsequently denied wrongdoing in what was then labeled as “Machine learning’s first cheating scandal.”
Speaking with EE Times, he declined to discuss that event, other than noting, “I think I was set up.”
In today’s hotly pursued market of deep-learning accelerators for edge devices, NovuMind is forging ahead. After raising $15.2 million in series A funding in December 2016, NovuMind is about to begin a second round of fundraising, said Wu. “That’s why I am in Beijing now,” he told me during a phone interview.



3D tensor operation
As Wu tells us, the key to deep-learning acceleration, especially on edge devices, is to maximize efficiency while minimizing latency. Naturally, many edge devices are constrained by cost and battery life. Latency has no place in drones and autonomous vehicles, since they must be able to recognize imminent danger without delay.
Against that backdrop, Wu noted two existing solutions currently available for deep-learning acceleration in edge devices: DSP (such as CEVA and Tensilica) and GPU (such as Nvidia’s TX series).
As he explained, DSP was designed for digital filtering, using 1D multiplication-and-accumulation (MAC) to finish the task. The essence of GPU (and Tensor Processing Unit) operation is 2D general matrix multiplication (GEMM).
In Wu’s opinion, neither DSP nor GPU is efficient enough for deep-learning acceleration tasks. He explained that the state-of-the-art in deep-learning network model computation is 3D tensor operation. “Naturally, when you convert 3D tensor operation into 1D MAC operation (for DSP case) or 2D GEMM operation (for GPU case), you lose a lot of efficiency.”
Wu explained, “That’s why even though GPU and DSP claim high peak performance (~1-2 ToPS), its average performance when running a real deep learning network inference is only 20-30 percent of its peak performance in real time applications.”
He said much processing energy is wasted on memory access. On average, 70-80 percent of computation resources lie idle waiting for data from memory.
NovuMind uses what Wu described as “unique tensor processing architecture.”
In NovuMind’s chip architecture, 3D tensor operation is natively supported, he noted. This helps “greatly enhance efficiency in terms of both energy and silicon area.”
According to Wu, NovuMind’s architecture can achieve 75 to 90 percent of its peak performance in real applications.
Memory hierarchy
Wu claimed that NovuMind’s design “based on 3D tensor operation” has given its AI chip “a tremendous advantage.” He noted, “Because we work directly on 3D tensor and we don’t need to do the intermediate step to expand convolution into 2D matrix, we are able to save a lot of memory bandwidth, memory access energy.”



Trade-offs
Engineering is all about trade-offs. In pursuit of the power efficiency necessary for embedded AI, what did NovuMind have to give up in its AI chips?
Wu said, “We only support a limited set of topology, such as layers defined in VGG, RESNET network, and another small set of other network layers we think are important and relevant.”
He noted, “Our chip will compute these supported network layers very efficiently. It can still do other layers, but it is not as optimal.”
Asked about the downside, he described NovuMind’s AI chip as “less general.” If the network contains many unsupported layers, “its performance is no longer competitive,” he said. But Wu is confident. “We believe, with our strong AI team and in-house training capabilities, we have covered all important layers relevant to real-world applications.”
We also asked what convinced NovuMind that 3x3 filters are the way to go. Wu said, “I have to give credit to the original VGG paper and its authors.”
VGG is the Visual Geometry Group in the Department of Engineering Science at the University of Oxford. VGG researchers authored a paper entitled “Very Deep Convolutional Networks for Large-Scale Image Recognition” in 2015.
The VGG paper convinced Wu to map its architecture into hardware. Wu was surprised to find out how hardware-friendly it was. “This is one of the very rare cases where algorithm designers have come up with such an elegant and hardware-friendly design. Just beautiful,” he said. Wu believes that other practical, useful network topologies we see today are based on the work done by VGG.
Wu added, “Since 3x3 convolution is such an important building block, our design of course will make sure, do whatever we can, to make it as efficient as possible.”
Latency comparisons
Wu claims NovuMind's architecture also excels in latency compared to DSP and GPU.
He observed, “DSP is designed for stream data processing, and its latency is good.”
On the other hand, he noted, “GPU generally needs batch operation and its latency is poor (50-300 ms with batch size of 8-64),” making it difficult to meet real-time demands.
He explained that the NovuMind architecture also uses stream-mode data processing (latency < 3 ms). “We can imagine when an autonomous car drives at 65 mph and needs to brake at once, the latency advantage of NovuMind architecture over GPU translates into a range of 4.5-30 feet of distance.” He boasted, “This can make a big difference in an autonomous car.”
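A rough sanity check of that distance range (my own arithmetic based on the latencies quoted above):

```python
# Distance travelled at 65 mph during the quoted inference latencies.
speed_ftps = 65 * 5280 / 3600            # 65 mph is roughly 95.3 feet per second
for latency_ms in (3, 50, 300):          # NovuMind (< 3 ms) vs GPU batch mode (50-300 ms)
    print(latency_ms, "ms ->", round(speed_ftps * latency_ms / 1000, 1), "ft")
# 3 ms -> ~0.3 ft; 50-300 ms -> ~4.8-28.6 ft, consistent with the 4.5-30 ft range above.
```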
Roadmap
NovuMind’s first chip will be manufactured by an undisclosed foundry, using a 28nm process technology. The second chip — aimed at mid-2018 for tape-out — will be using a 16nm process technology, according to Wu.
Describing the first chip as produced for prototyping purposes, Wu posed several scenarios for its chip applications. One is a USB stick incorporating the NovuMind chip, thus making connected devices, such as connected cameras, AI-driven systems. Second, with its 15 ToPS of performance, the AI chip can be used in “autonomous cars,” Wu said. The third application, suggested by Wu, is using the AI chip for cloud acceleration.
GPUs used in data centers place limitations on rack space, Wu observed. Higher power dissipation — extra heat — coming from a GPU is the culprit. Although NovuMind’s AI chip is designed for edge devices, when put on a PCI board inside a server, its tiny package can efficiently run a single application such as speech recognition, which must be processed at the data center.
But really, what sort of AI applications are best for NovuMind’s AI chip? Is NovuMind saying that its AI chip would be ideal for, say, pathfinding in autonomous driving?





Wu said no. A centralized computing unit in an autonomous vehicle today would be “a lot more complicated than anybody imagines,” he explained. In reality, he expects a multiple number of AI chips to pre-process the data and feed it to a central unit that supposedly makes smart decisions. NovuMind’s AI chip will be one of the many AI chips inside an autonomous car, he explained.
Thus far, Wu said he knows the company’s AI chip can run an application such as “city/nation scale, multi-string, multi-target face recognition.” With its ability to bring in and handle 128 HD video streams, a system powered with this chip can recognize millions of targeted people across 100,000 connected cameras, for example. More important, “We can do it on the edge, with no substantial bandwidth, storage space and set up required for connected cameras,” he explained.
Adding intuitions to sensors
Asked about the future of deep learning, Wu said, “Armed with big data and massive computational power in our hands, we have been able to train neural networks to do many sophisticated things.” That's where the AI community is at today.
But what NovuMind hopes to enable, explained Wu, is to add “intuition” to sensors. Just like humans and animals are equipped with five senses, machines should be able to have certain “instincts” that help them react.
When it comes to general intelligence, reasoning and long-term memory for machines, though, Wu said, “We still have a long way to go.”
— Junko Yoshida, Chief International Correspondent, EE Times




Artificial Intelligence Analyzes Gravitational Lenses 10 Million Times Faster

https://www6.slac.stanford.edu/news/2017-08-30-artificial-intelligence-analyzes-gravitational-lenses-10-million-times-faster.aspx

SLAC and Stanford researchers demonstrate that brain-mimicking ‘neural networks’ can revolutionize the way astrophysicists analyze their most complex data, including extreme distortions in spacetime that are crucial for our understanding of the universe.
August 30, 2017
Menlo Park, Calif. — Researchers from the Department of Energy’s SLAC National Accelerator Laboratory and Stanford University have for the first time shown that neural networks – a form of artificial intelligence – can accurately analyze the complex distortions in spacetime known as gravitational lenses 10 million times faster than traditional methods.
“Analyses that typically take weeks to months to complete, that require the input of experts and that are computationally demanding, can be done by neural nets within a fraction of a second, in a fully automated way and, in principle, on a cell phone’s computer chip,” said postdoctoral fellow Laurence Perreault Levasseur, a co-author of a study published today in Nature.
KIPAC scientists have for the first time used artificial neural networks to analyze complex distortions in spacetime, called gravitational lenses, demonstrating that the method is 10 million times faster than traditional analyses. (Greg Stewart/SLAC National Accelerator Laboratory)
Lightning Fast Complex Analysis
The team at the Kavli Institute for Particle Astrophysics and Cosmology (KIPAC), a joint institute of SLAC and Stanford, used neural networks to analyze images of strong gravitational lensing, where the image of a faraway galaxy is multiplied and distorted into rings and arcs by the gravity of a massive object, such as a galaxy cluster, that’s closer to us. The distortions provide important clues about how mass is distributed in space and how that distribution changes over time – properties linked to invisible dark matter that makes up 85 percent of all matter in the universe and to dark energy that’s accelerating the expansion of the universe.
Until now this type of analysis has been a tedious process that involves comparing actual images of lenses with a large number of computer simulations of mathematical lensing models. This can take weeks to months for a single lens.
But with the neural networks, the researchers were able to do the same analysis in a few seconds, which they demonstrated using real images from NASA’s Hubble Space Telescope and simulated ones.
To train the neural networks in what to look for, the researchers showed them about half a million simulated images of gravitational lenses for about a day. Once trained, the networks were able to analyze new lenses almost instantaneously with a precision that was comparable to traditional analysis methods. In a separate paper, submitted to The Astrophysical Journal Letters, the team reports how these networks can also determine the uncertainties of their analyses.
KIPAC researcher Phil Marshall explains the optical principles of gravitational lensing using a wineglass. (Brad Plummer/SLAC National Accelerator Laboratory)
Prepared for Data Floods of the Future
“The neural networks we tested – three publicly available neural nets and one that we developed ourselves – were able to determine the properties of each lens, including how its mass was distributed and how much it magnified the image of the background galaxy,” said the study’s lead author Yashar Hezaveh, a NASA Hubble postdoctoral fellow at KIPAC.
This goes far beyond recent applications of neural networks in astrophysics, which were limited to solving classification problems, such as determining whether an image shows a gravitational lens or not.
The ability to sift through large amounts of data and perform complex analyses very quickly and in a fully automated fashion could transform astrophysics in a way that is much needed for future sky surveys that will look deeper into the universe – and produce more data – than ever before.
The Large Synoptic Survey Telescope (LSST), for example, whose 3.2-gigapixel camera is currently under construction at SLAC, will provide unparalleled views of the universe and is expected to increase the number of known strong gravitational lenses from a few hundred today to tens of thousands.
“We won’t have enough people to analyze all these data in a timely manner with the traditional methods,” Perreault Levasseur said. “Neural networks will help us identify interesting objects and analyze them quickly. This will give us more time to ask the right questions about the universe.”
KIPAC researchers used images of strongly lensed galaxies taken with the Hubble Space Telescope to test the performance of neural networks, which promise to speed up complex astrophysical analyses tremendously. (Yashar Hezaveh/Laurence Perreault Levasseur/Phil Marshall/Stanford/SLAC National Accelerator Laboratory; NASA/ESA)
A Revolutionary Approach
Neural networks are inspired by the architecture of the human brain, in which a dense network of neurons quickly processes and analyzes information.
In the artificial version, the “neurons” are single computational units that are associated with the pixels of the image being analyzed. The neurons are organized into layers, up to hundreds of layers deep. Each layer searches for features in the image. Once the first layer has found a certain feature, it transmits the information to the next layer, which then searches for another feature within that feature, and so on.
“The amazing thing is that neural networks learn by themselves what features to look for,” said KIPAC staff scientist Phil Marshall, a co-author of the paper. “This is comparable to the way small children learn to recognize objects. You don’t tell them exactly what a dog is; you just show them pictures of dogs.”
But in this case, Hezaveh said, “It’s as if they not only picked photos of dogs from a pile of photos, but also returned information about the dogs’ weight, height and age.”
Scheme of an artificial neural network, with individual computational units organized into hundreds of layers. Each layer searches for certain features in the input image (at left). The last layer provides the result of the analysis. The researchers used particular kinds of neural networks, called convolutional neural networks, in which individual computational units (neurons, gray spheres) of each layer are also organized into 2-D slabs that bundle information about the original image into larger computational units. (Greg Stewart/SLAC National Accelerator Laboratory)
Although the KIPAC scientists ran their tests on the Sherlock high-performance computing cluster at the Stanford Research Computing Center, they could have done their computations on a laptop or even on a cell phone, they said. In fact, one of the neural networks they tested was designed to work on iPhones.
“Neural nets have been applied to astrophysical problems in the past with mixed outcomes,” said KIPAC faculty member Roger Blandford, who was not a co-author on the paper. “But new algorithms combined with modern graphics processing units, or GPUs, can produce extremely fast and reliable results, as the gravitational lens problem tackled in this paper dramatically demonstrates. There is considerable optimism that this will become the approach of choice for many more data processing and analysis problems in astrophysics and other fields.”    
Part of this work was funded by the DOE Office of Science.
-Written by Manuel Gnida

Citation: Y.D. Hezaveh, L. Perreault Levasseur, P.J. Marshall, Nature, 30 August 2017 (10.1038/nature23463)
Press Office Contact: Andrew Gordon, agordon@slac.stanford.edu, (650) 926-2282

SLAC is a multi-program laboratory exploring frontier questions in photon science, astrophysics, particle physics and accelerator research. Located in Menlo Park, California, SLAC is operated by Stanford University for the U.S. Department of Energy Office of Science. To learn more, please visit www.slac.stanford.edu.
SLAC National Accelerator Laboratory is supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.