Sunday, September 3, 2017

Ex-Baidu Scientist Blazes AI Shortcut

http://www.eetimes.com/document.asp?doc_id=1332226


Native support for 3D tensor operation
8/31/2017 05:31 PM EDT
MADISON, Wis. — Ren Wu, formerly a distinguished scientist at Baidu, has pulled a new AI chip company out of his sleeve, called NovuMind, based in Santa Clara, Calif.
In an exclusive interview with EE Times, Wu discussed the startup’s developments and what he hopes to accomplish.
Established two years ago, NovuMind employs 50 people, including 35 engineers in the U.S. and 15 in Beijing. The startup is testing what Wu describes as a minimalist approach to deep learning.
Rather than designing general-purpose deep-learning chips like those based on Nvidia GPUs or Cadence DSPs, NovuMind has focused exclusively on developing a deep learning accelerator chip that “will do inference very efficiently,” Wu told us.
NovuMind has designed an AI chip that uses only very small (3x3) convolution filters.
This approach might seem counterintuitive at a time when the pace of artificial intelligence has accelerated almost dizzyingly. Indeed, many competitors concerned with yet-to-emerge AI algorithms have set their sights on chips that are as programmable and powerful as possible.
In contrast, NovuMind is concentrating on “only the core of the neural network that is not likely to change,” said Wu. He explained that 5x5 convolution can be done by stacking two 3x3 filters with less computation, and 7x7 is possible by stacking three. “So, why bother with those other filters?”
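Wu's arithmetic here is easy to check. Two stacked 3x3 filters cover the same 5x5 receptive field as a single 5x5 filter but need 18 weights (and multiply-accumulates per output pixel) instead of 25; three stacked 3x3 filters cover a 7x7 window with 27 instead of 49. A minimal sketch of the counting argument (function names are illustrative, not NovuMind's code):

```python
def macs_per_pixel(kernel_size: int, num_layers: int = 1) -> int:
    """Multiply-accumulates per output pixel for `num_layers` stacked
    kernel_size x kernel_size convolutions (single channel, stride 1)."""
    return num_layers * kernel_size * kernel_size

def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return num_layers * (kernel_size - 1) + 1

# Two stacked 3x3 convs see a 5x5 window at lower cost: 18 MACs vs. 25.
assert stacked_receptive_field(3, 2) == 5
assert macs_per_pixel(3, 2) < macs_per_pixel(5, 1)

# Three stacked 3x3 convs see a 7x7 window: 27 MACs vs. 49.
assert stacked_receptive_field(3, 3) == 7
assert macs_per_pixel(3, 3) < macs_per_pixel(7, 1)
```

The savings grow with filter size, which is why, in Wu's words, there is little reason to "bother with those other filters."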
The biggest problem with architectures like DSP and GPU in deep-learning accelerators on edge devices is “the very low utilization” of their processors, Wu said. NovuMind solves “this efficiency issue by using unique tensor processing architecture.”
Wu calls NovuMind’s idea — focused on the minimum set of convolutions in a neural network — “aggressive thinking.”  He said the mission of his new chip is to embed power-efficient AI everywhere.
The company’s first AI chip — designed for prototyping — is expected to be taped out before Christmas. Wu said that by February next year he expects applications to be up and running on a chip delivering 15 tera-operations per second (TOPS) at under 5 watts.
A second chip, designed to run under a watt, is due in mid-2018, he added.
NovuMind's new chip will natively support TensorFlow, Caffe, and Torch models.
The endgame of Wu’s AI chip is to enable a tiny Internet-connected “edge” device to not only “see” but “think” (and recognize what it sees), without hogging the bandwidth going back to the data center. Wu calls it the Intelligent Internet of Things (I²oT).
Ren Wu
For Wu, who hasn’t sought much publicity in the last few years, NovuMind presents, in a way, an opportunity for redemption.
Two years ago, Wu was let go by Baidu, after the Chinese search giant was disqualified from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015. Wu subsequently denied wrongdoing in what was then labeled as “Machine learning’s first cheating scandal.”
Speaking with EE Times, he declined to discuss that event, other than noting, “I think I was set up.”
In today’s hotly pursued market of deep-learning accelerators for edge devices, NovuMind is forging ahead. After raising $15.2 million in series A funding in December 2016, NovuMind is about to begin a second round of fundraising, said Wu. “That’s why I am in Beijing now,” he told me during a phone interview.



3D tensor operation
As Wu tells us, the key to deep-learning acceleration, especially on edge devices, is to maximize efficiency while minimizing latency. Naturally, many edge devices are constrained by cost and battery life. Latency has no place in drones and autonomous vehicles, since they must be able to recognize imminent danger without delay.
Against that backdrop, Wu noted two existing solutions currently available for deep-learning acceleration in edge devices: DSPs (such as CEVA's and Cadence's Tensilica cores) and GPUs (such as Nvidia’s TX series).
As he explained, DSP was designed for digital filtering, using 1D multiplication-and-accumulation (MAC) to finish the task. The essence of GPU (and Tensor Processing Unit) operation is 2D general matrix multiplication (GEMM).
(Source: NovuMind)
In Wu’s opinion, neither DSP nor GPU is efficient enough for deep-learning acceleration tasks. He explained that the state-of-the-art in deep-learning network model computation is 3D tensor operation. “Naturally, when you convert 3D tensor operation into 1D MAC operation (for DSP case) or 2D GEMM operation (for GPU case), you lose a lot of efficiency.”
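The cost of that conversion is concrete: mapping a 3D convolution onto 2D GEMM typically goes through an im2col step that copies each input element once per filter position it touches, inflating memory traffic by nearly a factor of k² for a k x k kernel. A back-of-the-envelope sketch of that blow-up (a generic im2col accounting, not NovuMind's internals; the function name is illustrative):

```python
def im2col_expansion_factor(channels: int, height: int, width: int,
                            k: int, stride: int = 1) -> float:
    """Ratio of im2col matrix size to the original input tensor size
    for a k x k convolution with the given stride (no padding)."""
    h_out = (height - k) // stride + 1
    w_out = (width - k) // stride + 1
    input_elems = channels * height * width
    im2col_elems = (channels * k * k) * (h_out * w_out)
    return im2col_elems / input_elems

# A 3x3 conv over a 64-channel 56x56 feature map copies roughly 8.4x
# the input data; the factor approaches k*k = 9 for large feature maps.
factor = im2col_expansion_factor(64, 56, 56, 3)
assert 8 < factor < 9
```

Operating on the 3D tensor directly avoids materializing that expanded matrix, which is the memory-bandwidth saving Wu describes below.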
Wu explained, “That’s why even though GPU and DSP claim high peak performance (~1-2 TOPS), their average performance when running a real deep-learning network inference is only 20 to 30 percent of peak in real-time applications.”
He said much processing energy is wasted on memory access. On average, 70 to 80 percent of computation resources lie idle, waiting for data from memory.
NovuMind uses what Wu described as “unique tensor processing architecture.”
In NovuMind’s chip architecture, 3D tensor operation is natively supported, he noted. This helps “greatly enhance efficiency in terms of both energy and silicon area.”
According to Wu, NovuMind’s architecture can achieve 75 to 90 percent of its peak performance in real applications.
Memory hierarchy
Wu claimed that NovuMind’s design “based on 3D tensor operation” has given its AI chip “a tremendous advantage.” He noted, “Because we work directly on 3D tensor and we don’t need to do the intermediate step to expand convolution into 2D matrix, we are able to save a lot of memory bandwidth, memory access energy.”



Trade-offs
Engineering is all about trade-offs. In pursuit of the power efficiency necessary for embedded AI, what did NovuMind have to give up in its AI chips?
Wu said, “We only support a limited set of topology, such as layers defined in VGG, RESNET network, and another small set of other network layers we think are important and relevant.”
He noted, “Our chip will compute these supported network layers very efficiently. It can still do other layers, but it is not as optimal.”
Asked about downside, he described NovuMind’s AI chip as “less general.” If the network contains many unsupported layers, “its performance is no longer competitive,” he said. But Wu is confident. “We believe, with our strong AI team and in-house training capabilities, we have covered all important layers relevant to real-world applications.”
We also asked what convinced NovuMind that 3x3 filters are the way to go. Wu said, “I have to give credit to the original VGG paper and its authors.”
VGG is the Visual Geometry Group in the Department of Engineering Science at the University of Oxford. VGG researchers authored the 2015 paper “Very Deep Convolutional Networks for Large-Scale Image Recognition.”
The VGG paper convinced Wu to map its architecture into hardware. Wu was surprised to find out how hardware-friendly it was. “This is one of the very rare cases that algorithm designers have come up with such an elegant and hardware-friendly design. Just beautiful,” he said. Wu believes that the other practical, useful network topologies we see today are based on the work done by VGG.
Wu added, “Since 3x3 convolution is such an important building block, our design of course will make sure, do whatever we can, to make it as efficient as possible.”
Latency comparisons
Wu claims NovuMind's architecture also excels in latency compared to DSP and GPU.
He observed, “DSP is designed for stream data processing, and its latency is good.”
On the other hand, he noted, “GPU generally needs batch operation and its latency is poor (50-300 ms with batch size of 8-64),” making it difficult to meet real-time demands.
He explained that the NovuMind architecture also uses stream-mode data processing (latency < 3 ms). “We can imagine when an autonomous car drives at 65 mph and needs to brake at once, the latency advantage of NovuMind architecture over GPU translates into a range of 4.5-30 feet of distance.” He boasted, “This can make a big difference in an autonomous car.”
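Wu's distance figures follow from simple arithmetic: at 65 mph a car covers about 95 feet per second, so the 47 to 297 ms gap between batch-mode GPU latency (50-300 ms) and ~3 ms stream-mode latency corresponds to roughly 4.5 to 28 feet of extra travel. A quick check (the function name is illustrative):

```python
def latency_to_distance_ft(latency_s: float, speed_mph: float) -> float:
    """Distance in feet traveled during `latency_s` seconds at `speed_mph`."""
    feet_per_second = speed_mph * 5280 / 3600  # 65 mph ~ 95.3 ft/s
    return latency_s * feet_per_second

# Latency gap between GPU batch inference (50-300 ms) and ~3 ms streaming.
low = latency_to_distance_ft(0.050 - 0.003, 65)   # ~4.5 ft
high = latency_to_distance_ft(0.300 - 0.003, 65)  # ~28 ft
assert 4 < low < 5
assert 27 < high < 30
```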
(Source: NovuMind)
Roadmap
NovuMind’s first chip will be manufactured by an undisclosed foundry, using a 28nm process technology. The second chip — aimed at mid-2018 for tape-out — will be using a 16nm process technology, according to Wu.
Describing the first chip as produced for prototyping purposes, Wu posed several scenarios for its applications. One is a USB stick incorporating the NovuMind chip, turning connected devices, such as connected cameras, into AI-driven systems. Second, with 15 TOPS of performance, the AI chip can be used in “autonomous cars,” Wu said. The third application, suggested by Wu, is using the AI chip for cloud acceleration.
GPUs used in data centers place limitations on rack space, Wu observed; the culprit is the higher power dissipation, and the extra heat, that GPUs generate. Although NovuMind’s AI chip is designed for edge devices, when put on a PCI board inside a server, its tiny package can efficiently run a single application, such as speech recognition, that must be processed at the data center.
But really, what sort of AI applications are best for NovuMind’s AI chip? Is NovuMind saying that its AI chip would be ideal for, say, pathfinding in autonomous driving?





Wu said no. A centralized computing unit in an autonomous vehicle today would be “a lot more complicated than anybody imagines,” he explained. In reality, he expects multiple AI chips to pre-process the data and feed it to a central unit that supposedly makes smart decisions. NovuMind’s AI chip will be one of the many AI chips inside an autonomous car, he explained.
Thus far, Wu said, he knows the company’s AI chip can run an application such as “city/nation scale, multi-stream, multi-target face recognition.” With its ability to bring in and handle 128 HD video streams, a system powered by this chip can recognize millions of targeted people across 100,000 connected cameras, for example. More important, “We can do it on the edge, with no substantial bandwidth, storage space and setup required for connected cameras,” he explained.
Adding intuitions to sensors
Asked about the future of deep learning, Wu said, “Armed with big data and massive computational power in our hands, we have been able to train neural networks to do many sophisticated things.” That's where the AI community is at today.
But what NovuMind hopes to enable, explained Wu, is to add “intuition” to sensors. Just like humans and animals are equipped with five senses, machines should be able to have certain “instincts” that help them react.
When it comes to general intelligence, reasoning and long-term memory for machines, though, Wu said, “We still have a long way to go.”
— Junko Yoshida, Chief International Correspondent, EE Times



