The AI community is aiming for higher-performance, more power-efficient inference engines for deep neural networks.
Current deep learning systems rely on advances in raw computing power to define networks, on big data sets for training, and on access to large computing systems to accomplish their goals.
Unfortunately, running these networks efficiently is not easy on embedded systems (e.g., cars, drones and IoT devices), whose processing power, memory size and bandwidth are usually limited. This problem leaves the field wide open for innovation in technologies that can put deep neural network power into end devices.
Asked what’s driving AI to the edge, Marc Duranton, fellow of CEA's Architecture, IC Design and Embedded Software division, cited three factors in a recent interview with EE Times: “safety, privacy and economy.” These are prompting the industry to process data at the end node. Duranton sees a growing demand to “transform data into information as early as possible.”
Think autonomous cars, he said. If the goal is safety, autonomous functions shouldn’t rely on always-on connectivity to the network. When an elderly person falls at home, the incident should be detected and recognised locally. That’s important for privacy reasons, said Duranton. But not having to transmit all the images collected from 10 cameras installed at home just to trigger an alarm can also reduce “power, cost and data size,” Duranton added.
Inference engines rush
In many ways, chip vendors are fully cognisant of this increasing demand for better inference engines. Semiconductor suppliers like Movidius (armed with Myriad 2), Mobileye (EyeQ 4 and 5) and Nvidia (Drive PX) are racing to develop ultra-low-power, higher-performance hardware accelerators that can run deep learning more effectively on embedded systems.
Their SoC work illustrates that inference engines are already becoming “a new target” for many semiconductor companies in the post-mobile era, observed Duranton.
Google’s Tensor Processing Units (TPUs) unveiled earlier this year marked a turning point for an engineering community eager for innovations in machine learning chips.
At the time of the announcement, the search giant described TPUs as offering “an order of magnitude higher performance per watt than commercial FPGAs and GPUs.” Google revealed that the accelerators were used for the AlphaGo system, which beat a human Go champion. However, Google has never discussed the details of TPU architecture, and the company won’t be selling TPUs on the commercial market.
Many SoC designers believe Google’s move made the case that machine learning needs a custom architecture. But as they attempt to design a custom machine-learning chip, they wonder what its architecture should look like. More important, they want to know whether the world already has a benchmarking tool to gauge deep neural network (DNN) performance on different types of hardware.
CEA has said it’s fully prepared to explore different hardware architectures for inference engines. CEA developed a software framework, called N2D2, enabling designers to explore and generate DNN structures. “We developed this as a tool to select the right hardware target for DNN,” said Duranton. N2D2 will become available as open source in the first quarter of 2017, he promised.
The key to this new tool is that N2D2 doesn’t just compare different hardware on the basis of recognition accuracy. It can compare hardware in terms of “processing time, hardware cost and energy consumption.” This is critical, said Duranton, because different applications for deep learning will likely require different parameters in various hardware implementations.
N2D2 offers benchmarking on a variety of commercial off-the-shelf hardware, including multi/many-core CPUs, GPUs and FPGAs.
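To make the idea of comparing targets on several criteria at once more concrete, here is a minimal sketch in Python. It is not N2D2 code and does not use its API. The `Result` record, the target names and every number in it are hypothetical placeholders, not measurements. The sketch keeps only the candidates that no other candidate beats on every metric (a Pareto front), which is one common way to shortlist hardware when accuracy, time, energy and cost all matter.

```python
from typing import NamedTuple


class Result(NamedTuple):
    target: str        # hypothetical hardware label
    accuracy: float    # fraction of correct recognitions, higher is better
    latency_ms: float  # processing time per input, lower is better
    energy_mj: float   # energy per input, lower is better
    cost_usd: float    # hardware cost, lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on every metric and better on one."""
    a_vals = (-a.accuracy, a.latency_ms, a.energy_mj, a.cost_usd)
    b_vals = (-b.accuracy, b.latency_ms, b.energy_mj, b.cost_usd)
    return all(x <= y for x, y in zip(a_vals, b_vals)) and a_vals != b_vals


def pareto_front(results):
    """Keep the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Placeholder values for illustration only; they describe no real device.
candidates = [
    Result("target_a", 0.90, 40.0, 200.0, 50.0),
    Result("target_b", 0.90, 5.0, 300.0, 400.0),
    Result("target_c", 0.89, 8.0, 60.0, 150.0),
    Result("target_d", 0.88, 50.0, 250.0, 60.0),
]

shortlist = pareto_front(candidates)
for r in shortlist:
    print(r.target, r.accuracy, r.latency_ms, r.energy_mj, r.cost_usd)
```

Which of the shortlisted targets is "right" still depends on the application, which is the point Duranton makes about different deep learning applications needing different parameters.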
As a research organisation, CEA has been studying how best to bring deep neural networks to edge computing. Asked about barriers to DNN on edge computing, Duranton said it’s clear that “floating point” server solutions cannot be applied, because of “power, size and latency constraints.” Other limitations include the “number of MACs [multiply-accumulate operations], bandwidth and on-chip memory size,” he added.
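One widely used alternative to floating-point arithmetic is to quantise weights and activations to small integers, so that the MACs themselves run on integer hardware. The sketch below illustrates that general technique with symmetric 8-bit quantisation. It is an assumption chosen for illustration, not a description of CEA's approach, and the function names and sample values are hypothetical.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantisation: floats -> signed integers plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(weights, activations, num_bits=8):
    """Dot product whose multiply-accumulates use integers only."""
    qw, scale_w = quantize(weights, num_bits)
    qa, scale_a = quantize(activations, num_bits)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer MACs
    return acc * scale_w * scale_a            # single rescale at the end


# Hypothetical sample values, chosen only to exercise the functions.
weights = [0.5, -0.25, 0.125, 1.0]
activations = [1.0, 2.0, -1.0, 0.5]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int_dot(weights, activations)
print("float result:", exact)
print("8-bit result:", approx)
```

The integer result differs slightly from the floating-point one. Whether that error is acceptable depends on the network and the task.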
Duranton believes that specialised architectures could also use new coding, such as “spike coding.” As CEA researchers studied the properties of neural networks, they discovered that such networks are inherently tolerant to computing errors. This makes them a good candidate for “approximate computations,” they determined.
If so, we may not necessarily need binary coding. That’s good news because temporal coding, such as spike coding, can produce much more energy-efficient results at the edge, Duranton explained.
Spike coding is attractive because a spike-coded, or event-based, system mirrors how data is encoded in real neural systems. Further, event-based coding could be compatible with dedicated sensors and pre-processing.
A coding more similar to the one used by the nervous system also facilitates mixed analog and digital implementations, allowing researchers to build a smaller hardware accelerator consuming little energy.
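For readers unfamiliar with the idea, the following sketch shows one simple spike-coding scheme, rate coding with an integrate-and-fire encoder. It is an illustrative assumption, not the coding CEA uses, and all names and values are hypothetical. Each input between 0 and 1 becomes a train of spikes, and a weighted sum is then computed by adding a weight whenever a spike arrives, with no multiplication in the inner loop.

```python
def encode_rate(value, num_steps):
    """Integrate-and-fire encoder: spike each time the accumulated input reaches 1."""
    spikes, potential = [], 0.0
    for _ in range(num_steps):
        potential += value
        if potential >= 1.0:
            spikes.append(1)
            potential -= 1.0
        else:
            spikes.append(0)
    return spikes


def event_driven_sum(inputs, weights, num_steps):
    """Approximate sum(x * w) by adding w on each spike event."""
    trains = [encode_rate(x, num_steps) for x in inputs]
    total, events = 0.0, 0
    for t in range(num_steps):
        for train, w in zip(trains, weights):
            if train[t]:      # work happens only when a spike arrives
                total += w    # addition only, no multiply
                events += 1
    return total / num_steps, events


# Hypothetical inputs in [0, 1] and weights.
inputs = [0.5, 0.25, 0.75]
weights = [0.2, -0.4, 0.6]

estimate, events = event_driven_sum(inputs, weights, num_steps=100)
exact = sum(x * w for x, w in zip(inputs, weights))
print("spike-based estimate:", estimate)
print("exact weighted sum:", exact)
print("spike events processed:", events)
```

The estimate becomes more precise as the number of time steps grows, so precision can be traded against time and energy. That trade-off fits the tolerance to approximate computation described above.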
There are other factors that can help accelerate DNNs on edge computing. CEA, for example, is pondering the potential need to tune the neural network architecture itself to edge computing. “People have started talking about the use of ‘SqueezeNet’ instead of AlexNet,” Duranton noted. SqueezeNet can reportedly accomplish AlexNet-level accuracy with 50x fewer parameters, he explained. This sort of simplification, both in topology and in the number of MACs, is needed in edge computing.
The goal, as Duranton sees it, is the automated transformation of “classical” DNNs into “embedded” networks.
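A rough sense of where such savings come from can be had by counting parameters and MACs for a single layer. The sketch below compares a plain 3x3 convolution with a SqueezeNet-style "Fire" block, in which a 1x1 "squeeze" convolution feeds parallel 1x1 and 3x3 "expand" convolutions. The channel counts and feature-map size are hypothetical, biases are ignored, and this one block is not meant to reproduce the 50x figure, which concerns the whole network.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def conv_macs(kernel, c_in, c_out, h_out, w_out):
    """Multiply-accumulates: one per weight per output position."""
    return conv_params(kernel, c_in, c_out) * h_out * w_out


def fire_params(c_in, squeeze, expand_1x1, expand_3x3):
    """Weights in a Fire-style block: 1x1 squeeze, then 1x1 and 3x3 expand."""
    return (conv_params(1, c_in, squeeze)
            + conv_params(1, squeeze, expand_1x1)
            + conv_params(3, squeeze, expand_3x3))


# Hypothetical layer: 128 channels in, 128 channels out, 28x28 output map.
c_in, c_out, height, width = 128, 128, 28, 28

plain = conv_params(3, c_in, c_out)
fire = fire_params(c_in, squeeze=16, expand_1x1=64, expand_3x3=64)

print("plain 3x3 conv parameters:", plain)
print("fire block parameters:", fire)
print("parameter ratio:", plain / fire)
# With stride 1, every convolution in both variants produces the same
# output map size, so the MAC counts scale by the same factor.
print("plain 3x3 conv MACs:", conv_macs(3, c_in, c_out, height, width))
print("fire block MACs:", fire * height * width)
```

Fewer parameters also mean less on-chip memory and bandwidth, two of the constraints Duranton lists for edge devices.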