Deep learning has spawned work on a wide variety of novel chips, but the most interesting architectures have yet to be designed, let alone benchmarked.
SAN JOSE, Calif. — The potential for new architectures to accelerate deep learning is enormous. So far, only one novel chip has been fully described and benchmarked — Google’s TPU — but the pipeline is full and a few of the techniques are becoming clear.
The jungle is dense with possibilities. They include analog computing, a variety of emerging memory and packaging types, and a basket of techniques specific to handling neural networks such as pruning and quantization.
“It’s wide open with people working at every level,” said Marian Verhelst, a professor at KU Leuven in Belgium who worked on research chips exploring binary precision formats. Analog computing looks useful, especially for 3- to 8-bit formats, she said.
The processor-in-memory architectures in the works today using analog computing in 40-nm NOR cells are just the start of strange beasts to come. Such designs will need to migrate to exotic phase-change, resistive, or magnetic RAM cells using FinFETs to scale, noted another researcher.
For its part, Nvidia “has several research projects on analog computing for deep learning, but I have yet to find one worthy of a product,” said Bill Dally, chief scientist for the company and a veteran processor researcher. Some of the math neural nets require and the results that they generate — such as activations — are not well-suited to analog implementations, he added.
“All the ideas rejected for CPUs in the past — analog computing, processor-in-memory, and wafer-scale integration — are being explored again,” said David Patterson, a veteran computer researcher now spending some of his time at Google. “I can’t wait to see how well these radical ideas may work.”
“Two or three years ago, every good computer architect said, ‘I can do this 100× faster,’ so we will soon see lots of different solutions that offer a step function improvement, all getting close to the limits of current technology,” said Chris Rowen, a co-founder of MIPS, Tensilica, and now an AI software startup.
Rowen left his post at Cadence to explore the landscape of deep-learning startups before founding BabbleLabs. He still tries to track about 25 of them (see chart below) that target everything from cloud servers to embedded systems.
A view comparing two dozen of today's AI silicon startups. (Source: Chris Rowen, BabbleLabs)
Startups snub new AI benchmark so far
One of the great frustrations of this renaissance of processor design is the waiting.
Baidu and Google launched the MLPerf benchmark last May in part to create a fair way to measure the chips expected from “dozens and dozens” of startups. “It was a little disappointing that none of the startups submitted results for the first iteration,” said Patterson, who works on the project.
“Maybe they are being tactical, but it raises questions about whether they are having problems building working chips or the chips don’t perform as well as they hoped or their software stack is too immature to run the benchmarks well,” he said, throwing down the gauntlet.
The first results on the training benchmark using ResNet-50 showed almost 100% performance scaling for Google’s TPUv3 as it went from eight to 256 chips. By contrast, Nvidia’s Volta scaled about 27% as it went from eight to 640 chips, Patterson noted.
The TPU got an edge because it was designed to run as a multiprocessor on its own network. By contrast, Volta rides x86 clusters, he explained.
Patterson remains hopeful that MLPerf will become for AI accelerators the equivalent of SPEC for CPUs. Another batch of training results is expected later this year. A version of MLPerf for inference jobs in the data center and at the edge will also debut this year.
One researcher warned that the industry is too focused on peak performance.
“We feel peak performance is not that useful because it doesn’t take into account discrepancies in efficiency,” said Erwei Wang, a Ph.D. candidate at Imperial College who co-authored a recent survey of AI accelerators. “People should publish sustained performance on standard data sets and benchmarks to better compare architectures.”
Landscape is still in the dark, analysts say
Analysts complain that high-profile startups, including Graphcore and Wave Computing, have yet to provide performance data. The exception so far is Habana Labs.
The startup “seems to have something real, detailing performance of 3× to 5× Nvidia GPUs in a white paper … but they are initially focused on inference, not training,” said Linley Gwennap of The Linley Group.
“Precious little is actually shipping” from the startups, said Karl Freund, who follows the area for Moor Insights & Strategy.
Habana is only sampling, Wave claims to have shipped to unnamed customers, and Graphcore says that it will ship chips by April, Freund said. Groq, formed by former Googlers, could come out of the closet at an event in Beijing in April, and others may release products at a San Francisco event in September, he added.
The exceptions are a handful of China startups such as Cambricon and Horizon Robotics that took an early lead over their U.S. counterparts, getting to market first with chips focused on inference jobs.
“There will be a gold rush for inference because there is no 800-pound gorilla to displace, but I’m skeptical anyone will give Nvidia’s GPUs a significant challenge in training because you don’t switch for an advantage in one product cycle; you need a sustainable lead,” Freund said.
“The only company with a real shot in training is Intel with a late-2019 Nervana chip,” he added. “They will wait until they get it right because if you only have a bunch of MACs and reduced precision, Nvidia will kill you. But they need to solve the memory bandwidth and scaling problem.”
Intel has multiple boats in the race. One Intel AI software manager said that he works most closely with the pairing of its Xeon CPU and a new GPU being designed under former Apple and AMD graphics guru Raja Koduri.
Intel baked new features into its latest Xeon, Cascade Lake, to accelerate AI. The features are expected to take the edge off the need for a GPU or accelerator, but the chip is not expected to compete head to head with them in performance or efficiency.
For its part, Nvidia is packaging its latest 12-nm processors in a variety of workstations, servers, and rack systems. Some say it is far enough ahead in training that it can save a 7-nm chip for a 2020 follow-on.
The big companies are building competing ecosystems based on proprietary interconnects, packaging techniques, programming tools, and other technologies. For example, Nvidia chips ride its NVLink and use its CUDA environment.
Intel has the most extensive set of lock-ins. They include its proprietary processor interconnect, a memory protocol for its Optane DIMMs, a network fabric, and emerging EMIB and Foveros chip packages.
Rivals AMD, Arm, IBM, and Xilinx ganged together around CCIX and GenZ, a cache-coherent interconnect for accelerators and a link for memory, respectively. Recently, Intel countered with a more open processor interconnect for accelerators and memory called CXL, but so far, it lacks the third-party support of CCIX and GenZ.
Data centers try DIY silicon
While startups race to claim sockets in servers, some of their biggest potential customers are rolling their own accelerators.
Google is already using a third generation of the TPU, a version using liquid cooling so that it can run flat out. Baidu announced its first chip last year, Amazon said that it will roll out its first one later this year, Facebook is ramping up a semiconductor team, and Alibaba acquired a company with processor expertise last year.
With the exception of Google, most are pretty tight-lipped on the architecture and performance of their chips. However, Baidu said that its 14-nm Kunlun comes in versions for training and inference jobs. It delivers 260 tera-operations/second (TOPS) while consuming 100 W and packs thousands of cores with an aggregate 512 GB/s of memory bandwidth.
For its part, Amazon said that its Inferentia will provide hundreds of TOPS of inference throughput, and multiple chips can be used together to drive thousands of TOPS. It claims that existing GPU clusters are inefficient on inference tasks, running at an average of 10% to 30% utilization rates.
“Many startups built a business around selling to a few hyper-scalers, and now, this is probably not going to work,” said Zac Smith, chief executive of Packet, a startup that aims to carve out a niche as a second-tier public cloud provider.
The chips designed by cloud giants may never see teardowns, but plenty of embedded blocks have been described in some detail. They show an evolution from modified DSP and GPU blocks to use of multiply-accumulate arrays to dataflow architectures that pass information generated from one level to the next in a neural net, said Mike Demler, a Linley Group analyst.
Like an AI block in the latest Samsung Exynos, many chips also show a move to heavy use of pruning and quantization, running 8- and 16-bit operations to optimize for efficiency and network sparsity. “If you are not using sparsity and compression at this point, you are behind the curve,” said Demler.
Pruning will become increasingly important. Yann LeCun, father of the popular convolutional neural network (CNN), said that neural-net models will only get larger, demanding more performance. However, he noted that they can be radically pruned given that the human brain supposedly uses only 2% of its maximum activations.
In a recent paper for an audience of chip designers, he called for chips that can handle extremely sparse networks. “When most units are off most of the time, it may become advantageous to make our hardware event-driven so that only the units that are activated consume resources,” he wrote.
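The idea behind event-driven hardware can be sketched in a few lines of NumPy. The toy example below (names and the 2% activity figure are illustrative, not from any specific chip) shows that when most activations are zero, a layer only needs to touch the weight columns for the units that fired:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 512)).astype(np.float32)

# Simulate a very sparse activation vector: ~2% of units "on",
# echoing the 2% brain-activity figure LeCun cites.
a = rng.normal(size=512).astype(np.float32)
a[rng.random(512) > 0.02] = 0.0

dense = W @ a                        # baseline: multiply every column, mostly by zero

active = np.nonzero(a)[0]            # event-driven: find the units that actually fired
sparse = W[:, active] @ a[active]    # only touch the ~2% of columns that matter

print(f"active units: {len(active)} / 512, results match: {np.allclose(dense, sparse)}")
```

Event-driven silicon takes this a step further: rather than scanning for nonzeros after the fact, only the activated units consume compute and memory bandwidth at all.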
Recurrent neural nets are the sparsest and, thus, can be cut back the most using fine-grained pruning. Pruning of 50% to 90% may be optimal for CNNs, but chip designers will face challenges supporting the irregularity and flexibility of fine-grained pruning, said Wang of Imperial College.
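Fine-grained (per-weight) magnitude pruning, the kind Wang describes, can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular chip's or framework's method:

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (fine-grained pruning)."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)               # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep only weights above the cutoff
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
pruned = prune_by_magnitude(w, 0.9)               # 90% pruning, the upper end cited for CNNs
print(f"fraction zeroed: {np.mean(pruned == 0):.2f}")
```

The hardware challenge Wang points to follows directly: the surviving weights land at irregular positions, so exploiting the zeros requires index bookkeeping rather than a regular dense array.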
Reducing both the number of weights and the precision level helps reduce memory requirements. Intel’s Xeon and many other chips already execute inference jobs using 8-bit integer data, while FPGA and embedded chips are pushing into 4-bit and even binary precision, Wang said.
The goal is to move processing as close to the memory as possible, avoiding off-chip accesses. Ideally, that means computing inside registers or at least in caches, he added.
In his paper, LeCun even imagined programmable registers that teamed memory and processing units.
“To endow deep-learning systems with the ability to reason, they need a short-term memory to be used as an episodic memory … such memory modules will become commonplace and very large, requiring hardware support,” he wrote.
Flexibility needed for life beyond MAC units
If you have to go off-chip, batching many requests into a few larger ones has become a popular technique. Patterson noted a recent Google paper that shines some light on a raging debate over the optimal size of batches.
“If you take care, there’s a region where you get a perfect speedup, then as you increase batch size, you see diminishing returns and then performance plateaus across many models,” Patterson said.
In his paper, LeCun warned that “we will need new hardware architectures that can function efficiently with a batch size of one … implying the end of reliance on matrix products as the lowest-level operator,” sounding a death knell to the multiply-accumulate units at the core of today’s chips.
Given that these are still early days for deep learning, the most important guideline is to stay flexible, seeking a balance between programmability and performance.
“Our lesson learned was that neural nets continue to evolve … you can’t make assumptions about the dimensions of a neural net, and you want to be efficient across a wide range of them,” said Vivienne Sze, who worked on the Eyeriss chip.
FPGAs will have a role to play while deep learning evolves, requiring flexible hardware, said Wang. He is bullish on the Versal ACAP from Xilinx as a hybrid of an FPGA with hardened ASIC-like blocks.
Wang will give a glimpse of what the future may hold with a paper that he will present at an FPGA conference in late April. His so-called LUTNet research shows how a lookup table can be tailored to serve as an inference core handling fine-grained pruning without the need for maintaining an index. The result cuts in half the silicon area needed for inference, he claims.
It’s a novel idea at a time when engineers are throwing the kitchen sink at deep learning. For example, Toshiba won praise for a recent ADAS accelerator that packed four Cortex-A53 cores, two Cortex-R4s, four DSPs, and eight specialized accelerator blocks into a 94.5-mm² chip.
As extreme as the SoC seems, it is just the start of wild things to come.
LeCun called for engineers to imagine “exotic number representation for inference on low-power hardware … Increasingly, input data will come to us in a variety of forms, beyond tensors, such as graphs annotated with tensors and symbols.”
“Generally speaking, the evolution is toward more sophisticated … dynamic network architectures that change with each new input in a data-dependent way, inputs and internal states that are not regular tensors but are graphs whose nodes and edges are annotated with numerical objects, including tensors,” he said, noting that graphs are “likely to violate the assumptions of current deep-learning hardware.”
The fun is just beginning.