Cerebras will describe at Hot Chips the world's largest chip, a wafer-scale device aimed at unseating Nvidia's dominance in training neural networks.
SAN JOSE, Calif. – Startup Cerebras will describe at Hot Chips the world’s largest semiconductor device, a 16nm wafer-sized processor array that aims to unseat the dominance of Nvidia’s GPUs in training neural networks. The whopping 46,225 mm² die consumes 15 kW, packs 400,000 cores, and is running in a handful of systems with at least one unnamed customer.
Also at this week’s event, Huawei, Intel and startup Habana will detail their chips for training neural networks. All of them aim to attack Nvidia, which last year sold about $3 billion in GPUs for the performance-hungry application.
Intel’s 1.1-GHz Spring Crest aims to stand out from the pack by ganging its 64 28G serdes into 16 112-Gbit/second lanes linking up to 1,024 chips. The proprietary interconnect is a direct, protocol-less link that does not need to pass through external HBM2 memory, enabling a relatively fast way to spread large neural networks across multiple processors and chassis.
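Taking the article’s figures at face value, the serdes arithmetic works out to roughly 1.8 Tbit/s of aggregate chip-to-chip bandwidth; a quick sanity check:

```python
# Back-of-envelope check of the interconnect figures quoted above
# (these are the article's numbers, not an Intel spec sheet):
# 64 SerDes at 28 Gbit/s are ganged into 16 lanes at 112 Gbit/s each.
serdes_total = 64 * 28   # Gbit/s summed across all 28G serdes
lane_total = 16 * 112    # Gbit/s summed across the 112G lanes
assert serdes_total == lane_total == 1792
print(f"Aggregate chip-to-chip bandwidth: {lane_total / 1000:.3f} Tbit/s")
# -> Aggregate chip-to-chip bandwidth: 1.792 Tbit/s
```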
By putting all its cores, memories and interconnects on one wafer, the Cerebras approach will be even faster and fit in one box.
The startup has raised more than $200 million from veteran investors to be the first to commercialize wafer-scale integration, pioneering new techniques in packaging and wafer handling. It’s betting the AI training market will expand from seven hyperscale data centers to hundreds of companies in everything from pharma to fintech that want to keep their data sets to themselves.
How it works
The Cerebras device packs 84 tiles in a 7×12 array. Each tile includes about 4,800 cores geared for AI’s sparse linear algebra, each core with 48 Kbytes of SRAM as its sole memory.
The single-level memory hierarchy speeds processing, enabled by the training application’s limited need for memory sharing across cores. The chip’s total of 18 Gbytes of SRAM is huge compared to a single Nvidia GPU but small compared to the multi-GPU systems Cerebras aims to compete with.
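The quoted totals hang together: 84 tiles at roughly 4,800 cores apiece give the ~400,000 cores claimed, and 48 Kbytes per core lands near the 18-Gbyte SRAM figure. A back-of-envelope tally (the per-tile core count is the article’s approximation, so the totals are approximate too):

```python
# Tally the die-level figures from the article's per-tile numbers.
tiles = 7 * 12                    # 84 tiles in the 7x12 array
cores = tiles * 4800              # ~403,000 cores vs. the ~400,000 quoted
sram_gib = cores * 48 / 2**20     # 48 KiB of SRAM per core -> total on-die SRAM
print(cores, round(sram_gib, 1))  # -> 403200 18.5  (article: 18 Gbytes)
```

The slight overshoot versus the quoted 18 Gbytes is consistent with a small fraction of tiles being held back as spares, as described later in the article.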
The company will not comment on the frequency of the device which is likely low to help manage its power and thermal demands. The startup’s veteran engineers have “done 2-3 GHz chips before but that’s not the goal here–the returns to cranking the clock are less than adding cores,” said Andrew Feldman, chief executive and a founder of Cerebras.
Feldman wouldn’t comment on the cost, design or roadmap for the rack system Cerebras plans to sell. But he said the box will deliver the performance of a farm of a thousand Nvidia GPUs, a cluster that can take months to assemble, while requiring just 2-3% of its space and power.
The company aims to describe the system, its performance and benchmarks at the Supercomputing show in November. Attendees there will appreciate its historic significance, given the last similar effort was on a 3.5-inch wafer by Trilogy, the 1980s wafer-scale startup of Gene Amdahl.
The Cerebras compiler will ingest a TensorFlow or PyTorch model, convert it to machine language and use microcode libraries to map neural network layers to regions of the giant chip. It does that in part by programming instructions on the cores and configuring the mesh network that links the tiles.
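Cerebras has not published its placement algorithm, but the idea of mapping layers to regions can be sketched in miniature. The function below is purely illustrative, not the company’s method: it sizes each layer’s slice of the core array in proportion to that layer’s compute demand.

```python
# Hypothetical sketch -- NOT Cerebras's actual compiler. It illustrates
# one simple placement policy: give each network layer a region of the
# core array proportional to its FLOP count.
def map_layers(layer_flops, total_cores=400_000):
    """Return {layer_name: core_count}, proportional to compute."""
    total = sum(layer_flops.values())
    return {name: max(1, int(total_cores * flops / total))
            for name, flops in layer_flops.items()}

# Toy three-layer model with made-up FLOP figures, placed on 100 cores:
print(map_layers({"conv1": 2e9, "conv2": 6e9, "fc": 2e9}, total_cores=100))
# -> {'conv1': 20, 'conv2': 60, 'fc': 20}
```

A real compiler would also have to lay the regions out contiguously on the 2D mesh and configure the links between them, which this sketch ignores.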
“We will keep the whole network on the chip. Everyone else nibbles at the network and spends time going back and forth” over slower external interconnects often through memory, he said.
Nearly two-thirds of the 174 engineers at Cerebras are software developers, a sign of the complexity of the AI and compiler code. They face “a boatload of QA” before the first commercial systems are commissioned, Feldman said.
Facing Nvidia, Intel, Huawei and other startups
“If they can get this wafer to work, it will be groundbreaking,” said Karl Freund, an analyst for AI and high-end systems at Moor Insights & Strategy. “The problems they are solving are hard, but not moonshots, so I assume they will get this done sometime in the next year,” he added.
Cerebras faces Nvidia’s estimated share of more than 90% of the AI accelerator market. Its 16nm products will arrive about the same time Nvidia starts shipping its 7nm Ampere GPU.
At Hot Chips, Intel will describe its 28-core Spring Crest and startup Habana will present an eight-core training processor. Huawei will also describe its training chip, and startup Graphcore has amassed $300 million in financing and support from Dell for its 1,200-core chip.
“People are trying all kinds of things–how big the cores are, how much memory and bandwidth they have, and how they are connected. It remains to be seen what the right combination is,” said Linley Gwennap of the Linley Group, noting few are quoting benchmarks at this stage. (Training numbers for Spring Crest and Habana are expected on MLPerf before the end of October.)
AI software holds many potholes, such as how many of TensorFlow’s operations a chip supports and whether it can perform well across the wide range of neural network types, Gwennap added.
Pioneering wafer-scale integration
For its part, Cerebras plowed through challenges in yields, power and thermals to deliver a wafer-scale device. It is applying for about 30 patents and has about half a dozen issued so far.
For example, the typical 300mm wafer from TSMC may contain “a modest number of flaws,” said Feldman. Cerebras gave its Swarm interconnect redundant links to route around defective tiles and allocated “a little over 1% [of the tiles] as spares.”
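The redundancy idea can be illustrated with a toy remapping. The sketch below is not Cerebras’s actual scheme: it simply assigns logical tile indices to physical tile positions while skipping any position flagged as defective, so that a handful of defects is absorbed by the ~1% of tiles reserved as spares without shrinking the logical array.

```python
# Illustrative sketch -- not Cerebras's scheme. Map logical tiles onto
# physical positions, skipping defects; reserved spares absorb the loss.
def remap_tiles(n_physical, defective):
    """Return a logical->physical tile mapping that routes around defects."""
    good = [pos for pos in range(n_physical) if pos not in defective]
    return {logical: phys for logical, phys in enumerate(good)}

mapping = remap_tiles(84, defective={5, 41})  # 84 tiles, two flagged bad
print(len(mapping))                           # -> 82; positions 5 and 41 unused
```

On the real device the remapping would have to respect mesh adjacency, which is what the redundant Swarm links are for; this sketch only captures the bookkeeping.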
Of the more than 100 wafers it has produced to date, all are running at acceptable levels. To power and cool them, Cerebras designed its own board and cooling plate delivering power and water vertically to each tile. The rack includes a closed-loop system to air-cool the water.
It also worked with partners to design a machine for handling and aligning the wafer. “We have fluidics, materials scientists, and manufacturing engineers in the company,” Feldman said.
The startup worked with TSMC to invent a way to place its interconnects in the scribe lines between the tiles, an area usually reserved as a keep-out zone between die.
A whole new way to build a computer
The startup’s plan to unveil its system at the Supercomputer event suggests it sees a market for wafer-scale devices far beyond the seven hyperscale data centers.
As for AI training, “initially, we thought there would be 200 customers in the world, but we’ve revised that estimate to 1,000,” said Feldman. “Everywhere we go, we find companies with large data sets they don’t want to keep in Google Cloud where a single training run might cost $150,000,” he added.
Car makers, drug companies, oil and gas explorers and financial companies will handle their own training. “Hyperscalers are an important segment but they are nowhere near even half the market,” he said.
Fred Weber, an investor in Cerebras and a former engineering manager behind AMD’s Opteron CPUs, sees even broader potential for wafer-scale integration (WSI). He envisions its use for traditional high-performance computing jobs such as signal processing, weather forecasting, simulation/emulation and even network switching.
“There are interesting virtuous cycles in tech every so often like Moore’s law where you can shrink silicon and someone will pay you for it–every generation was hard, but you knew it was worth it,” Weber said.
“Wafer-scale integration may be similar. Its problems are hard but not impossible. And now with training, there’s a business reason to do it,” he said, adding that WSI “has been an area of huge interest to me for some time because of my background at Kendall Square Research working on massively parallel computers.”
That said, “AI training is not a niche app. We’re at the very beginning of what AI can do because it’s a general platform. I’m very bullish AI is a compute paradigm, not an application,” Weber said.
In that regard “Cerebras is the most interesting of the many startups I’m involved with because it’s both a heck of an AI machine and a whole new way to build a computer,” he said.