LONDON – Following the launch of its AI inference chip last year, Habana Labs (Tel-Aviv, Israel) has unveiled an AI training chip built on the same architecture. The company claims the new chip outpaces incumbent GPU technology by a substantial margin, and it features on-chip RoCE (remote direct memory access over Converged Ethernet) communications for scalability.

While the company’s inference chip, Goya, set records for ResNet-50 inference back in September 2018, the new training chip, Gaudi, delivers similarly high performance. Gaudi can process 1,650 images per second at a batch size of 64 when training a ResNet-50 network, which Habana claims is a new world record for this benchmark. This throughput is delivered at 140W power consumption, also a substantial advantage over competing solutions, according to the company.
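As a rough sanity check, the stated throughput and power figures imply a training efficiency of just under 12 images per second per watt (using only the numbers quoted above; no other specifications are assumed):

```python
# Efficiency implied by Habana's stated Gaudi figures for
# ResNet-50 training: 1,650 images/s at 140 W (batch size 64).
images_per_second = 1650
power_watts = 140

efficiency = images_per_second / power_watts  # images per second per watt
print(f"{efficiency:.1f} images/s per watt")  # -> 11.8 images/s per watt
```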

Impressive, but is Habana’s architecture designed specifically to beat the ResNet-50 benchmark, or will it offer similar throughput advantages for other types of neural networks?

“There is nothing in the architecture that limits it to be a ResNet-50 machine, not at all,” said Eitan Medina, Habana Labs’ Chief Business Officer. “A company like Facebook wouldn’t have spent the time to integrate Goya [into its Glow machine learning compiler] if it was just a ResNet-50 machine… Our customers are implementing Goya in anything from vision processing to recommendation systems – many, many types of applications.”

Core Architecture

The new Gaudi training chip joins the Goya inference chip in the Habana portfolio. Like Goya, Gaudi has eight VLIW SIMD (very long instruction word, single instruction multiple data) vector processor cores, which Habana calls tensor processor cores (TPCs), specially designed for AI workloads. The chips differ in the data types they support: while both are mixed-precision designs, Goya focuses on integer multipliers, whereas Gaudi places greater emphasis on more precise data formats such as BF16 and FP32.

The high-level architecture of Habana Labs’ Gaudi processor. (Source: Habana Labs)

“The TPC core itself is also different. By now this is a second generation of the TPC core, a VLIW machine designed from scratch with our instruction set,” said Medina. “The training chip also has a different type of memory. With Goya, we have DDR4 memory, with Gaudi, we have four HBM2 (high bandwidth memory, second generation) memories. So, it’s a different balance of throughput and on-chip memory compared to Goya.”

Gaudi comes on an OCP (Open Compute Project) Accelerator Module-compatible mezzanine card with 32GB of HBM2 memory (HL-205), or in an eight-card supercomputer box for datacentres (HLS-1). While the exact amount of on-chip memory wasn’t disclosed – other than describing it as substantial – Medina said, “The training solution has an incredible amount of throughput to the HBM2s, so we are not that sensitive to on-chip memory size, and we designed the specialised memory controller to deliver 1 TB/s of throughput, that’s very high throughput for any scale of processor.”

 

Habana Labs HLS-1 system combines eight Gaudi accelerator cards. (Source: Habana Labs)

Habana’s software stack, SynapseAI, directly interfaces with deep learning frameworks such as TensorFlow, PyTorch, and Caffe2. There is also a complete programming toolchain for the TPC.  

Connectivity and Scalability

Aside from raw performance, another important consideration for AI training processors is scalability. AI accelerators are deployed in large numbers in training farms, with many devices collaborating on training the same neural network. Nvidia moved to acquire networking IC vendor Mellanox, along with its RoCE technology, earlier this year, in part to address the communications bottleneck inherent to the distributed computing systems used for deep learning.

“Our position is that [Nvidia CEO Jensen Huang] is absolutely right, RoCE is a perfect solution for scaling AI,” said Medina.

Gaudi integrates on-chip RoCE, with 10 ports of 100 Gigabit Ethernet directly on the processor silicon, a feature Habana claims is unique among AI acceleration processors, since competing solutions usually need additional chips for connectivity. In a system like the HLS-1, some of these 10 on-chip ports can be allocated to non-blocking all-to-all connectivity between the individual Gaudi processors.

“By providing this level of integration of RoCE, we really unleash the customer’s ability to design systems with unlimited scale, from small systems to very large systems…really giving them the headroom when they invest in AI training acceleration,” continued Medina.

To illustrate the flexibility of Gaudi system design, Medina described how the ports may also be configured as 20 ports of 50 Gigabit Ethernet, allowing 16 Gaudis to be connected with no extra components, or combined with off-the-shelf Ethernet switches to create the hierarchical structures used for massive data parallelism. Ethernet switches may also be used to connect 64 Gaudis within one networking hop, suitable for the emerging technique of model-parallel training, which requires huge bandwidth.
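The arithmetic behind these configurations can be sketched from the per-chip figures quoted above (10 × 100GbE, reconfigurable as 20 × 50GbE). The port split for the eight-card all-to-all case is an illustrative inference from the topology, not a disclosed specification:

```python
# Per-Gaudi connectivity as described in the article: 10 x 100 GbE
# on the processor silicon, reconfigurable as 20 x 50 GbE.
PORTS_100G = 10
PORTS_50G = 20

# Both configurations carry the same aggregate bandwidth per chip.
aggregate_tbps = PORTS_100G * 100 / 1000  # 1.0 Tb/s
assert aggregate_tbps == PORTS_50G * 50 / 1000

# In an eight-card box like the HLS-1, non-blocking all-to-all
# connectivity needs one link from each Gaudi to each of its peers.
cards = 8
peer_links_per_card = cards - 1                  # 7 ports used internally
spare_ports = PORTS_100G - peer_links_per_card   # ports left for scale-out

print(f"{aggregate_tbps} Tb/s per chip; {peer_links_per_card} peer links, "
      f"{spare_ports} ports free for external connectivity")
```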

Block diagram of Habana Labs’ HLS-1 system. (Source: Habana Labs)

“Maybe the most important thing is giving [customers] a standards-based approach to scaling their AI, [so they] avoid the lock-in that comes when they buy into an architecture that has a proprietary interface,” said Medina. “Most of the solutions out there today from both market leaders as well as startups are promoting proprietary system interfaces. We have the philosophy that if someone comes along that’s better than us, you should be able to take us out and swap in the other competitor processing solution. And we can’t control all your infrastructure.”

The Competition

Habana Labs has raised $120 million so far, with its last round led by Intel Capital. More than 140 employees and contractors are based in Israel, Poland, California, and China.

It’s clear that Habana is going after Nvidia, and fast. Habana’s presentation heavily featured direct comparisons between Gaudi and Nvidia’s market-leading V100 GPU, and between Habana’s HLS-1 and Nvidia’s DGX-1.

While Nvidia is the market leader today, many of the big data center companies are also developing their own AI accelerators. How does Habana hope to compete with customers’ homegrown devices?

“The answer is simple – we have to be better,” said Medina. “Economies of scale – with us being able to sell to multiple customers – are strategically on our side. To justify investment of hundreds of millions of dollars, just for the amount [of chips] you consume internally, you have to be a gigantic company with very deep pockets, and also you need a strategic reason why you believe it’s important to have that control. You could ask, why is there an x86 business? For the same reason, if you perform, they will grant you the business, because at the end of the day they need to compete with their peers in their industry, and they cannot afford not to use the best technology.”

Gaudi devices will start sampling to select customers in the second half of this year.