AI chip startups are hot on the heels of GPU leader Nvidia. At the same time, there is also significant competition in data center inference...
New computing models such as machine learning and quantum computing are becoming more important for delivering cloud services. The most immediate computing change has been the rapid adoption of ML/AI for consumer and business applications. This new model requires processing vast amounts of data to develop usable information and, eventually, to build knowledge models. These models are rapidly growing in complexity, doubling every 3.5 months. At the same time, performance requirements for quick response are increasing.
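To put the 3.5-month doubling figure in perspective, the short sketch below compounds it over one and two years. It assumes the doubling period stays constant, which is a simplification; the function name growth_factor is illustrative only.

```python
def growth_factor(months: float, doubling_period_months: float = 3.5) -> float:
    """Multiplicative growth after `months`, assuming a constant doubling period."""
    return 2.0 ** (months / doubling_period_months)


# Compound the doubling rate over a few horizons.
for months in (3.5, 12, 24):
    print(f"{months:>4} months: {growth_factor(months):,.1f}x")
```

Under that assumption, complexity grows by roughly 10.8x in a year and by more than 100x in two years.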
AI hardware in the data center can be broken into two distinct processing phases: training and inference. Each has distinctly different characteristics in terms of compute and responsiveness. The training aspect requires the handling of extraordinarily large data sets to create processing models. This can take hours, days, or even weeks. Inference, on the other hand, is the use of those trained models to process individual inputs that are very time sensitive; the result may be required within a few milliseconds.
Because it requires large computational modeling, training is often performed in 32-bit floating point precision, but it is sometimes handled in lower precision, such as 16-bit floating point or the alternative Bfloat16, if the same level of accuracy can be maintained. Inference, on the other hand, is often performed using integer math for speed and lower power, and does not require the extensive dynamic range of floating point. Accelerators are therefore often specialized for one task or the other. Even where a chip can perform both training and inference, it is typically optimized for one of them.
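The difference between the two 16-bit formats is easiest to see with a small example. The sketch below uses only the Python standard library and emulates Bfloat16 by keeping the top 16 bits of a 32-bit float. That truncation is an assumption made for brevity; real hardware typically rounds to nearest. The helper names to_bfloat16 and to_float16 are illustrative, not part of any vendor API.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Emulate Bfloat16 by truncating a float32 encoding to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_float16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision; overflow becomes infinity."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


# A large value: Bfloat16 keeps the float32 exponent range, FP16 overflows.
print(to_bfloat16(1e5), to_float16(1e5))

# A value close to 1: FP16 has more mantissa bits, so it keeps more detail.
print(to_bfloat16(1.001), to_float16(1.001))
```

Bfloat16 trades mantissa bits for the full 8-bit exponent of 32-bit floating point, which is why it can stand in for FP32 in training where dynamic range matters more than fine precision.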
The more computationally intensive training side of AI is presently dominated by Nvidia GPUs. Until mid-May, when Nvidia released its Ampere A100 powerhouse, the company's flagship was its Tesla V100 GPU. Companies positioning themselves to challenge Nvidia include Cerebras, Graphcore and Intel/Habana Labs.
The more time-sensitive and less computationally intensive inference side is more contested. In fact, some hyperscale cloud providers are building their own inference solutions. The classic example is the Google Tensor Processing Unit, initially designed for low-power, responsive inference. Once again, Nvidia has a significant offering here in the form of the Tesla T4, but there is also significant competition in data center inference. Currently, the vast majority of inference processing is still performed by CPUs (mostly Intel Xeons).
To help solve the benchmarking issues, a group called MLPerf was formed to develop up-to-date benchmarks. The group includes a mix of both academia and industry players from startups and established companies. The industry is aligning behind this benchmark, but the training and inference benchmarks are still a work in progress (presently at revision 0.6). MLPerf will require more industry input and will evolve over time, but some early results are available.
The training results are dominated by Nvidia and Google, but more results are expected later this year. One limitation is that these results do not factor in power or cost, which makes the performance calculations more complex.
Data Center Training
Nvidia’s Tesla V100 is a massive chip that represents the peak of GPU acceleration for AI training. In addition to the individual chips, Nvidia has the NVLink interface that allows multiple GPUs to be networked together to form a larger virtual GPU.
Given Nvidia’s early lead in training as well as its extremely mature software stack, some startups have been reluctant to take on the company directly. But others have been much more willing to go head-to-head with the GPU leader.
Among them is Graphcore. Its Intelligence Processing Unit (IPU) has shown strong power and performance benchmark numbers. The IPU has a massive on-chip memory to avoid copying data on and off the chip to local DRAM. But the IPU is heavily reliant on the Poplar compiler to schedule resources for maximum performance. This is an ongoing challenge for many AI chip vendors that have offloaded the control complexity to software in order to make the hardware simpler and more power efficient. While it’s relatively straightforward to tune the compilers for well-known benchmarks, the real challenge is to work with each customer to optimize the solution for their specific data set and workload.
Elsewhere, Intel has been acquiring AI chip startups over the past few years, including the recently completed purchase of another data center training and inference vendor. Intel’s first deal, for Nervana, was slow to yield a new product. Meanwhile, Israeli startup Habana Labs made significant inroads with hyperscale customers like Facebook. Intel eventually acquired Habana Labs to help get to market quicker. After the Habana acquisition, Intel discontinued development of Nervana chips. While the initial Habana Goya chip, introduced in 2018, focused on inference, the company has recently started shipping the Gaudi training chip to customers.
Like Nervana, the Gaudi training chip has support for the BFLOAT16 data format. It is also available in the Open Compute Project (OCP) accelerator module physical format. Gaudi’s chip-to-chip interconnect uses standard RDMA RoCE over PCIe 4.0, while the Nervana chips use a proprietary interconnect.
The Habana Gaudi processor and HLS-1 system go head-to-head against Nvidia’s V100-based cards and Nvidia’s DGX rack systems. The Habana HLS-1 uses a PCIe switch to connect multiple Gaudi processors in the HLS-1 rack versus the proprietary NVLink bus used by Nvidia. The key to Habana’s performance is the RoCE v2 interface using ten 100G Ethernet links through a non-blocking Ethernet switch.
The Gaudi cards are rated at 300W maximum power for a mezzanine card and 200W for a PCIe card. Habana has yet to release MLPerf training benchmarks, but the ResNet-50 comparisons look very competitive with Nvidia.
Cerebras, another challenger, has built the ultimate training and inference chip: a single wafer-scale processor. The solution is an entire wafer (46,225 mm²) with 1.2 trillion transistors and 18 GB of on-chip memory. The Cerebras wafer-chip system draws 20,000 watts, putting it in a category all its own. Performance numbers are not yet available.
AMD remains a dark horse. While its GPUs can be used for training, the company has not optimized the architecture for tensor processing and its software stack is behind Nvidia’s CUDA. The company’s two recent design wins with the Department of Energy’s exascale computers should help it build out a more robust software stack, but machine learning seems to be a lower priority for AMD.
While each chip’s performance is important for training, the ability to scale for larger models is also a critical feature. Scaling often requires interconnecting multiple chips through a high-speed link (that is, unless you are Cerebras).
The newest players are Groq and Tenstorrent, both shipping early samples. Their chips are highly configurable and, like Graphcore, highly dependent on the compiler software to deliver performance.
Data Center Inference
The needs of machine learning inference are quite different from training. While batching many jobs is common for training, inference performance is often judged by latency with low batch numbers. It’s critical that each new inquiry be addressed quickly and accurately. Also, inference data is lower resolution and neural nets are shorter. The weights for inference are provided by the model developed in the training phase and optimized for the inference workload. Here, packed 8-bit integer performance is often key for standard workloads. Optimized compute values can be reduced even further and may be 6-, 4-, 2- or even 1-bit precision as long as the accuracy is not reduced.
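As a rough illustration of how trained floating point weights end up as 8-bit integers, the sketch below applies symmetric per-tensor quantization, one common scheme among several. The weight values and the function names quantize_int8 and dequantize are hypothetical; production toolchains add calibration data, per-channel scales and accuracy checks.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical trained weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # integer codes stored and multiplied at inference time
print(max(abs(w - r) for w, r in zip(weights, restored)))  # worst-case error
```

The worst-case rounding error is half of one scale step, which is why small weights such as 0.003 collapse to zero here. Whether that loss is acceptable depends on the model, hence the caveat about maintaining accuracy.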
The leading Nvidia solution is the Tesla T4, available in half-height, 70W PCIe cards that can fit into a 2U rack chassis. The T4 is based on Nvidia’s Turing architecture, a scaled-down relative of the Volta architecture used in the V100. The T4 supports a variety of precision levels with different performance levels, including 8.1 TFLOPS using single-precision floating point, 65 TFLOPS using mixed-precision (FP16/FP32), 130 TOPS using 8-bit integer, and 260 TOPS using 4-bit integer. The T4 offers the flexibility to handle a variety of workloads.
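The T4 figures above make the case for low-precision inference on their own. The sketch below divides each quoted peak rate by the 70W card rating and compares it with the single-precision figure. These are peak ratings rather than measured results, and treating the full 70W as the power draw in every mode is an assumption.

```python
T4_POWER_W = 70  # card power rating quoted above

# Peak throughput by precision, in TFLOPS or TOPS, as quoted above.
T4_PEAK = {
    "FP32": 8.1,
    "FP16/FP32 mixed": 65,
    "INT8": 130,
    "INT4": 260,
}

per_watt = {mode: rate / T4_POWER_W for mode, rate in T4_PEAK.items()}
vs_fp32 = {mode: rate / T4_PEAK["FP32"] for mode, rate in T4_PEAK.items()}

for mode in T4_PEAK:
    print(f"{mode:<16} {per_watt[mode]:5.2f} per watt  {vs_fp32[mode]:5.1f}x FP32")
```

On paper, dropping from single precision to 8-bit integer buys roughly a 16x gain in peak throughput within the same power envelope.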
The Intel/Habana Goya chip has proven very competitive on performance and power. Goya’s results were strong in version 0.5 of the MLPerf benchmark and the chip has higher ResNet-50 results, but it also draws over 100W. Goya is also flexible on data formats, and with Intel behind the chip, customers are no longer betting on a small startup as a supplier.
The most recent entrant in the inference race was the surprise resurgence of Via/Centaur and its x86 server chip with an ML co-processor. The CHA chip showed significant potential in last year’s MLPerf open category, but the company is still developing its software stack.
Qualcomm, meanwhile, announced the Cloud AI 100 chip last year with production expected this year. Few details are available, but what is known is that the design uses an array of AI processor and memory tiles that Qualcomm says are scalable from mobile to data center. The chip will support the latest LPDDR DRAM for low power and is rated for 350 TOPS (8-bit integer values). Qualcomm is targeting automotive, 5G infrastructure, 5G edge and data center inference for the chip. The company is leveraging its extensive experience in low-power inference from its Snapdragon smartphone processors.
FPGAs can also make good inference engines as well as run high-performance computing tasks, offering low latency and flexible ML model support. Microsoft has for several years been using Intel’s Altera FPGAs for accelerating text string searches. Intel’s Vision Accelerator Design includes its Arria 10 FPGA PCIe card and software support in the OpenVINO toolkit. Xilinx has an Alveo line of PCIe cards for data centers supported by its Vitis software platform. The Alveo cards are offered with power from 100 to 225W. Both companies are working to make their software and programming tools more approachable.
Following Altera and Xilinx, Achronix also brought its FPGA technology to accelerated data center computing. Its new 7-nm Speedster7t product ships in the VectorPath PCIe accelerator card. It is also available for IP licensing, setting the company apart from its rivals. The company has focused on fast I/O with PCIe Gen 5 support and SerDes speeds up to 112 Gbps. Machine learning inference performance will exceed 80 TOPS using INT8 data processing. The chip will also support INT16, INT4, FP24, FP16, and BFloat16. Achronix says the Speedster7t delivers up to 86 TOPS of INT8 performance and ResNet-50 performance of 8,600 images per second.
Also entering the very-low-power inference market is Flex Logix, with its InferX X1 edge inference coprocessor optimized for INT8 operation. Fast (batch-1) ResNet-50 inference at only 13.5W makes the InferX the likely low-power leader.
Proceed With Caution
Nvidia remains the leader in training accelerators but faces increasing competition. The data center inference market is becoming more fragmented as more competitors enter the fray. Later this summer, MLPerf should release the next iteration of its benchmarks, and more results from the newer chips will become available.
Proceed with caution, though: results can be tuned to perform well on specific benchmarks but deliver disappointing results on a wider variety of workloads.
–Kevin Krewell is principal analyst at Tirias Research