MLPerf training benchmark scores reveal there are two contenders for world’s fastest AI supercomputer. Nvidia’s Selene uses 2048 A100 chips, while Google’s TPU v3 supercomputer uses 4096 devices...
The third round of MLPerf training benchmark scores, covering eight different AI models, is out, with rivals Nvidia and Google both staking a claim to the crown.
While both companies claimed victory, the results bear further scrutiny. Scores are based on systems, not individual accelerator chips. While Nvidia swept the board for commercially available systems with its Ampere A100-based supercomputer, Google’s massive TPU v3 system and smaller TPU v4 systems, which it entered under the research category, make the search giant a strong contender.
Nvidia took first place in normalized results for all benchmarks in the commercially available systems category with its A100-based systems. Nvidia accelerators dominated the benchmark result submissions as a whole, with submissions from Nvidia as well as third-party system builders such as Fujitsu, Inspur, Dell, Tencent and Alibaba making up the majority of the list.
Google claimed the number one fastest training time with its TPU v3 based supercomputer for several benchmarks, but its system has twice as many accelerator chips as Nvidia’s large-scale offering. Google also gave us a hint as to what TPU v4 will be capable of, with results 2-3x the performance of TPU v3 scores from the previous round.
Benchmark background
Benchmark submissions measure the time to train to the required accuracy for any of eight different models. Selection of the models is designed to reflect customer AI workloads today as far as possible, including image classification, two versions of object detection and two versions of translation (one recurrent, one not).
New to this round is BERT (bidirectional encoder representation from transformers), a natural language processing model which is used extensively as a building block for conversation, translation, search and text generation. Another addition is DLRM (deep learning recommendation model), a model widely used for online shopping websites, social media and search results.
The Mini-Go benchmark from previous rounds has been beefed up; it now uses the full-size 19×19 Go board. This benchmark is the most difficult because it relies on reinforcement learning. That is, it isn’t fed a training dataset – the system learns by doing inference (playing the game of Go against itself), thereby creating its own training data as it goes.
Submissions are per system (not per chip) and vary in scale from 4x CPUs without any accelerators to Google’s 4096-accelerator supercomputer. There are three categories: commercially available, preview, and R&D. Systems classed as commercially available have to prove all their hardware and software is on the market and in use by third parties. Preview systems are not on the market yet, but will be within six months (or by the next round of benchmarks) – submitters have to commit to submitting the same or improved results for the same systems in the next round in the commercially available category, or face disqualification. R&D submissions don’t have to meet any availability criteria.
Nvidia claimed first place across all the benchmarks for commercially available systems, both in terms of best system performance, and best performance normalized by number of accelerator chips. Since Nvidia’s flagship Ampere architecture was announced a couple of months ago, all the company’s benchmark submissions using the company’s state-of-the-art Ampere A100 chips fall under the commercially available category. The company submitted scores across all of the benchmarks, many more than any challenger.
Nvidia submitted scores from its biggest system, the newly minted DGX-SuperPod supercomputer, AKA ‘Selene’. This is the fastest commercial system for AI in the United States, according to Nvidia, comprising 2048 Ampere A100 chips and offering more than 1 Exaflops of AI compute.
“There are two ways to look at the performance records. First is the absolute fastest performance across any scale. Our DGX-SuperPod was able to train every model in just under 18 minutes,” said Paresh Kharya, senior director of product management for data center computing at Nvidia. “The second way [to look at it] is not every customer will run all applications at massive scale. So we looked at the normalized, per chip performance and here again, Nvidia A100 broke all the performance records in the commercially available systems category.”
How does Nvidia get such good results?
“The answer is really the relentless focus that we have on the full stack innovation, starting with software,” said Kharya. “We’ve been investing billions of dollars in our architecture, as well as in our software with our ecosystem. This performance is basically a result of all of those efforts coming to fruition.”
Meanwhile, Google is also claiming it has the ‘world’s fastest training supercomputer’, based on the same set of results.
In a company blogpost, Naveen Kumar, senior staff engineer at Google AI, writes that: “Four of the eight models were trained from scratch in under 30 seconds. To put that in perspective, consider that in 2015, it took more than three weeks to train one of these models on the most advanced hardware accelerator available. Google’s latest TPU supercomputer can train the same model almost five orders of magnitude faster just five years later.”
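The "almost five orders of magnitude" claim can be sanity-checked with a few lines of arithmetic. The sketch below assumes exactly three weeks and exactly 30 seconds; since the quote says "more than" three weeks and "under" 30 seconds, the result is a lower bound on the real speedup.

```python
import math

# Figures taken from the quote above; both are bounds, so the
# computed ratio is a lower bound on the actual speedup.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

old_seconds = 3 * SECONDS_PER_WEEK  # "more than three weeks" in 2015
new_seconds = 30                    # "under 30 seconds" today

speedup = old_seconds / new_seconds
orders_of_magnitude = math.log10(speedup)

print(f"speedup: at least {speedup:,.0f}x")               # at least 60,480x
print(f"orders of magnitude: {orders_of_magnitude:.2f}")  # 4.78
```

A ratio of roughly 60,000x is about 4.8 orders of magnitude, which is consistent with the "almost five" in the quote.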
This supercomputer is four times the size of the system Google used in the last round of benchmarks, boasting 4096 TPU v3 chips. However, Google has only submitted results into the R&D category for this system, meaning it is not commercially available yet.
Google’s TPU v3 supercomputer beat Nvidia’s pride and joy, Selene, on four of the benchmarks, enabling it to claim that its system is the fastest in the world. However, it’s twice the size of Selene, and when the results are normalized to per-chip, the TPU v3 results don’t stack up (see under ‘Who wins?’ below). All this is really telling us is which system is bigger; not which is winning on compute efficiency.
Google also previewed some results from the next generation of TPU, v4, and gave us a few details.
“Google’s fourth-generation TPU ASIC offers more than double the matrix multiplication TFLOPs of TPU v3, a significant boost in memory bandwidth, and advances in interconnect technology,” wrote Kumar in his blog. “Google’s TPU v4 MLPerf submissions take advantage of these new hardware features with complementary compiler and modeling advances. The results demonstrate an average improvement of 2.7 times over TPU v3 performance at a similar scale in the last MLPerf Training competition.”
Nvidia presented a chart of results normalized by system size (number of accelerator chips), with the performance of its previous-generation V100 set to 1. The chart covers devices from all the categories (commercially available, preview and R&D). As well as Nvidia and Google, it includes one result for the Huawei Ascend 910 and three from Intel’s upcoming Cooper Lake CPU. Cooper Lake did best on the Minigo reinforcement learning benchmark, relatively speaking, at about half the performance of the V100 – though EE Times suspects this is because the others just found Minigo much harder to accelerate than the other models.
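To see why the fastest system and the most efficient chip can be different things, here is a minimal sketch of per-chip normalization. It assumes performance scales as 1 / (training time × chip count), which may not be the exact convention Nvidia used, and every number in it is hypothetical rather than an MLPerf result.

```python
def per_chip_score(minutes, chips, base_minutes, base_chips):
    """Speedup over a baseline after dividing out system size.

    Assumes performance is 1 / (training time x chip count); this is an
    illustrative convention, not necessarily the one behind the chart.
    """
    return (base_minutes * base_chips) / (minutes * chips)


# Hypothetical figures for illustration only; NOT MLPerf results.
baseline = {"minutes": 5.0, "chips": 1024}
systems = {
    "system_a": {"minutes": 1.0, "chips": 2048},
    "system_b": {"minutes": 0.8, "chips": 4096},
}

scores = {
    name: per_chip_score(
        s["minutes"], s["chips"], baseline["minutes"], baseline["chips"]
    )
    for name, s in systems.items()
}

# Lowest wall-clock time wins on absolute speed...
fastest = min(systems, key=lambda name: systems[name]["minutes"])
# ...but the highest normalized score wins on per-chip efficiency.
most_efficient = max(scores, key=scores.get)

print(f"fastest overall: {fastest}")
print(f"best per chip:   {most_efficient}")
for name, score in scores.items():
    print(f"{name}: {score:.2f}x baseline per chip")
```

In this made-up case the larger system finishes first, but the smaller one gets more work out of each chip, which is the distinction the two vendors' competing claims turn on.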
Who you consider the eventual winner depends on which categories you include. Counting commercially available systems only, Nvidia’s A100 essentially destroyed its few challengers. If you consider preview and R&D systems too, on per-chip results Google TPU v4 beats Nvidia on three out of the eight models. In EE Times’ book, this is still not enough to claim an overall win.
Overall, there is still a frustrating lack of submissions from companies other than Google, Nvidia, Intel and Huawei. There were still no results submitted from vocal challengers such as Graphcore and Cerebras. Submitting companies are also extremely reticent about which results they submit, for fear of coming in second place; very few figures from similar-scale systems end up in the same column as each other (hence all the normalization that goes on in analysis). The scores, which are intended to provide ‘apples to apples’ comparisons, are therefore not as useful as they could be for comparing different accelerator architectures.
“The reason for [the low number of submissions] is while there are a lot of companies that are working on creating custom silicon as well as touting their performance, AI is really hard,” said Nvidia’s Paresh Kharya. “Delivering exceptional performance on AI is really hard, and it requires a lot more than custom silicon. It takes a lot of software as well as a broad ecosystem that works together to innovate and create a full stack that makes AI performance possible. At Nvidia, we’ve invested billions of dollars, and we’ve been working on this problem for almost a decade.”
On the other hand, the scores do provide an overview of how far the AI accelerator industry has come in just a few short years. Compared to the previous round six months ago, the fastest results for the five unchanged benchmarks improved by an average of 2.7x. And it’s certainly true that both leaders in this market are innovating at pace. Google’s TPU v4 was on average 2.7x faster than TPU v3 was in the last round, and while Nvidia’s V100 had increased incrementally in previous rounds, the Ampere scores are between 2x and 4x faster.
The full table of MLPerf results scores can be viewed here.