The two have pulled away from would-be rivals, but at what cost?
More than forty companies and eight research institutions from the nascent artificial intelligence (AI) industry have defined a set of standardized benchmarks, called MLPerf, to enable comparisons of the various chips used to accelerate machine learning (ML) training and inference. In the second slate of training results (v0.6), released today, both Nvidia and Google demonstrated that they can cut the compute time needed to train the deep neural networks behind common AI applications from days to hours.
However, the cost of delivering these impressive results remains mind-boggling: the Nvidia DGX-2H SuperPOD used to perform these training jobs has an estimated retail price of some $38 million. Consequently, Google seeks to exploit its advantage as the only major public cloud provider to deliver AI supercomputing as a service to researchers and AI developers, using its in-house Tensor Processing Units (TPUs) as the alternative to Nvidia GPUs.
The new results are truly impressive. Nvidia and Google each claim the #1 spot in three of the six “Max Scale” benchmarks. Nvidia reduced its run times dramatically (by up to 80%) using the same V100 Tensor Core accelerator in the DGX-2H building block. Many silicon startups are now probably explaining to their investors why their anticipated performance advantage over Nvidia has suddenly diminished, thanks to Nvidia’s software prowess and ecosystem.
The first question, of course, is “where is everyone else?” While more than 40 companies around the world are developing AI-specific accelerators, most are building chips for inference rather than for model training, the segment of this multi-billion-dollar market where Nvidia enjoys a massive share. For those companies, the MLPerf organization plans to release inference results in early September, just before the second AI HW Summit event in Silicon Valley. Even for companies building silicon for training, the staggering cost of competing in this marathon will preclude most if not all startups from participating. Intel should be in the mix, however, once it finishes development of its highly anticipated Nervana NNP-T later this year.
So, who “won,” and does it matter? Because the companies ran the benchmarks on massive configurations chosen to deliver the shortest possible training time, being #1 may mean that the team was able to gang together over a thousand accelerators to train the network, a herculean software endeavor. Both companies also sell 16-chip configurations and submitted results for them to MLPerf, so I have provided those as well, as a figure of normalized performance.
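For readers who want to run their own comparison, here is a minimal Python sketch of one plausible way to normalize results: compare time-to-train only between submissions that use the same number of chips. The `Submission` class, the function names, and the numbers are all assumptions made for illustration; the values are placeholders, not MLPerf results, and this is not necessarily the method behind the figure mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    """One benchmark entry (hypothetical structure, not an MLPerf schema)."""
    vendor: str
    chips: int
    minutes_to_train: float  # wall-clock time to reach the target quality


def relative_speed(a: Submission, b: Submission) -> float:
    """Return how many times faster `a` trains than `b` at equal chip count."""
    if a.chips != b.chips:
        raise ValueError("compare equal-sized configurations only")
    return b.minutes_to_train / a.minutes_to_train


def chip_minutes(s: Submission) -> float:
    """Total accelerator time consumed: a rough proxy for cost, not price."""
    return s.chips * s.minutes_to_train


# Placeholder values for illustration only; these are NOT published results.
vendor_a = Submission("vendor_a", chips=16, minutes_to_train=40.0)
vendor_b = Submission("vendor_b", chips=16, minutes_to_train=50.0)

print(relative_speed(vendor_a, vendor_b))  # ratio of b's time to a's time
print(chip_minutes(vendor_a))              # chips multiplied by minutes
```

Fixing the chip count removes the "who could gang together the most accelerators" effect from the comparison; chip-minutes is a complementary view for entries of different sizes, though it ignores the price differences between chips.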
I find it interesting that Nvidia’s best absolute performance is on the more complex neural network models (reinforcement learning and heavyweight object detection with Mask R-CNN), perhaps showing that its hardware programmability and flexibility help it keep pace with the development of newer, deeper, and more complex models. I would also note that Google has wisely decided to cast a wider net to capture TPU users, and is now working to support the popular PyTorch AI framework in addition to Google’s own TensorFlow tool set. This will remove one of the two largest barriers to adoption, the other being the exclusivity of the TPU to the Google Cloud Platform (GCP).
In the end, the answer to the “does it matter” question may be simply to look at the deployment model. Google’s TPU is a force to be reckoned with for public-cloud-hosted training in GCP, while Nvidia continues to provide excellent performance for in-house infrastructure and non-GCP public cloud services, where its flexibility helps cloud providers amortize their costs over a very wide range of workloads (>500). I would say that Google’s TPU continues to improve with the now-beta TPU v3 Pod, and Nvidia remains able to hold its broad leadership position.
–Karl Freund, senior analyst, machine learning and HPC, Moor Insights & Strategy