Nvidia Reinvents GPU For AI and Data Centers

Article By : Sally Ward-Foxton

Huang’s postponed keynote reveals powerful Ampere architecture for data centers and HPC.

Jensen Huang’s much-anticipated keynote, postponed from Nvidia’s GPU Technology Conference (GTC) in March, unveils the company’s eighth-generation GPU architecture. Emerging three years after the debut of the previous-generation Volta architecture, Ampere is said to be the biggest generational leap in the company’s history.

Ampere is built to accelerate both AI training and inference, as well as data analytics, scientific computing and cloud graphics.

The first chip built on Ampere, the A100, has some impressive vital statistics. Powered by 54 billion transistors, it’s the world’s largest 7nm chip, according to Nvidia, delivering more than one peta-operation per second. Nvidia claims the A100 has 20x the performance of the equivalent Volta device for both AI training (single-precision, 32-bit floating-point numbers) and AI inference (8-bit integer numbers). The same device used for high-performance scientific computing can beat Volta’s performance by 2.5x (for double-precision, 64-bit numbers).

Nvidia DGX-A100 systems are installed at the Argonne National Laboratory, where they are being used in the fight against Covid-19 (Image: Argonne National Laboratory)

Hundred billion-dollar industry
In a press pre-briefing ahead of Huang’s keynote today, Paresh Kharya, director of product management for accelerated computing at Nvidia, said that the cloud is the biggest growth opportunity for the computer industry; it’s a hundred-billion-dollar industry growing 40% per year.

“To advance this industry, what if we were able to create a data center architecture so it not only increases the throughput to scale up and scale out applications to meet their insatiable demands, but it’s fungible to adapt as the [workloads] are changing throughout the day. To create this fungible data center, we have to reimagine our GPU,” Kharya said.

Data center infrastructure has become fragmented because of the diversity of applications: operators run compute clusters of many different types, including hardware for AI training separate from AI inference servers. The mix of compute types needed is difficult to predict, as demand for applications varies throughout the day.

“Yes, we have the ownership of the circle-R registered trademark. They [Nvidia] decided to use that name for a one-generation product … I guess everyone will be talking about us 🙂”

— Renée James, CEO, Ampere Computing

“It’s impossible to optimize [today’s] data center for high utilization, or so the costs can be down and servers are running applications all the time,” Kharya said. “[Ampere]…unifies AI training and inference acceleration into one architecture. It provides massive scalability, so a single server can scale up as one giant GPU or scale out as 50 different independent accelerators. Our next generation GPU enables this flexible, elastic, universal acceleration, something that we’ve been seeking for a long time for multiple generations.”

Key innovations
Ampere is built on several new key technologies.

Nvidia invented a new number format for AI, Tensor Float 32 (TF32), supported by its third-generation Tensor Cores. For AI acceleration, working with the smallest possible number of bits is desirable, since fewer bits make computation and data movement more efficient, but this is traded off against the accuracy of the final result. TF32 aims to strike this balance by combining the 10-bit mantissa (which determines precision) of half-precision format (FP16) with the 8-bit exponent (which determines the range of numbers that can be expressed) of single-precision format (FP32).
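The effect of keeping FP32’s exponent but only 10 mantissa bits can be sketched in software. The snippet below is an illustrative approximation only (the function name is ours, and it truncates rather than rounds, whereas the Tensor Cores do the conversion in hardware); it shows how an FP32 value loses precision, but not range, when reduced to TF32’s 10-bit mantissa.

```python
import struct

def round_to_tf32(x: float) -> float:
    """Approximate TF32 by zeroing the low 13 of FP32's 23 mantissa
    bits, leaving the 10 mantissa bits TF32 keeps. The sign and the
    full 8-bit exponent are untouched, so the representable range is
    unchanged -- only precision is reduced. (Illustrative sketch;
    real hardware rounds rather than truncates.)"""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= ~((1 << 13) - 1)  # clear the 13 discarded mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

Because the exponent field is identical to FP32’s, any value that fits in single precision also fits in TF32, which is why developers can feed in FP32 tensors and read back FP32 results unchanged.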

“With this new precision, A100 offers 20 times more compute for single-precision AI, and because developers can continue to use the inputs as single-precision and get outputs back as single-precision, they do not need to do anything differently. They benefit from this acceleration automatically out of the box,” Kharya said.

The Tensor Cores now also natively support double-precision (FP64) numbers, which more than doubles performance for HPC applications.

The A100 GPU supports Nvidia’s new number format, Tensor Float 32 (Image: Nvidia)

Nvidia is also exploiting a property of neural networks, called sparsity, to shrink them and thereby double performance. Sparsity is a well-known phenomenon: many weights, or branches, of a network have little to no effect on the outcome and can be ignored, but exploiting this reliably is complex. Nvidia claims it has done so, with results that are both predictable and consistent. The speedup mainly benefits inference, but will also have an effect on training.
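The structured form of sparsity Nvidia describes for Ampere keeps two non-zero values in every group of four weights. A minimal sketch of that pruning step, in NumPy (the function name is ours, and real deployments also fine-tune the network afterwards to recover accuracy):

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Illustrative 2:4 structured pruning: in each group of four
    consecutive weights, keep the two with the largest magnitude and
    zero the other two. The result is 50% sparse in a fixed pattern
    the hardware can skip over, which is where the 2x comes from."""
    w = weights.reshape(-1, 4).copy()
    # column indices of the two smallest-magnitude entries per group
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)
```

The fixed 2-in-4 pattern is the key design choice: unstructured sparsity saves the same storage but leaves the hardware guessing where the zeros are, while a regular pattern lets the multiply units skip them deterministically.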

Another new capability, multi-instance GPU (MIG), allows multiple applications to run on the same GPU without contending for resources such as memory bandwidth. Previously, an individual application could hog the memory bandwidth, reducing performance for the others. With MIG, each GPU instance gets its own dedicated resources, including compute cores, memory, cache and memory bandwidth. Each A100 can be partitioned into up to seven instances of varying size, which operate completely independently. This flexibility is attractive for data centers’ variable workloads, which often comprise many jobs of different sizes.
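One way to picture the “up to seven pieces of varying size” claim is to enumerate the ways seven compute slices can be divided up. The sketch below assumes instances come in sizes of 1, 2, 3, 4 or 7 slices (per Nvidia’s published MIG profiles for A100); the function name is ours, and real hardware imposes stricter placement rules than this simple sum check.

```python
from itertools import combinations_with_replacement

# Assumed instance sizes, in compute slices, per Nvidia's A100 MIG
# profiles; note there is no 5- or 6-slice profile.
SLICE_SIZES = (1, 2, 3, 4, 7)

def valid_partitions(total: int = 7):
    """Enumerate multisets of instance sizes whose slices fit on one
    GPU. Illustrative only: ignores memory-slice pairing and the
    placement constraints real MIG configuration enforces."""
    out = set()
    for n in range(1, total + 1):
        for combo in combinations_with_replacement(SLICE_SIZES, n):
            if sum(combo) <= total:
                out.add(combo)
    return sorted(out)
```

Running this shows both extremes the article mentions: one instance occupying the whole GPU, or seven independent single-slice instances.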

In terms of delivered performance, Nvidia benchmarked the A100 against its leading Volta GPU running BERT (a large transformer network for natural language processing which, with 350 million parameters, is 10x the size of ResNet-50). The A100 beat its predecessor’s speed by a factor of 7. Further, if the A100 is partitioned as described above, it can run seven BERT networks concurrently, in real time.

Nvidia has built eight of these A100s into a cloud accelerator system it calls the DGX-A100, which offers 5 petaflops. Each DGX-A100 offers 20x the peak performance of previous-generation DGX systems. Unlike its predecessors, the new system can be used not only for AI training but also for scale-up applications (data analytics) and scale-out applications (inference). Nvidia’s figures have a single rack of five DGX-A100s replacing 25 data center racks of CPUs, consuming 1/20th of the power and costing a tenth of the capex of an equivalent CPU-based system.

Each DGX-A100 system offers 5 petaflops, a big computing milestone (Image: Nvidia)

A reference design for a cluster of 140 DGX-A100 systems with Mellanox HDR 200Gbps InfiniBand interconnects, the DGX SuperPOD, can achieve 700 petaflops for AI workloads. Nvidia has built a DGX SuperPOD into its own Saturn-V supercomputer, standing the system up from scratch within three weeks. Saturn-V now has nearly 5 exaflops of AI compute, making it, Nvidia says, the fastest AI supercomputer in the world.

DGX-A100 systems are already installed at the US Department of Energy’s Argonne National Laboratory where they are being used to understand and fight Covid-19.
