A year after announcing its in-house designed AI accelerator chip, Amazon Web Services (AWS) is making instances based on its Inferentia chip available for customer workloads.

AWS’ customers across a diverse set of industries are moving beyond the experimental stage with machine learning, and are now scaling up ML workloads. They are therefore ready for the increase in performance and efficiency Inferentia will bring, the company said.

Andy Jassy, CEO of AWS, presents the company’s new offering for inference compute at Amazon’s Re:Invent conference (Image: AWS)

Andy Jassy, CEO of AWS, pointed out in his keynote at AWS’ Re:Invent conference last week that for machine learning systems at scale, 80-90% of the compute cost is in inference.

“We’ve talked a lot as a group about training for machine learning, it gets a lot of the attention. They are hefty loads,” he said. “But if you do a lot of machine learning at scale, and in production like we have, you know that the majority of your cost is actually in the predictions or in the inference.”

Using Alexa’s sizeable model as an example, he compared the compute required for training, which happens twice a week, with the compute required to run inference on every request made to Alexa from every device in the world. Lowering the cost of inference compute for customers is therefore a priority, he said.

AWS is making its EC2 Inf1 instances, which include up to 16 Inferentia chips, available immediately. The company’s previous best offering for ML workloads, which it says was also the cheapest in the industry, was the EC2 G4 instance, based on the Nvidia T4 GPU. Compared with G4, the new instances provide lower latency, up to 3x higher inference throughput, and up to 40% lower cost-per-inference.
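To see how throughput and instance price combine into a cost-per-inference figure, here is a minimal Python sketch. The hourly prices and request rates below are hypothetical placeholders, not AWS pricing or benchmark results, and the function name `cost_per_inference` is introduced purely for illustration. Note also that the "up to" figures AWS quotes may not both apply to the same workload.

```python
SECONDS_PER_HOUR = 3600


def cost_per_inference(hourly_price_usd: float, inferences_per_second: float) -> float:
    """Cost of one inference on an instance billed by the hour and kept fully busy."""
    if inferences_per_second <= 0:
        raise ValueError("inferences_per_second must be positive")
    return hourly_price_usd / (inferences_per_second * SECONDS_PER_HOUR)


# Hypothetical numbers for illustration only (NOT real AWS prices or benchmarks).
baseline_cost = cost_per_inference(hourly_price_usd=1.00, inferences_per_second=1000)
new_cost = cost_per_inference(hourly_price_usd=1.80, inferences_per_second=3000)

# Fractional saving per inference of the new instance relative to the baseline.
savings = 1 - new_cost / baseline_cost
print(f"Cost-per-inference saving: {savings:.0%}")
```

Under these assumed numbers, an instance that costs 1.8x as much per hour but delivers 3x the throughput comes out roughly 40% cheaper per inference. The point of the sketch is that a higher hourly price can still mean a lower cost per prediction if throughput rises faster than price.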

Inferentia offers 128 TOPS per chip for 8-bit integer data (Image: AWS)

While not much is known about Inferentia itself, we do know that it offers 128 TOPS per chip for INT8 data (the largest EC2 Inf1 instance is based on 16 chips and offers about 2,000 TOPS). We also know it supports multiple data types (including INT8 and mixed-precision FP16 and bfloat16). Each chip has four “Neuron Cores” alongside “a large amount” of on-chip memory. There is an SDK for the chip which can split large models across multiple chips using a high-speed interconnect.
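The per-instance figure follows from simple multiplication of the per-chip number. A quick Python check, assuming peak INT8 throughput scales linearly with chip count (an idealization; delivered performance depends on the model and on how it is split across chips):

```python
TOPS_PER_CHIP = 128       # INT8 figure quoted for one Inferentia chip
CHIPS_PER_INSTANCE = 16   # chip count the article cites for an Inf1 instance


def aggregate_tops(chips: int, tops_per_chip: int = TOPS_PER_CHIP) -> int:
    """Peak aggregate TOPS, assuming perfect linear scaling across chips."""
    return chips * tops_per_chip


total = aggregate_tops(CHIPS_PER_INSTANCE)
print(total)  # 2048
```

The exact product is 2,048 TOPS, which is the source of the rounded figure of 2,000 TOPS.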

Amazon joins an elite group of datacentre hyperscalers developing their own chips for use in their cloud facilities. Google has its tensor processing unit (TPU), Baidu designed its Kunlun series and Alibaba has its Hanguang 800.

Meanwhile, Microsoft has begun offering Graphcore chips for customer ML workloads as part of Azure.

Facebook is believed to be working on an ASIC for AI acceleration in its datacentres, but has yet to reveal its hand, if in fact it is playing this game at all.