MLPerf rolled out an initial suite of five benchmarks to measure inference jobs on servers, PCs, handsets and some embedded systems
The MLPerf consortium released benchmarks to measure inference tasks on servers and client systems. Tools to run MLPerf Inference v0.5 are available on the group’s web site, but vendors are not expected to start posting results using the metrics until October.
Initially, MLPerf Inference includes five benchmarks to measure the performance and power efficiency of tasks run on smartphones, PCs, servers and some embedded systems. They consist of two tests for image classification, two for object detection and one for machine translation, each with a defined model and data set.
The metrics target “many, but not all embedded systems,” said David Kanter, the inference group’s co-chair. “Our load generator is written in C++ (instead of Python), which enables running on much more resource-constrained systems,” he said.
“However, there are probably some platforms that are so small or specialized for cost/power reasons that they will not be able to run our load generator or our networks. For example, we don't have a suitable benchmark for a system which only does wake-word detection,” Kanter added.
The benchmarks will measure average active power consumed during inference and active idle power when the system is waiting for inference queries. The suite also includes several metrics to describe raw performance specific to different scenarios, along with rules for applying them.
For example, for a single stream, the suite measures responsiveness based on the 90th-percentile latency to answer a query. It can also report the number of simultaneous inference streams a system can sustain. Two other metrics measure throughput for online and offline batch systems.
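As a rough illustration of the single-stream and offline metrics described above, the Python sketch below computes a 90th-percentile latency and a simple throughput figure from recorded timings. It is not MLPerf's load generator (which is written in C++). The function names, the nearest-rank percentile method and the sample numbers are all assumptions made for illustration.

```python
import math


def percentile_latency(latencies_ms, pct=90):
    """Return the pct-th percentile latency using the nearest-rank method.

    Assumption: nearest-rank is chosen here for simplicity; the article
    does not say how MLPerf computes the percentile.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Nearest rank: the smallest 1-based rank covering pct percent of samples.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(num_samples, elapsed_s):
    """Return samples processed per second for a batch (offline-style) run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return num_samples / elapsed_s


if __name__ == "__main__":
    # Hypothetical per-query latencies, in milliseconds, for one stream.
    samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print("90th-percentile latency (ms):", percentile_latency(samples))
    # Hypothetical batch run: 1,000 samples processed in 4 seconds.
    print("throughput (samples/s):", throughput(1000, 4.0))
```

Reporting a tail percentile rather than the mean keeps a handful of slow queries from being hidden by many fast ones, which is why it suits a responsiveness metric.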
So far, 17 organizations have expressed interest in submitting inference results on a total of 196 combinations of different models and scenarios that range from a single stream on a handset to a batch job on a server. The vision benchmarks use the ImageNet and MS-COCO data sets, which are widely used in robotics, automation, and automotive applications. The translation test leverages an existing English-German benchmark.
MLPerf provides both the benchmark specs and reference code implementations in the ONNX, PyTorch, and TensorFlow frameworks to run them. The code defines the problems, models, and quality targets, and provides instructions to run the benchmarks.
The specs and tools were developed over 11 months by a group that included members from Arm, Cadence, Centaur Technology, Facebook, Futurewei, General Motors, Google, Habana Labs, Harvard University, Intel, MediaTek, Microsoft, Nvidia, and Xilinx. So far, the group has “had very minimal discussions about the v0.6 version” that will form its next step, said Vijay Janapa Reddi, an associate professor at Harvard who is the inference group’s other co-chair.
“We downsized the original batch of models we had in mind to meet the current v0.5 timeline,” said Reddi. “In the v0.6 version, we will have more models and likely update some of the models for the existing tasks,” he said.
“Other questions we always ask ourselves is ‘How can we get a larger universe of submitters,’ and ‘How can we make submitting easier,’” said Kanter, adding that future versions may also be capable of running on more constrained embedded systems.
MLPerf is a collaboration among more than 40 companies and universities. Last year, the group rolled out its Training v0.5 benchmark suite. So far, Google, Intel and Nvidia have released scores for their chips using it.