MLPerf Launches TinyML Benchmark for Smallest AI Systems

Article by: Sally Ward-Foxton

New TinyML benchmark has metrics for latency and energy consumption, and the first round of results is out.

MLCommons, the industry consortium behind the MLPerf benchmark suite for machine learning (ML) systems, has launched a performance benchmark designed for TinyML systems. The consortium also released the first round of scores submitted for the newly created MLPerf Tiny Inference benchmark.

MLPerf already offers benchmarks for HPC, data center and mobile-scale systems. The new benchmark is for TinyML systems – those that process machine learning workloads in extremely resource-constrained environments.

“[The MLPerf Tiny Inference benchmark] completes the microwatts to megawatts spectrum of machine learning,” said David Kanter, Executive Director of MLCommons. “If you look at some of our training and HPC benchmarks, the HPC benchmark is running on 16,000 nodes on the world’s largest supercomputer. On the tiny side, it’s about how do we measure performance for the smallest and lowest power devices out there.”

The range of devices covered by MLPerf’s benchmarks – from the TinyML benchmark up to data center devices (Image: MLCommons)

Typically, a TinyML system means an embedded microcontroller-class processor performing inference on sensor data locally at the sensor node, whether that’s microphone, camera or some other kind of sensor data. A typical neural network in this class of device might be 100 kB or less, and usually the device is restricted to battery power.

While there is no exact definition of TinyML, the term generally refers to microcontroller-based systems. MLPerf has stretched this a little so it incorporates systems up to and including Raspberry Pi-class systems.

Developing benchmarks for this sector has been challenging, said MLPerf Tiny Inference working group chair, Harvard University Professor Vijay Janapa Reddi.

“Any inference system has a complicated stack, but [with TinyML], everything is to do with sensor data – audio, visual, IMU – the ecosystem is especially complex,” said Janapa Reddi. “It gets especially challenging in the embedded space because a lot of embedded hardware has custom tool chains… that makes the benchmarking space extremely challenging. We had to custom build a lot of infrastructure from the ground up, there was nothing that could easily be borrowed from the MLPerf Inference benchmark.”

Defining a fixed benchmark to effectively showcase innovation in hardware, software, tooling, and algorithms was a particular challenge for the TinyML space, given there is widespread innovation in all parts of the stack, he added.

Workload choices
Developed in partnership with EEMBC, the Embedded Microprocessor Benchmark Consortium, the new benchmark uses EEMBC’s test harness (the EnergyRunner framework), while MLPerf’s working groups defined the workloads, rules and benchmark definition.

As with other MLPerf benchmarks, organisations can submit scores for hardware and software systems running one or more of several different workloads. The number and diversity of TinyML use cases made choosing representative workloads particularly difficult. The MLPerf Tiny Inference working group narrowed it down to four workloads:

  • Keyword spotting: Limited-vocabulary speech recognition using the Google Speech Commands dataset with a DS-CNN model
  • Anomaly detection: Audio time-series anomaly detection using the ToyADMOS dataset of machine operating sounds with a Deep Autoencoder model
  • Visual wake words: Two-class image classification (images are classified as ‘person’ or ‘not-person’) using the Visual Wake Words dataset with a MobileNetV1 0.25X model
  • Image classification: Multiclass image classification (10 classes) on small images from the CIFAR10 dataset with a ResNet-8 model

Like the other MLPerf benchmarks, MLPerf Tiny Inference has Closed and Open divisions, in an attempt to offer both comparability between like systems and flexibility for demonstration of novel approaches, and to allow submitters to show their value-add, no matter which part of the stack they are focused on.

The performance metrics settled on by the working group are latency for a given prediction accuracy, and energy consumption for a given prediction accuracy.

While latency scores are always required, energy measurements are optional. Since TinyML systems are often a carefully balanced compromise between power and performance, can we really get a clear picture of system performance without seeing both these metrics together?

“Part of the reason we called this version of the benchmark v0.5 is this is our first set of results for MLPerf Tiny Inference,” said MLCommons’ David Kanter. “Getting the results and making the rules and building the benchmark suite is actually a pretty significant undertaking. And then generating the power/energy results on top of that adds yet another layer of complexity… I am a big fan of the crawl, walk, run approach – getting things up and running and then optimized, and then maybe adding in the additional complexity of energy or power measurement. I think you’ll probably see a lot more energy measurement in our next round of results.”

Working group chair Vijay Janapa Reddi agreed, adding that the TinyML benchmark will serve to provide clarity for the industry as it develops.

“This is a field that is still budding, and it’s trying to find its footing,” he said. “We can wait three years for this field to mature with TOPS and TOPS per Watt numbers splattered all over the place, and then try and get some standardization, or we can come in at the start and work with the industry to help set them in a direction that makes sense for them… For me, it’s not about the exact numbers or the exact systems and so on. It’s much more about bringing that clarity and vision to this community so they can accelerate their forward progress.”

The technology stack for TinyML systems is extremely complex (Image: MLCommons)

The landscape of companies in the TinyML sector looks very different to data center system companies; there are many more startups and SMEs. MLPerf’s TinyML benchmark’s working group has taken this into account, said working group co-chair Colby Banbury.

“We thought about that from the beginning with our benchmark design,” he said. “We put a lot of emphasis on the reference implementation and trying to build that out, I think to a level of importance that wasn’t necessarily had in previous iterations of MLPerf Inference because there wasn’t as much of a need.”

The reference implementation provided by the working group is a set of latency and power scores for all the workloads running on an STMicro Nucleo-L4R5ZI board, selected for its open platform, wide availability and affordability. The board carries an STM32 Arm Cortex-M4 MCU. The entire implementation is made available for potential submitters to use as a jumping-off point for their own systems, if required.

In theory, a software vendor could, for example, take the reference implementation stack, swap in its own component and run it fairly easily, said Banbury.

First results
The first round of results for MLPerf Tiny Inference comprises four submissions (plus the reference systems) in the Closed division, and one submission in the Open division.

In the Closed division, Latent AI provided two submissions for its software-only solution running on a Raspberry Pi. The company’s hardware-agnostic Latent AI Efficient Inference Platform (LEIP) SDK was used to optimize for compute, energy and memory efficiency. Latent AI submitted latency scores across all four workloads for FP32 and INT8 quantization of the models; for example, it handled the keyword spotting workload in 0.39 ms (for FP32) or 0.42 ms (for INT8), compared to the reference system’s 181.92 ms.

Chinese research facility Peng Cheng Laboratory submitted scores across all four workloads as a proof-of-concept for its custom RISC-V microcontroller device, which is designed for TinyML applications. This system ran the keyword spotting workload in 325.63 ms, compared to 181.92 ms for the reference implementation, for example.

Syntiant’s submission, notably the only one to use a hardware accelerator, managed the keyword spotting task with a latency of 5.95 ms (compared to the reference system at 181.92 ms). The company’s NDP120 SoC, designed for keyword spotting, uses an Arm Cortex-M0 CPU core plus Syntiant Core 2 accelerator.
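
As a rough illustration, the keyword spotting latencies quoted above can be turned into ratios against the reference implementation. The sketch below is plain arithmetic on the figures reported in this article; the variable and function names are hypothetical, and because the submissions run on very different classes of hardware, the ratios should not be read as a like-for-like ranking.

```python
# Keyword spotting latencies in milliseconds, as quoted in this article.
# All names below are illustrative; they are not part of any MLPerf tooling.
REFERENCE_MS = 181.92  # reference implementation (STMicro Nucleo-L4R5ZI)

REPORTED_MS = {
    "Latent AI (FP32)": 0.39,
    "Latent AI (INT8)": 0.42,
    "Peng Cheng Laboratory": 325.63,
    "Syntiant NDP120": 5.95,
}


def latency_ratio(reference_ms: float, submission_ms: float) -> float:
    """Return reference latency divided by submission latency.

    A value above 1 means the submission was faster than the reference;
    a value below 1 means it was slower.
    """
    if submission_ms <= 0:
        raise ValueError("latency must be positive")
    return reference_ms / submission_ms


ratios = {
    name: latency_ratio(REFERENCE_MS, ms) for name, ms in REPORTED_MS.items()
}

# Print the submissions from the largest ratio to the smallest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.2f}x the reference speed")
```

On these figures, only the Peng Cheng Laboratory proof-of-concept comes out below 1, meaning it was slower than the reference board on this workload.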

In the Open division, hls4ml was the only submitter. A workflow for neural network optimization, hls4ml was developed for the Large Hadron Collider at CERN and is now developed by the Fast Machine Learning for Science research community. Models optimized with hls4ml were run on a dual-core Arm Cortex-A9 CPU plus a Xilinx FPGA accelerator. The system managed the image classification workload with a latency of 7.9 ms and 77% accuracy, and handled the anomaly detection workload in 0.096 ms with an accuracy of 82%.

Other than for the reference implementation, no energy consumption scores were submitted in this round. The full set of scores is published by MLCommons.

This article was originally published on EE Times.

Sally Ward-Foxton covers AI technology and related issues for EE Times and all aspects of the European industry for EETimes Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a Master’s degree in Electrical and Electronic Engineering from the University of Cambridge.
