MCU-based Implementation for Deep Learning at the Edge

Article By : Markus Levy

Deep learning on MCUs is perfectly possible, despite TOPS ratings that don't tell the full story.

Just a few years ago, it was assumed that machine learning (ML), and even deep learning (DL), could only be performed on high-end hardware, with training and inference at the edge executed by gateways, edge servers, or data centers. It was a valid assumption at the time because the trend toward distributing computational resources between the cloud and the edge was in its early stages. But this scenario has changed dramatically thanks to intensive research and development efforts by industry and academia.

The result is that today, processors capable of delivering many trillions of operations per second (TOPS) are not required to perform ML. In an increasing number of cases, the latest microcontrollers, some with embedded ML accelerators, can bring ML to edge devices. Not only can these devices perform ML, they can do it well, at low cost, and with very low power consumption, connecting to the cloud only when absolutely necessary. In short, microcontrollers with integrated ML accelerators represent the next step in bringing computing to the sensors, such as microphones, cameras, and environmental monitors, that generate the data on which all the benefits of IoT depend.

How deep is the edge?

While the edge is broadly considered the furthest point in an IoT network, it’s generally regarded as an advanced gateway or edge server. However, that’s not where the edge actually ends. It ends at the sensors near the user. It therefore becomes logical to place as much analytical power near the user as possible, a task for which microcontrollers are ideally suited.
MobileNet V1 models with different width multipliers differ drastically in number of parameters, computations, and accuracy. Changing the width multiplier from 1.0 to 0.75, however, only minimally affects top-1 accuracy while significantly reducing the number of parameters and computations. (Image: NXP)
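The effect of the width multiplier on model size can be approximated with simple arithmetic. The following is a minimal sketch, not NXP's tooling: it assumes the standard published MobileNet V1 layout (a 3x3 stem convolution, 13 depthwise-separable blocks, and a 1,000-class classifier), a 224x224 input, and it ignores batch-normalization parameters. The names `BLOCKS` and `mobilenet_v1_cost` are illustrative only. Because the pointwise convolutions dominate and their cost scales with the product of input and output channels, both totals shrink roughly with the square of the multiplier.

```python
# (output channels, stride) of the 13 depthwise-separable blocks in MobileNet V1.
BLOCKS = (
    [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
    + [(512, 1)] * 5
    + [(1024, 2), (1024, 1)]
)


def mobilenet_v1_cost(alpha=1.0, resolution=224, num_classes=1000):
    """Return (parameters, multiply-accumulates) for a given width multiplier.

    Batch-normalization parameters are ignored, so the parameter count is a
    slight underestimate.
    """
    def scale(channels):
        return int(channels * alpha)

    size = resolution // 2            # the stem convolution has stride 2
    c_in = scale(32)
    params = 3 * 3 * 3 * c_in         # 3x3 kernel over 3 input (RGB) channels
    macs = params * size * size

    for c_out, stride in BLOCKS:
        c_out = scale(c_out)
        size //= stride
        depthwise = 3 * 3 * c_in      # one 3x3 filter per input channel
        pointwise = c_in * c_out      # 1x1 convolution that mixes channels
        params += depthwise + pointwise
        macs += (depthwise + pointwise) * size * size
        c_in = c_out

    classifier = c_in * num_classes
    params += classifier + num_classes   # weights plus biases
    macs += classifier
    return params, macs


if __name__ == "__main__":
    for alpha in (1.0, 0.75, 0.5, 0.25):
        params, macs = mobilenet_v1_cost(alpha)
        print(f"alpha={alpha:<5} params={params / 1e6:5.2f} M  MACs={macs / 1e6:6.1f} M")
```

For a multiplier of 1.0 this gives about 4.2 million parameters and about 569 million multiply-accumulates. It says nothing about accuracy, which has to be measured on a trained model.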
A case could be made that single-board computers can also be used for edge processing, as they’re capable of remarkable performance and, when clustered, can rival a small supercomputer. But they’re still too large and too costly to be deployed in the hundreds or thousands required in large-scale applications. They also require an external source of DC power that in some cases may be beyond what is available, while an MCU consumes only milliwatts and can be powered by coin cell batteries or even a few solar cells.

So, it’s not surprising that performing ML on microcontrollers at the edge has become a very hot area of development. It even has a name: TinyML. The goal of TinyML is to allow inferencing, and ultimately training, to be executed on small, resource-constrained, low-power devices, especially microcontrollers, rather than on larger platforms or in the cloud. This requires neural network models to be reduced in size to accommodate the comparatively modest processing, storage, and bandwidth resources of these devices, without significantly reducing functionality and accuracy.

These resource-optimized schemes allow the devices to ingest sufficient sensor data to serve their purpose while fine-tuning accuracy and reducing the resource requirements. So, while data might still be sent to the cloud (or perhaps first to an edge gateway and then to the cloud), there will be much less of it because considerable analysis has already been performed.

A popular example of TinyML in action is a camera-based object detection system that, while capable of capturing high-resolution images, has limited storage and therefore requires a reduction in image resolution. However, if the camera includes on-device analytics, only objects of interest are captured rather than the entire scene, and because the relevant images are fewer, their higher resolution can be retained.
This capability is typically associated with larger, more powerful devices, but TinyML technology allows it to happen on microcontrollers.

Small but mighty

Although TinyML is a relatively new paradigm, it is already producing surprising results for inferencing (even with relatively modest microcontrollers) and training (on more powerful ones) with minimal accuracy loss. Recent examples include voice and facial recognition, voice commands and natural language processing, and even running several complex vision algorithms in parallel.

Practically speaking, this means that a microcontroller costing less than $2 with a 500-MHz Arm Cortex-M7 core and from 28 Kbytes to 128 Kbytes of memory can deliver the performance required to make sensors truly intelligent. Even at this price and performance level, these microcontrollers have multiple security functions, including AES-128, support multiple external memory types, Ethernet, USB, and SPI, and either include or support various types of sensors, as well as Bluetooth, Wi-Fi, and SPDIF and I2C audio interfaces. Spend a little more, and the device will typically have a 1-GHz Arm Cortex-M7, a 400-MHz Cortex-M4, 2 Mbytes of RAM, and graphics acceleration. Power consumption is typically no more than a few milliamps from a 3.3 VDC supply.
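To put "a few milliamps from a 3.3 VDC supply" into perspective, the power and battery-life arithmetic can be sketched as follows. This is a rough back-of-the-envelope estimate under stated assumptions, not a measurement: the 220 mAh capacity is a typical nominal figure for a CR2032-class coin cell, the duty cycle and sleep current are hypothetical, and regulator losses, battery derating, and peak-current limits are all ignored. The function names are illustrative.

```python
def power_mw(current_ma, supply_v=3.3):
    """Electrical power in milliwatts: P = V * I."""
    return current_ma * supply_v


def average_current_ma(active_ma, sleep_ma, duty_cycle):
    """Time-weighted average current for a device that is active only part of the time."""
    return active_ma * duty_cycle + sleep_ma * (1.0 - duty_cycle)


def battery_life_hours(capacity_mah, current_ma):
    """Idealized runtime: capacity divided by average current draw."""
    return capacity_mah / current_ma


if __name__ == "__main__":
    active_ma = 3.0       # "a few milliamps" while running inference
    sleep_ma = 0.01       # hypothetical sleep current
    capacity_mah = 220.0  # assumed nominal coin-cell capacity

    print(f"active power: {power_mw(active_ma):.1f} mW")
    for duty in (1.0, 0.1, 0.01):
        avg = average_current_ma(active_ma, sleep_ma, duty)
        hours = battery_life_hours(capacity_mah, avg)
        print(f"duty cycle {duty:>5.0%}: {avg:.3f} mA average, about {hours:,.0f} h")
```

The point of the sketch is the scaling: at a few milliamps the device draws on the order of 10 mW while active, and waking only occasionally stretches an idealized battery life from days to months.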
Machine learning use cases (Image: NXP)
A few words about TOPS

Consumers are not alone when they use a single metric to define performance; designers do it all the time, and marketing departments love it. This is because a headline specification makes differentiation between devices simple, or so it would seem. A classic example is the CPU, which for many years was defined by its clock rate. Fortunately for both designers and consumers, this is no longer the case. Using just one metric to rate a CPU is akin to evaluating a car’s performance by the engine’s redline. It is not meaningless, but it has little to do with how powerful the engine is or how well the car will perform, because many other factors together determine these characteristics.

Unfortunately, single-metric ratings are increasingly applied to neural network accelerators, including those within high-performance MPUs or microcontrollers, which are specified in billions or trillions of operations per second because, once again, it is an easy number to remember. But in practice, GOPS and TOPS alone are relatively meaningless metrics and represent a measurement (no doubt the best one) made in a lab rather than in an actual operating environment. For example, TOPS does not consider memory bandwidth limitations, the required CPU overhead, pre- and post-processing, and other factors. When all these and others are considered, such as performance on a specific board in actual operation, system-level performance could well be only 50% or 60% of the TOPS value on the datasheet.

All these numbers tell you is the number of computation elements in the hardware multiplied by their clock speed, not how often the hardware will have data available when it needs it. If data were always immediately available, power consumption were not an issue, memory constraints did not exist, and the algorithm were seamlessly mapped to the hardware, the numbers would be more meaningful. But the real world presents no such ideal environments.
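The gap between a datasheet figure and delivered throughput can be illustrated with a short calculation. This is a simplified sketch under stated assumptions, not a vendor formula: the peak figure is taken as compute elements times clock rate (counting one multiply-accumulate as two operations, a common convention), and the memory limit is modeled with a roofline-style bound, a technique swapped in here for illustration. The accelerator size, clock, bandwidth, and operations-per-byte values are all hypothetical.

```python
def peak_ops_per_second(mac_units, clock_hz, ops_per_mac=2):
    """Headline figure: compute elements multiplied by clock rate."""
    return mac_units * ops_per_mac * clock_hz


def attainable_ops_per_second(peak_ops, mem_bytes_per_second, ops_per_byte):
    """Roofline-style bound: limited by either compute or memory traffic.

    ops_per_byte is how many operations the workload performs for every
    byte it has to move to or from memory.
    """
    return min(peak_ops, mem_bytes_per_second * ops_per_byte)


if __name__ == "__main__":
    # Hypothetical accelerator: 512 MAC units clocked at 1 GHz.
    peak = peak_ops_per_second(mac_units=512, clock_hz=1_000_000_000)
    print(f"datasheet peak: {peak / 1e12:.3f} TOPS")

    # Hypothetical memory system delivering 4 GB/s.
    for ops_per_byte in (50, 150, 400):
        real = attainable_ops_per_second(peak, 4_000_000_000, ops_per_byte)
        print(f"{ops_per_byte:>4} ops/byte: {real / 1e12:.3f} TOPS "
              f"({real / peak:.0%} of peak)")
```

Even this bound is optimistic, since it leaves out the CPU overhead and the pre- and post-processing mentioned above, but it shows why two devices with the same TOPS rating can perform very differently on the same model.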
When applied to ML accelerators in microcontrollers, the metric is even less valuable. These tiny devices are typically rated at 1 to 3 TOPS but can still deliver the inference capabilities required in many ML applications. These devices also rely on Arm Cortex processors specifically designed for low-power ML applications. Add support for both integer and floating-point operations and the many other features in the microcontroller, and it becomes obvious that TOPS, or any other single metric, is incapable of adequately defining performance, either alone or in a system.

Conclusion

The desire to perform inferencing on microcontrollers directly on or attached to sensors, such as still and video cameras, is now emerging as the IoT domain moves closer to performing as much processing as possible at the edge. That said, the pace of development of application processors and neural network accelerators within microcontrollers is swift, and more proficient solutions are appearing frequently. The trend is toward consolidating more AI-centric functionality, such as neural network processing, along with an application processor in the microcontroller without dramatically increasing power consumption or size.

Today, models can be trained on a more powerful CPU or GPU and then implemented on a microcontroller using inference engines such as TensorFlow Lite, which reduce them in size to fit within the microcontroller’s resource constraints. Scaling can easily be performed to accommodate greater ML requirements. Soon it should be possible to perform not just inferencing but training on these devices, which will effectively make the microcontroller an even more formidable competitor to larger and more expensive computing solutions.

Markus Levy is director of AI and machine learning technologies at NXP Semiconductors.
