Software Is a Big Deal for AI Inference Accelerators

Article By: Geoff Tate

Inference chips typically have lots of multiply-accumulate units (MACs) and memory, but actual throughput on real-world models is often lower than expected. Software is usually the culprit.

Inference accelerators represent an incredible market opportunity, not only for chip and IP companies but also for the customers who desperately need them. As inference accelerators come to market, a common question we hear is: “Why is my inference chip not performing like it was designed to?”

Oftentimes, the simple answer is the software.

Software is key
All inference accelerators today are programmable because customers expect their models to evolve over time. Programmability lets them take advantage of future enhancements, something that would not be possible with hard-wired accelerators. However, customers also want the most throughput for a given cost and a given amount of power, which means the hardware has to be used very efficiently. The only way to achieve that is to design the software in parallel with the hardware, so that the two work together well enough to reach maximum throughput.

Figure: One of the highest-volume applications for inference acceleration today is object detection and recognition. (Image: Flex Logix)

One of the biggest problems today is that companies find themselves with an inference chip that has lots of MACs and tons of memory, but actual throughput on real-world models is lower than expected because much of the hardware sits idle. In almost every case, the cause is that the software work was done after the hardware was built. During development, designers have to make many architectural tradeoffs, and they cannot make them well without considering hardware and software together, early in the project. Chip designers need to study the target models closely and then build a performance estimation model to determine how different amounts of memory, MACs, and DRAM would change throughput and die size, and how the compute units need to coordinate for different kinds of models.
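A performance estimation model of the kind described above can start out very simple. The following is a minimal sketch, not any vendor's actual tool: it assumes each layer takes as long as the slower of its compute time and its DRAM transfer time (a roofline-style bound). The `Chip` and `Layer` classes, the layer names, and every number are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    macs: int                 # number of MAC units (hypothetical)
    clock_hz: float           # MAC clock frequency (hypothetical)
    dram_bytes_per_s: float   # DRAM bandwidth (hypothetical)


@dataclass
class Layer:
    name: str
    mac_ops: int      # multiply-accumulates needed per inference
    dram_bytes: int   # bytes moved to or from DRAM per inference


def layer_time(chip: Chip, layer: Layer) -> float:
    """Assume a layer is bound by the slower of compute and DRAM traffic."""
    compute_s = layer.mac_ops / (chip.macs * chip.clock_hz)
    memory_s = layer.dram_bytes / chip.dram_bytes_per_s
    return max(compute_s, memory_s)


def estimate(chip: Chip, layers: list[Layer]) -> tuple[float, float]:
    """Return (inferences per second, fraction of peak MAC throughput used)."""
    total_s = sum(layer_time(chip, layer) for layer in layers)
    total_macs = sum(layer.mac_ops for layer in layers)
    peak_macs_per_s = chip.macs * chip.clock_hz
    return 1.0 / total_s, total_macs / (total_s * peak_macs_per_s)


# Two made-up layers: one compute-bound, one DRAM-bound.
layers = [
    Layer("conv_a", mac_ops=4_096_000_000, dram_bytes=8_000_000),
    Layer("conv_b", mac_ops=2_048_000_000, dram_bytes=32_000_000),
]

base = Chip(macs=4096, clock_hz=1e9, dram_bytes_per_s=16e9)
doubled = Chip(macs=8192, clock_hz=1e9, dram_bytes_per_s=16e9)

base_fps, base_util = estimate(base, layers)
doubled_fps, doubled_util = estimate(doubled, layers)
print(f"4096 MACs: {base_fps:.0f} inferences/s, {base_util:.0%} MAC utilization")
print(f"8192 MACs: {doubled_fps:.0f} inferences/s, {doubled_util:.0%} MAC utilization")
```

With these made-up numbers, doubling the MAC count raises throughput from about 333 to 400 inferences per second while MAC utilization falls from 50% to 30%, because the second layer is limited by DRAM traffic rather than compute. A real estimator would also need to model on-chip memory, data reuse, and overlap between layers, which is where close study of the target models comes in.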

Today, one of the highest-volume applications for inference acceleration is object detection and recognition. That is why inference accelerators must be very good at megapixel processing using complex algorithms like YOLOv3. To achieve this, software teams must work with hardware teams throughout the entire chip design process, from performance estimation through building the full compiler and generating code. Once the chip RTL is done, the only way to verify it at the top level is to run entire layers of models through the chip with megapixel images. That requires the ability to generate all the code (or bit streams) that control the device, which is possible only when software and hardware teams work closely together.
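To give a rough sense of what megapixel processing means for the MAC array, the sketch below counts the multiply-accumulates in a single convolution layer at two input sizes. The layer shape (3x3 kernel, 3 input channels, 32 output channels, stride 1 with same-size output) is assumed here to match the first layer of YOLOv3, and the 1920x1080 and 416x416 resolutions are illustrative choices, not figures from any particular chip or customer model.

```python
def conv_macs(out_h: int, out_w: int, in_ch: int, out_ch: int, kernel: int) -> int:
    """Multiply-accumulates for one 2D convolution layer.

    Each output element needs kernel * kernel * in_ch MACs, and there are
    out_h * out_w * out_ch output elements.
    """
    return out_h * out_w * out_ch * in_ch * kernel * kernel


# Assumed layer shape: 3x3 kernel, 3 -> 32 channels, stride 1, same-size output.
macs_hd = conv_macs(1080, 1920, in_ch=3, out_ch=32, kernel=3)   # ~2-megapixel image
macs_416 = conv_macs(416, 416, in_ch=3, out_ch=32, kernel=3)    # small 416x416 input

print(f"1920x1080: {macs_hd:,} MACs")
print(f"416x416:   {macs_416:,} MACs")
print(f"ratio:     {macs_hd / macs_416:.1f}x")
```

Under these assumptions the 2-megapixel image needs about 1.8 billion MACs for this one layer, roughly 12 times the 416x416 case, and the intermediate activations grow by the same factor. This is why running whole layers with megapixel images exercises the chip far more thoroughly than small test inputs do.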

Today, customer models are neural networks, and they arrive in ONNX or TensorFlow Lite format. Software takes these neural networks and applies algorithms to configure the interconnect and state machines that control the movement of data within the chip. This is done in RTL. The front end of the hardware is also written in RTL. Thus, the engineering team writing the front-end design speaks a similar language to the people writing the software.

Why software also matters in the future
Focusing on software is critical not only early on but also in the future. Companies that want to keep delivering improvements will need their hardware teams to study how the software is evolving and in which direction emerging models are shifting. This will enable chip designers to make changes as needed, while the company also improves its compiler and algorithms to better utilize the hardware over time.

In the future, we expect companies to continue bringing very challenging models to chip designers, with the expectation that new inference accelerators can deliver the performance needed to handle those models. As we see today, many chip companies may try to cut costs and development time by focusing more on the hardware initially. However, when the chips are done and delivered to market, the companies that focused on software early on will be the ones that offer the best performance and succeed.

— Geoff Tate is CEO of Flex Logix Technologies
