Hardware-software solution uses sparsity to accelerate AI recommendation models that are typically memory-bound...
A British SME, Myrtle.ai, is helping hyperscale data centers accelerate the type of AI models commonly used in recommendation engines, with a product that has the potential to save hyperscalers millions of dollars per year. The product, Seal, is a hardware-software solution that accelerates certain memory-bound operations that are key to AI recommendation model inference.
AI recommendation models power large parts of the internet as we know it, serving adverts and selecting personalized content for social network news feeds. These models, computed in the cloud, make up a significant portion of hyperscale data center workloads today.
Myrtle is a hardware-software engineering house based in Cambridge, UK. Seal was born out of the company’s involvement with the MLPerf AI benchmarking organization – in helping design the benchmarks, the Myrtle team realized how much revenue recommendation models generate for hyperscalers, and set about designing an accelerator that would deliver large gains in latency-bounded throughput in existing infrastructure.
“The memory and compute patterns within recommendation models are unusual,” explained Myrtle CEO Peter Baldwin, in an interview with EE Times. “They have a very specific bottleneck which is suffered by all the hyperscalers within their sparse features. This naturally fell within our sparsity capability – our machine learning team has a great ability to induce sparsity through training. And we were able to translate that into efficient hardware that can exploit it.”
Density and sparsity
Most AI models have dense and sparse areas, but the effect is particularly pronounced in recommendation models. Sparse areas are parts of the neural network where most of the weights are zero – computing these branches of the network uses a similar amount of energy and time as dense areas, even though the result is often zero. If a particular network is sparse, there are efficiencies to be gained by skipping the parts that always come to zero. Myrtle’s previous work involves inducing sparsity by training the network in a special way, such that models can easily be compressed or pruned to become smaller and more efficient.
“Seal was designed to fit into available slots within existing infrastructure, to allow the dense compute accelerators that people already use to do more of what they’re good at by offloading the things that they’re not so good at,” Baldwin said. “We are giving the hyperscalers more bang for their buck by offloading those memory bound operations and inference that caused them such headaches.”
Seal’s hardware is a memory module in the Open Compute Project M.2 form factor, intended to fit into Glacier Point carrier cards. Each module has either 16 or 32 GB of DDR4 memory plus an FPGA which runs Myrtle’s accelerator code. While it was designed to work with an Intel Xeon CPU, the compute type most commonly used to process AI recommendation models in hyperscale data centers, it also fits alongside GPU or ASIC accelerators.
Seal’s software portion provides alternative versions of the memory-bound inference operations, which run on the Seal hardware instead of the host. Customers keep their existing software stack and swap in the Seal versions of those operations in PyTorch.
“You can offload the memory-bound operations to the Seal stick [hardware] and as you do that, your compute resources become freed up,” Baldwin said. “You can therefore run that harder, with a higher batch size, until it becomes memory-bound again, then just plug in another Seal stick… we like to say that Seal is a virtuous circle: you can basically keep plugging in Seal sticks until your inference becomes compute bound again.”
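Baldwin's "virtuous circle" can be illustrated with a toy bottleneck model. The sketch below is an illustration only: the additive model, the function names and every number in it are hypothetical assumptions, not Myrtle figures or measurements. It assumes throughput is capped by whichever is lower, the compute limit or the memory limit, and that each added stick raises the memory limit by a fixed amount.

```python
def throughput(compute_limit: float, host_memory_limit: float,
               per_stick_gain: float, sticks: int) -> float:
    """Toy model: the slower of compute and memory sets the throughput.

    All quantities are in the same arbitrary units (hypothetical values).
    """
    memory_limit = host_memory_limit + sticks * per_stick_gain
    return min(compute_limit, memory_limit)


def sticks_until_compute_bound(compute_limit: float, host_memory_limit: float,
                               per_stick_gain: float) -> int:
    """Count the sticks added before compute becomes the bottleneck."""
    if per_stick_gain <= 0:
        raise ValueError("per_stick_gain must be positive")
    sticks = 0
    # Keep adding sticks while memory is still the limiting factor.
    while host_memory_limit + sticks * per_stick_gain < compute_limit:
        sticks += 1
    return sticks


if __name__ == "__main__":
    # Hypothetical numbers chosen only to make the arithmetic easy to follow.
    n = sticks_until_compute_bound(compute_limit=100.0,
                                   host_memory_limit=20.0,
                                   per_stick_gain=30.0)
    for s in range(n + 1):
        print(s, throughput(100.0, 20.0, 30.0, s))
```

In this simplified picture, each extra stick lifts throughput until the compute limit is reached, after which adding more sticks makes no difference.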
Crucially, models do not have to be retrained or changed in any way. That matters for customers whose revenue depends directly on adverts served by very complex, very finely tuned, proprietary recommendation models which are already deployed at large scale.
Myrtle’s secret sauce is in its patented sparsity IP, which the company has implemented optimally on the Seal hardware. This IP accelerates operations that involve multiplying a sparse vector and a dense matrix. This can be used to accelerate any application which has a fundamental pattern of a sparse vector times a dense matrix, but “it has been targeted at recommendation systems because this is where the key pain point is within running those models. And that’s where the sales are,” Baldwin said.
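To see why a sparse vector times a dense matrix is dominated by memory access rather than arithmetic, consider the minimal pure-Python sketch below. It is not Myrtle's implementation, and the function names and data are hypothetical. The sparse vector is assumed to be stored as a list of non-zero positions and their values, so the product reduces to fetching and summing a handful of rows of the matrix.

```python
from typing import List, Sequence


def sparse_vec_times_dense_matrix(indices: Sequence[int],
                                  values: Sequence[float],
                                  matrix: Sequence[Sequence[float]]) -> List[float]:
    """Multiply a sparse row vector by a dense matrix.

    Only the rows of `matrix` selected by `indices` are read; every other
    row corresponds to a zero entry in the vector and is skipped.
    """
    n_cols = len(matrix[0])
    out = [0.0] * n_cols
    for idx, val in zip(indices, values):
        row = matrix[idx]              # one scattered memory lookup per non-zero
        for j in range(n_cols):
            out[j] += val * row[j]     # very little arithmetic per value fetched
    return out


def dense_vec_times_dense_matrix(vector: Sequence[float],
                                 matrix: Sequence[Sequence[float]]) -> List[float]:
    """Reference version that touches every row, including zero entries."""
    n_cols = len(matrix[0])
    out = [0.0] * n_cols
    for i, val in enumerate(vector):
        for j in range(n_cols):
            out[j] += val * matrix[i][j]
    return out


if __name__ == "__main__":
    table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    # The vector [0, 1, 0, 2] stored as its non-zero positions and values.
    print(sparse_vec_times_dense_matrix([1, 3], [1.0, 2.0], table))
    print(dense_vec_times_dense_matrix([0.0, 1.0, 0.0, 2.0], table))
```

Both functions return the same result, but the sparse version reads only two of the four rows. In a real recommendation model the matrix can be far larger than this toy table, so the cost is mostly in fetching the rows.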
“It’s the right software stack with the right optimizations driving the right hardware architecture, and actually bringing all those elements together is the thing that really unlocks the performance,” said Liz Corrigan, Myrtle’s senior engineering manager.
Tests performed on recommendation models (the same models which will make up part of the MLPerf recommendation model benchmark when it is launched later this year) reveal that Seal hardware can offer an 8X improvement in latency-bounded throughput. The 16 GB Seal stick will deliver 18 GB/s of vector processing bandwidth, and the 32 GB stick will deliver 16 GB/s.
Breaking the memory bottleneck with a practical solution that complements existing infrastructure has been quite a challenge, particularly given the thermal and size constraints of the M.2 form factor.
Another challenge for the British 31-person engineering firm will be getting in the door of US- and China-based hyperscalers such as Amazon and Google. Baldwin said a good reputation built up within the MLPerf community and working groups will certainly help, and that Seal is the right product at the right time for the hyperscalers.
“It will certainly be put through its paces, and we have huge hopes for it based on the raw technical excellence that Seal shows us today,” he said. “We have a history of punching well above our weight, and I think our continued involvement with MLPerf is really our badge of technical credibility. We wouldn’t have understood the recommendation model problem well enough at such a fine-grained level to even attempt to solve it, if we hadn’t have had that level of involvement [with MLPerf].”
Myrtle’s first customers for Seal will be evaluating the solution during the third quarter of 2020. The company is also reviewing alternative form factors for Seal that will be applicable to more use cases, including E1.S and dual-slot M.2 form factors.
“In the AI market we think we’re going to see more very specific solutions that don’t pretend to be all things to all men, but are for very specific large scale problems,” Baldwin added. “Basically, allowing hyperscalers to do more with the infrastructure they have already and slowing the massive increase in footprint of servers that are needed to run the inference of recommendation models.”