Similarity Search Engine Accelerates AI Inference

Article By : Sally Ward-Foxton

Splitting the search portion out of an AI model means it can be hardware-accelerated to dramatically reduce latency.

Memory supplier GSI Technology can accelerate certain types of AI inference using its similarity search accelerator ASIC, Gemini. GSI’s technique separates AI inference into a machine-learning-driven portion, up to and including feature extraction, and a second portion that uses a similarity search algorithm to find a match. This similarity search portion can then be accelerated by GSI’s Gemini hardware, which reduces latency dramatically. “Instead of having the trained model do predictions on images, we use the trained model in a way that gives us semantically rich feature vectors,” George Williams, director of data science at GSI Technology, told EE Times. “Those feature vectors are then sent to a matching server that can be powered by our technology, to yield the best match. That can be further processed by a downstream notification pipeline.”
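The two-stage split Williams describes can be sketched in a few lines of Python. This is a toy illustration, not GSI's software stack: the feature extractor is a stand-in for a real trained model, and the "matching server" is just a cosine-similarity scan over a database of labelled vectors (the part Gemini would accelerate).

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Stand-in for the trained model's feature-extraction layers.
    A real system would run a CNN and take a penultimate layer's
    activations; here we just produce a unit-length 256-d vector."""
    v = image.astype(np.float64).flatten()[:256]
    return v / np.linalg.norm(v)

def match_server(query: np.ndarray, db: np.ndarray, labels: list) -> str:
    """Matching stage: cosine similarity against a database of labelled
    feature vectors (rows of db are unit vectors), returning the label
    of the best match."""
    scores = db @ query               # dot product == cosine for unit vectors
    return labels[int(np.argmax(scores))]
```

A downstream notification pipeline would then consume the returned label rather than raw model outputs.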
GSI Technology’s technique uses feature extraction from a trained AI model, but then hands off to a similarity search algorithm to match with similar items from a database. This example is for a facial recognition system (Image: GSI Technology)
This technique won’t suit all types of AI inference. It works best in applications where there is a large database of labelled data to be searched, such as facial recognition systems. In this case, instead of using the top-most classification layer of the trained model, the technique uses the feature extractor layer, which outputs a smaller, semantically rich representation of the raw data. A similarity search is then performed against a database of already labelled feature vectors to find a best match for that face, which can be done very quickly using GSI’s Gemini similarity search accelerator chip.
“[This technique] gives you a number of very interesting benefits,” said Mark Wright, GSI Technology’s director of marketing. “One is it gets you a result sooner, so now you can start using this technology for real-time applications. Previously, by having to send it up to cloud servers, you couldn’t do that because the latency would have been too high.”
The second benefit is that no retraining is needed to update the database. This makes the technique well suited to applications where records must be added in the field. In a supermarket CCTV system, for example, a lost child’s photo can be added to the database and searched for in the footage immediately.
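The no-retraining property follows from the database being just a collection of labelled vectors: enrolling a new face is an append, not a training run. A minimal sketch, with illustrative names (this is not GSI's API):

```python
import numpy as np

class VectorDB:
    """Toy in-memory stand-in for the labelled feature-vector database."""

    def __init__(self, dim: int):
        self.vecs = np.empty((0, dim))
        self.labels = []

    def add(self, vec: np.ndarray, label: str) -> None:
        """Field update: no retraining, just append a unit-normalized vector."""
        self.vecs = np.vstack([self.vecs, vec / np.linalg.norm(vec)])
        self.labels.append(label)

    def top_k(self, query: np.ndarray, k: int = 25) -> list:
        """Return the labels of the k most similar records (cosine similarity)."""
        query = query / np.linalg.norm(query)
        order = np.argsort(self.vecs @ query)[::-1][:k]
        return [self.labels[i] for i in order]
```

In the lost-child scenario, `add()` is the only operation needed before the new face becomes searchable.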
Facial recognition using one Gemini APU (associative processing unit) chip on a single PCIe card. Columns are for databases (DB) of different sizes (384k to 10m records), records are 32-bit floating point vectors with 256 features, K=25 (looking for the top 25 most similar matches). A neural hash algorithm was used in training to fit larger databases onto the single APU chip, but for databases that are even bigger, more than one Gemini card can be linked together. (Image: GSI Technology)
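For reference, the query benchmarked above (top K=25 over 32-bit floating point vectors of 256 features) amounts to a brute-force K-nearest-neighbor scan when run on a CPU. A baseline sketch of that scan, not the APU implementation, assuming database rows are unit-normalized:

```python
import numpy as np

def top_k_matches(query: np.ndarray, db: np.ndarray, k: int = 25) -> np.ndarray:
    """Brute-force top-k by cosine similarity over an (N, 256) float32
    database whose rows are unit vectors. argpartition selects the k
    best scores without fully sorting all N of them."""
    q = query / np.linalg.norm(query)
    scores = db @ q
    idx = np.argpartition(scores, -k)[-k:]        # unordered top-k
    return idx[np.argsort(scores[idx])[::-1]]     # best match first
```

The cost of this scan grows linearly with the database size, which is why acceleration matters most for the larger database columns in the table.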
Search algorithms
Similarity search algorithms such as nearest neighbor search, approximate nearest neighbor search and K-nearest-neighbor search power a growing variety of large-scale applications. “There has been an explosion in the last few years of similarity search technology,” said Williams. “These algorithms are in use today – for example, eBay’s online object recognition search is based on similarity search at the scale of a billion items. That’s why ecommerce companies are tremendously interested in this technology, because they have billion-scale problems right now with that kind of search.”
Williams said that reverse image search and massive text search are also moving towards this kind of vector- and AI-based search. Instead of keyword matching, internet search engines use NLP models to do feature extraction and then perform a search over the resulting vectors. Social media platforms likewise reduce users’ interests to large vectors and use similarity search to recommend items you may like, based on the “like” patterns of similar users.
APU Architecture
The IP behind the APU (associative processing unit) architecture came from GSI’s 2015 acquisition of Israel-based MikaMonu; combined with GSI’s SRAM technology, it opened a new, additional direction for the company. Gemini is an APU built on a proprietary in-memory compute architecture designed to be very efficient at storing and searching large databases. Mixed into the SRAM cells are small processors optimized for performing simple calculations on large amounts of data. “Along with similarity search, the device also has a strength in doing simple algorithms that require Boolean operations, such as data manipulation,” said Wright. This includes applications like cryptography.
The architecture of the Gemini APU chip has SRAM cells mixed with programmable bit logic (Image: GSI Technology)
The Gemini chip has four cores. Each core has 16 half-banks of memory, each divided into 16 sections, and each section combines SRAM cells with programmable bit logic. In total there are 2 million of these bit processors interleaved with 48 million custom 10T SRAM cells and 96 Mb of L1 cache. Total compute capability is 25 TOPS (for 8-bit computation). The chip is manufactured on TSMC’s 28nm HPC++ process. The result is low-latency in-memory compute, thanks to very high on-chip memory bandwidth, with power-efficient operation (60 W thermal design power (TDP) for the chip). There is also a separate 16 GB DRAM on the board for datasets that don’t fit on the chip, and multiple Gemini boards can easily be linked together.
Further applications
Aside from facial recognition, the technique is also applicable to other AI applications, such as RF signal classification. The approach is the same: use the deep learning model to produce a feature-rich, lower-dimensional representation of the raw data in the form of a bit vector; build a database of different types of signal, labelled with their features; then store the entire database on the Gemini chip and use a KNN (K-nearest-neighbor) similarity search algorithm to find similar signals. The Gemini chip accelerates the similarity search portion of the process to reduce latency, and the effect is more pronounced for bigger databases. As another example, GSI’s data science team recently won a challenge set by the Israeli Ministry of Defence (Mafat) that involved designing a neural network to tell the difference between humans and animals in doppler-pulse radar signal segments. This is one of the algorithms GSI is currently working to optimize for acceleration on Gemini APU hardware.
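The bit-vector representation mentioned for RF signals lends itself to Hamming-distance search: two records are similar when their bits mostly agree. A rough CPU sketch of that KNN step (the hashing model that produces the bit vectors is assumed, not shown):

```python
import numpy as np

def hamming_knn(query_bits: np.ndarray, db_bits: np.ndarray, k: int) -> np.ndarray:
    """Brute-force K-nearest-neighbor search in Hamming space.
    query_bits: uint8 array of packed bits, shape (n_bytes,)
    db_bits:    uint8 array, shape (n_records, n_bytes)
    Returns the indices of the k closest records."""
    diff = np.bitwise_xor(db_bits, query_bits)       # differing bits
    dists = np.unpackbits(diff, axis=1).sum(axis=1)  # popcount per record
    return np.argsort(dists)[:k]
```

XOR-and-popcount is exactly the kind of simple, massively parallel Boolean operation the in-memory bit processors described above are built for.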
Other AI applications that can be accelerated include natural language processing (NLP) for semantic search (for example, searching for texts with similar meanings): an AI model handles the semantic part, and the search portion is handed over to the similarity search accelerator. GSI also has demonstrations in cheminformatics, where the Gemini APU searches a database of millions of molecules for similarities (for example, when searching for molecules that could be generic versions of particular drugs). In this case, the APU makes lower similarity thresholds practical compared to CPUs. Since GSI has rad-hard expertise, a space version of Gemini is a future possibility, according to Wright. In the meantime, Gemini is in production now, and full-height, half-length Gemini PCIe boards are available today.
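Molecular similarity screening of the kind described is commonly done over fingerprint bit vectors scored with the Tanimoto (Jaccard) coefficient; lowering the similarity threshold widens the candidate set and multiplies the search cost, which is the case the accelerator targets. A generic sketch, with fingerprints and thresholds that are purely illustrative (not GSI's data or method):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprint bit sets:
    |intersection| / |union|."""
    return len(a & b) / len(a | b)

def screen(query: set, library: dict, threshold: float) -> list:
    """Return IDs of library molecules whose fingerprints meet the
    similarity threshold. Lower thresholds admit more candidates,
    making the scan more expensive."""
    return [mid for mid, fp in library.items()
            if tanimoto(query, fp) >= threshold]
```

In practice the fingerprints would be packed bit vectors and the scan would run over millions of records, but the scoring logic is the same.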
