Attention-based networks have revolutionized natural language processing. They could do the same for embedded vision, says Perceive's CEO.
Embedded vision technologies are giving machines the power of sight, but today’s systems still fall short of understanding all the nuances of an image. An approach used for natural language processing could address that.
Attention-based neural networks, particularly transformer networks, have revolutionized natural language processing (NLP), giving machines a better understanding of language than ever before. This technique, which is designed to mimic cognitive processes by giving an artificial neural network an idea of history or context, has produced much more sophisticated AI agents than older approaches that also employ memory, such as long short-term memory (LSTM) networks and other recurrent neural networks (RNNs). NLP systems now have a deeper understanding of the questions or prompts they are fed and can create long pieces of text in response that are often indistinguishable from what a human might write.
Attention can certainly be applied to image processing, though its use in computer vision has been limited so far. In an exclusive interview with EE Times, AI expert Steve Teig, CEO of Perceive, argued that attention will come to be extremely important to vision applications.
The attention mechanism looks at an input sequence, such as a sentence, and decides after each piece of data in the sequence (syllable or word) which other parts of the sequence are relevant. This is similar to how you are reading this article: Your brain is holding certain words in your memory even as it focuses on each new word you’re reading, because the words you’ve already read combined with the word you’re reading right now lend valuable context that helps you understand the text.
Teig’s example is:
The car skidded on the street because it was slippery.
As you finish reading the sentence, you understand that “slippery” likely refers to the street and not the car, because you’ve held the words “street” and “car” in memory, and your experience tells you that the relevance connection between “slippery” and “street” is much stronger than the relevance connection between “slippery” and “car.” A neural network can try to mimic this ability using the attention mechanism.
The mechanism “takes all the words in the recent past and compares them in some fashion as a way of seeing which words might possibly relate to which other words,” said Teig. “Then the network knows to at least focus on that, because it’s more likely for ‘slippery’ to be [relevant to] either the street or the car and not [any of the other words].”
Attention is therefore a way to narrow the sequence of presented data to a subset that might be of interest (perhaps only the current and previous sentences), and then to assign each word a weight reflecting how relevant it is likely to be.
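To make the weighting step concrete, here is a minimal sketch of scaled dot-product attention, the formulation commonly used in transformer networks, applied to Teig's sentence. It is an illustration under stated assumptions rather than a description of any particular system: the two-dimensional word vectors are hand-picked for this example, whereas a real network learns its embeddings, and the function names are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product score of the query against every key vector,
    # turned into weights that sum to 1.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical 2-D embeddings, chosen by hand for illustration only.
vectors = {
    "car": [0.9, 0.1],
    "skidded": [0.5, 0.5],
    "street": [0.1, 0.9],
    "slippery": [0.2, 1.0],
}
context = ["car", "skidded", "street"]
weights = attention_weights(vectors["slippery"], [vectors[w] for w in context])
for word, weight in zip(context, weights):
    print(f"{word}: {weight:.2f}")
```

With these made-up vectors, "street" receives the largest weight for "slippery," which mirrors the relevance judgment a reader makes. The outcome depends entirely on the vectors chosen here.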
“[Attention] ended up being a way of making use of time, in a somewhat principled way, without the overhead of looking at everything that ever happened,” Teig said. “This caused people, even until very recently, to think that attention is a trick with which one can manage time. Certainly, it has had a tremendously positive impact on speech processing, language processing, and other temporal things. Much more recently, just in the last handful of months, people have started to realize that maybe we can use attention to do other focusing of information.”
Neural networks designed for vision have made very limited use of attention techniques so far. Until now, attention has been applied alongside convolutional neural networks (CNNs) or used to replace certain components of a CNN. But a recent paper by Google scientists (“An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale,” October 2020) argues that the concept of attention is more widely applicable to vision. The authors show that a pure transformer network, a type of network widely used in NLP that relies on the attention mechanism, can perform well on image classification tasks when applied directly to a sequence of image patches. The transformer network built by the researchers, Vision Transformer (ViT), achieved superior results to CNNs while requiring fewer compute resources to train.
While it may be easy to imagine how attention applies to text or spoken dialogue, applying the same concept to a still image (rather than a temporal sequence such as a video) is less obvious. In fact, attention can be used in a spatial, rather than a temporal, context here. Syllables or words would be analogous to patches of the image.
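The idea of treating patches like words can be sketched in a few lines. The example below cuts a small grayscale image into non-overlapping square patches and flattens each one into a list, producing a sequence that an attention mechanism could consume. The function name and the tiny 4 × 4 image are assumptions made for illustration, and the sketch omits the learned projection and position information that a network such as ViT would add.

```python
def split_into_patches(image, patch_size):
    # image is a list of rows; each row is a list of pixel values.
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            # Flatten one patch row by row into a single list.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# Hypothetical 4x4 grayscale image cut into 2x2 patches.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(image, 2)
print(len(patches), "patches; first patch:", patches[0])
```

Each flattened patch then plays the role that a word or syllable plays in a sentence.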
Teig’s example is a photo of a dog. The patch of the image that shows the dog’s ear might identify itself as an ear, even as a particular type of ear that is found on a furry animal, or a quadruped. Similarly, the tail patch knows it is also found on furry animals and quadrupeds. A tree patch in the background of the image knows that it has branches and leaves. The attention mechanism asks the ear patch and the tree patch what they have in common. The answer is, not a lot. The ear patch and the tail patch, however, do have a lot in common; they can confer about those commonalities, and maybe the neural network can find a larger concept than “ear” or “tail.” Maybe the network can understand some of the context provided by the image to work out that ear plus tail might equal dog.
“The fact that the ear and the tail of the dog are not independent allows us to have a terser description of what’s going on in the picture: ‘There is a dog in the picture,’ as opposed to, ‘There’s a brown pixel next to a grey pixel, next to …’ which is a terrible description of what’s going on in the picture,” said Teig. “This is what becomes possible as the system describes the pieces of the image in these semantic terms, so to speak. It can then aggregate those into more useful concepts for downstream reasoning.”
The eventual aim, Teig said, would be for the neural network to understand that the picture is a dog chasing a Frisbee.
“Good luck doing that with 16 million colors of pixels,” he said. “This is an attempt to process that down to, ‘There’s a dog; there’s a Frisbee; the dog is running.’ Now I have a fighting chance at understanding that maybe the dog is playing Frisbee.”
A step closer
Google’s work on attention in vision systems is a step in the right direction, Teig said, “but I think there’s a lot of room to advance here, both from a theory and software point of view and from a hardware point of view, when one doesn’t have to bludgeon the data with gigantic matrices, which I very much doubt your brain is doing. There’s so much that can be filtered out in context without having to compare it to everything else.”
While the Google research team’s solution used compute resources more sparingly than CNNs do, the way attention is typically implemented in NLP makes networks like transformers extremely resource-intensive. Transformers often build gigantic N × N matrices of syllables (for text) or pixels (for images) that require substantial compute power and memory to process.
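A rough count shows why the choice of token matters for the size of that N × N matrix. The sketch below compares attention over individual pixels with attention over 16 × 16 patches. The 224-pixel image side is an assumption chosen for illustration, and the function name is hypothetical.

```python
def attention_matrix_entries(num_tokens):
    # Full self-attention compares every token with every other token.
    return num_tokens * num_tokens

side = 224   # assumed image side in pixels (illustrative choice)
patch = 16   # patch side, as in the paper's title

pixel_tokens = side * side
patch_tokens = (side // patch) ** 2

pixel_entries = attention_matrix_entries(pixel_tokens)
patch_entries = attention_matrix_entries(patch_tokens)
print(f"pixels:  N={pixel_tokens}, matrix entries={pixel_entries:,}")
print(f"patches: N={patch_tokens}, matrix entries={patch_entries:,}")
```

Under these assumptions, the per-pixel matrix has billions of entries while the per-patch matrix has tens of thousands, which is one reason the patch-based approach is tractable at all.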
“The data center guys out there think, ‘Excellent — we have a data center, so everything looks like a nail to us,’ ” said Teig, and that’s how we’ve ended up with NLP models like OpenAI’s GPT-3, with its 175 billion parameters. “It’s kind of ridiculous that you’re looking at everything when, a priori, you can say that almost nothing in the prior sentence is going to matter. Can’t you do any kind of filtering in advance? Do you really have to do this crudely just because you have a gigantic matrix multiplier…? Does that make any sense? Probably not.”
Recent attempts by the scientific community to reduce the computational overhead for attention have reduced the number of operations required from N² to N√N. But those attempts perpetuate “the near-universal belief, one I do not share, that deep learning is all about matrices and matrix multiplication,” Teig said, pointing out that the most advanced neural-network research is being done by those with access to massive matrix multiplication accelerators.
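The gap between the two cost models grows with sequence length: dividing N² by N√N leaves a factor of √N. The short sketch below evaluates that ratio for a few sequence lengths. It treats the operation counts as exact and ignores constant factors, which real implementations do not, and the function names and sample lengths are assumptions for illustration.

```python
import math

def full_attention_ops(n):
    # Every token is compared with every other token.
    return n * n

def reduced_attention_ops(n):
    # Cost model cited above for the cheaper attention variants.
    return n * math.sqrt(n)

for n in (196, 1024, 16384):
    saving = full_attention_ops(n) / reduced_attention_ops(n)
    print(f"N={n}: about {saving:.0f}x fewer operations")
```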
Teig’s perspective as CEO of Perceive, an edge-AI accelerator chip company, is that there are more efficient ways of conceptualizing neural-network computation. Perceive is already using some of these concepts, and Teig thinks similar insights will apply to the attention mechanism and transformer networks.
“I think the spirit of what attention is talking about is very important,” he said. “I think the machinery itself is going to evolve very quickly over the next couple of years… in software, in theory, and in hardware to represent it.”
Is there an eventual point where today’s huge transformer networks will fit onto an accelerator in an edge device? Part of the problem, in Teig’s view, is the sheer size of networks like GPT-3, whose 175 billion parameters amount to roughly 1 trillion bits of information (assuming 8-bit parameters for the sake of argument).
“It’s like we’re playing 20 questions, only I’m going to ask you a trillion questions in order to understand what you’ve just said,” he said. “Maybe it can’t be done in 20,000 or 2 million, but a trillion — get out of here! The flaw isn’t that we have a small 20-mW chip; the flaw there is that [having] 175 billion parameters means you did something really wrong.”
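The trillion-bit figure follows from simple arithmetic, reproduced here as a quick check. The 8-bit parameter width is the article's own assumption, and the storage figure uses decimal gigabytes.

```python
params = 175_000_000_000   # GPT-3 parameter count cited in the article
bits_per_param = 8         # the article's stated assumption

total_bits = params * bits_per_param
total_gigabytes = total_bits / 8 / 1e9   # decimal gigabytes

print(f"{total_bits:.2e} bits, about {total_gigabytes:.0f} GB")
```

The exact product is 1.4 trillion bits, which the article rounds to roughly 1 trillion.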
Reducing attention-based networks’ parameter count, and representing them efficiently, could bring attention-based embedded vision to edge devices, according to Teig. And such developments are “not far away.”
This article was originally published on EE Times.
Perceive CEO Steve Teig will speak twice at the Embedded Vision Summit. In “Facing Up To Bias,” he will discuss sources of discrimination in AI systems, and in “TinyML Isn’t Thinking Big Enough,” he will challenge the notions that TinyML models must compromise on accuracy and that they should run on CPUs or MCUs.