8-bit floating point numbers and in-memory computing to help advance AI
With head-to-head kickoffs for both the International Electron Devices Meeting (IEDM) in San Francisco and the Conference on Neural Information Processing Systems (NeurIPS) in Montreal, this week looms huge for anyone hoping to keep pace with R&D developments in Artificial Intelligence.
IBM researchers, for example, are detailing new AI approaches for both digital and analog AI chips. IBM boasts that its digital AI chip demonstrates, “for the first time, the successful training of deep neural networks (DNNs) using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of deep learning models and datasets.”
Separately, IBM researchers are showcasing at IEDM an analog AI chip using 8-bit precision in-memory multiplication with projected phase-change memory.
“We do think all this work we are doing — such as trying to get the precision down so that the performance can go up and the power can go down — is really important to continue to advance AI,” Jeffrey Welser, vice president and lab director at IBM Research-Almaden, told EE Times.
This is crucial, said Welser, as the world moves from “narrow AI” where “we use AI to identify a cat, for example, on the Internet” to “broader AI” where “we analyze medical images, or we want to be able to integrate both text and imaging information together to come up with a solution.”
He added, “All those broader questions require a much larger neural net, much larger data sets and multi-modal data sets coming in... [for that], we need changes in architecture and hardware to make all that happen.”
Welser described the two papers IBM published this week as “an interesting set of good advances” allowing the industry to move toward that [broader AI] future.
Linley Gwennap, the president of the Linley Group and principal analyst, told EE Times, “Machine learning continues to evolve rapidly. Existing hardware can’t efficiently handle the largest neural networks that researchers have built, so they are looking at all kinds of new ways to improve performance and efficiency.”
These new developments will exert tremendous pressure on hardware vendors, as chip companies “must be flexible and quick to survive in this chaotic market,” Gwennap added.
End of the GPU era for AI
IBM is boldly predicting the end of GPU domination in AI.
IBM’s Welser told EE Times, “A GPU has the ability to do lots of parallel matrix multiplications for graphics processing. Such matrix multiplications happen to be exactly the same thing you need to do with neural nets.” In his view, “That was sort of a coincidence, but it’s been incredibly important. Because without that [GPUs], we would never have achieved the level of performance we are already seeing in AI performance today.” However, Welser added, “As we’ve learned more about what it takes to do AI, we are finding ways to design a hardware that can be more efficient.”
Moving to lower precision
One route to efficiency is to lower the precision required for AI processing.
Welser explained, “The general direction which we all started to realize a few years ago was that while we are used to very precise calculation — 32-bit calculation floating point is very standard, and even 64-bit, double precision for really accurate kind of calculations — that’s not necessarily always important [in AI].”
In AI, he stressed, “What you care about for the neural net is when you show an image or word if it gets the right answer. When we ask if it is a cat or a dog, it says it’s a cat. If it’s the right answer, you don’t necessarily care about all the calculations that go in between.”
Ideally, AI should mimic the human eye. Welser said, “If you look through a foggy window, you see a person walking on the street. It’s a low-precision image… but often it’s plenty to be able to say ‘oh, that’s my mom coming.’ So, it doesn’t matter whether that’s the right precision for the vision, as long as you get the right answer.”
This accounts for the trend toward lower precision in AI processing, he said.
“For 32-bit calculation, I’ve got to do calculation on 32-bits. If we can do it on 16 bits, that’s basically half the calculation power, or probably half the area or even less on a chip,” Welser went on. “If you can get down to 8 bits or 4 bits, that’s even better.” He said, “So, this gives me a huge win for area, for power, and performance and throughput — how fast we can get through all of this.”
(Source: IBM Research)
However, Welser acknowledged, “For a long time, we thought we’d have to stick with 32-bit precision for AI training. There was just no way around it.”
In 2015, IBM Research launched the reduced-precision approach to AI model training and inference with a paper describing a novel dataflow approach for conventional CMOS technologies. IBM showed models trained with 16-bit precision that exhibit no loss of accuracy compared to models trained at 32 bits.
Since then, IBM observed that “the reduced-precision approach was quickly adopted as the industry standard, with 16-bit training and 8-bit inferencing now commonplace and spurred an explosion of startups and VC investment for reduced precision-based AI chips.” Despite this trend, however, training with numbers represented in fewer than 16 bits has been viewed as almost impossible, given the need to maintain high accuracy in models.
How they did it
Welser said IBM made this possible by developing a number of methods which researchers applied to AI processing. For example, he said, “We do have some sections we do in 8 bits, some sections we do it in 16 bits for accumulation, in other sections we chunk up in different parts, so you don’t lose precision in rounding when you didn’t intend to.”
In other words, the IBM team’s accomplishment is more complicated than universally applying 8-bit calculations to the entire operation. Instead, IBM figured out how to apply a combination of methods to different parts of the process.
Welser confirmed, “Yes, that’s exactly right. For example, we can now use 8-bit for all of the weight update process, but we are still using 16-bit for some addition and accumulation step processes. And this turns out to be very important, because 16-bit additions are easier than 16-bit multiplications, so actually it’s helpful to do it in 16-bit.”
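To see why the accumulation step is so sensitive to rounding, consider the following minimal Python sketch. It is not IBM’s implementation: the helper names (round_to_mantissa, naive_sum, chunked_sum), the 10-bit mantissa and the chunk size of 64 are all assumptions chosen for illustration. The sketch simulates an accumulator with a short mantissa and compares one long running sum against a sum that is first chunked into short partial sums.

```python
import math

def round_to_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to the nearest value with `mantissa_bits` explicit mantissa bits.

    Simplification (assumed): the exponent range is unlimited, so only
    the loss of mantissa precision is modeled.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)              # x == m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # implicit leading bit + explicit bits
    return math.ldexp(round(m * scale) / scale, e)

def naive_sum(values, mantissa_bits):
    """One long running sum, rounded after every addition."""
    acc = 0.0
    for v in values:
        acc = round_to_mantissa(acc + v, mantissa_bits)
    return acc

def chunked_sum(values, mantissa_bits, chunk_size):
    """Sum short chunks first, then add the per-chunk partial sums."""
    partials = [
        naive_sum(values[i:i + chunk_size], mantissa_bits)
        for i in range(0, len(values), chunk_size)
    ]
    return naive_sum(partials, mantissa_bits)

ones = [1.0] * 4096
print(naive_sum(ones, 10))        # 2048.0: the running sum stalls
print(chunked_sum(ones, 10, 64))  # 4096.0: chunking recovers the exact total
```

In the running sum, each new addend eventually becomes too small relative to the accumulated total and is rounded away. Chunking keeps the operands of every addition closer in magnitude, which is one way to avoid losing precision “when you didn’t intend to.”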
Perhaps more significantly, as Welser noted, the key factor in IBM’s work was “to come up with a data flow architecture that allowed the data to flow through the chip very smoothly, to each of these operations, in a way you don’t end up creating some bottleneck.”
In the end, “We’ve shown you can effectively use 8-bit floating point to get the same level of accuracy that you get with 16 bit or 32-bit which people have been using in the past.”
Barriers to 8-bit operations?
The Linley Group’s Gwennap said most recent GPUs and AI chips support 16-bit floating point (FP16) using the IEEE-defined format.
However, he added, “Despite this, most developers are still training neural networks using FP32.” He said, “The problem with 8-bit FP is that there is no standard format, although there are only a few possible combinations of exponent and mantissa that make sense. Until there is a standard (either IEEE or some informal agreement), chip makers would find it difficult to implement efficiently in hardware.”
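Gwennap’s point about the “few possible combinations of exponent and mantissa” can be made concrete with a short Python sketch. Since there is no standard 8-bit format, everything here is an assumption: the function name fp8_layout is hypothetical, and the sketch borrows IEEE-style conventions (one sign bit, a biased exponent, an implicit leading one, and the top exponent code reserved for infinities and NaNs) that an actual 8-bit format need not follow.

```python
def fp8_layout(exponent_bits: int):
    """Describe a hypothetical 8-bit float with 1 sign bit.

    Assumes IEEE-style conventions; returns
    (mantissa_bits, smallest_normal, largest_finite).
    """
    mantissa_bits = 7 - exponent_bits          # 8 bits minus the sign bit
    bias = 2 ** (exponent_bits - 1) - 1
    # Top exponent code is assumed reserved for inf/NaN, as in IEEE formats.
    max_exponent = (2 ** exponent_bits - 2) - bias
    largest_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
    smallest_normal = 2.0 ** (1 - bias)
    return mantissa_bits, smallest_normal, largest_finite

for e in range(2, 7):
    m, lo, hi = fp8_layout(e)
    print(f"1 sign, {e} exponent, {m} mantissa: normal range {lo:g} to {hi:g}")
```

Under these assumptions, a 5-bit exponent reaches 57,344 but leaves only two mantissa bits, while a 4-bit exponent keeps three mantissa bits and tops out at 240. That trade-off between range and resolution is why only a handful of splits are plausible candidates.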
We asked Welser how long before the commercial world starts using 8-bit precision for training. He said he couldn’t know because “we are seeing an uptick right now in people using 16-bit technology for the first-time cases, but the bulk of the industry still looks at 32-bit…”
However, he stressed that he doesn’t see any real barriers to the reduction of precision, “as long as we are able to show the results that give equivalent output.” He noted, from users’ point of view, “If the chip goes faster, uses less power and [is] less expensive and I get the same answer out, I don’t care.”
Of course, underneath, modifications in software infrastructure must come into play.
Welser confirmed, “You’ve got to have software or algorithm that knows you are working with reduced precision, so that they can run things correctly.” With all software architecture today built around GPUs and 32-bit arithmetic, “all that must be modified to accept 16-bit or 8-bit.”
Until users have access to real hardware, the industry is likely to stick to what it already knows.
8-bit precision in-memory multiplication
IBM presented this week at the IEDM what the company described as 8-bit precision in-memory multiplication with projected phase-change memory.
At the IEDM, IBM scientists have published research on a new type of in-memory computing device that can compute at energy levels 100 to 1,000 times lower than today's commercial technology. The device is ideally suited for AI applications at the edge, such as autonomous driving, healthcare monitoring, and security. (Source: IBM Research)
The engineering community is already aware that a key to reducing energy consumption is to minimize the occasions in computing architecture when data must be moved from memory to processors for calculations. Such movement takes a lot of time and energy.
The need for more efficient AI processing has prompted many to look into in-memory computing. Among AI chip startups pursuing this, Mythic, on which EE Times has reported, stands out. But there may be more.
In Welser’s opinion, analog technology is “a natural fit for AI at the edge.” As observed in the history of computing, analog computing required little power, making it highly energy efficient. But it was also inaccurate. “That’s how digital computing eventually won out over analog computing,” Welser said.
But analog is coming back, as “In-memory compute works with analog computing,” said Kevin Krewell, principal analyst at Tirias Research. He explained, “The memory array holds the neural net weights and the analog elements perform the summation and trigger.”
Krewell added, “The challenge is keeping the analog properly calibrated and accurate across process and temperature variations. Also, the memory and analog elements don't scale as well as digital elements.”
Weights are resistance values in memory
Similarly, Welser explained that weights used in neural nets in analog computing are “resistance values that sit right there in memory. They don’t have to be moved in and out. They are all set.” In other words, with in-memory computing architecture, “memory units moonlight as processors, effectively doing double duty of both storage and computation,” Welser said.
The challenge, though, said Welser, is: “What is this resistor we are going to use, that would allow us to set it at various resistance levels as we do training? It must be accurate enough to be useful.”
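The idea of weights that sit in memory and compute in place can be sketched in a few lines of Python. This is a textbook idealization, not IBM’s device or architecture: the function names (quantize_to_levels, crossbar_mac) are hypothetical, weights are assumed non-negative, and noise, drift and calibration are ignored. Each weight is stored as a conductance, each input is applied as a voltage, and the current collected on each column is the sum of voltage times conductance, which is exactly a multiply-accumulate.

```python
def quantize_to_levels(weight, levels, w_max):
    """Snap a weight in [0, w_max] to the nearest of `levels` evenly
    spaced conductance values (an assumed, idealized memory cell)."""
    step = w_max / (levels - 1)
    clamped = min(max(weight, 0.0), w_max)
    return round(clamped / step) * step

def crossbar_mac(voltages, conductances):
    """Column currents of an ideal crossbar: I[j] = sum_i V[i] * G[i][j]."""
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]

# Two inputs driving a 2x2 array of stored weights.
weights = [[0.5, 1.0],
           [0.25, 0.0]]
print(crossbar_mac([1.0, 2.0], weights))   # [1.0, 1.0]

# The number of programmable levels limits how faithfully a weight is stored.
print(quantize_to_levels(0.3, 5, 1.0))     # 0.25 with only 5 levels
```

With 256 levels, the equivalent of 8 bits, the worst-case storage error in this idealized model shrinks to half of 1/255 of the weight range, which is why the number of reliably distinguishable resistance levels matters so much.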
While digital AI hardware races to reduce precision, analog has thus far been limited by its relatively low intrinsic precision, impacting model accuracy, Welser explained.
In developing the ability to approach 8-bit precision, IBM used phase-change memory (PCM). PCM has been used in analog memory, said Welser. In this case, “We are using PCM to store many more different resistance levels back and forth. More importantly, we are using a novel architecture around it,” he noted.
IBM’s paper details the technique that achieved 8-bit precision in a scalar multiplication operation. This resulted in “roughly doubling the accuracy of previous analog chips, and consumed 33x less energy than a digital architecture of similar precision,” the company claimed.
Gwennap acknowledged that IBM has been working on PCM for some time, but he called it still “just a research project.”
Gwennap sees the biggest challenge for the PCM approach as manufacturability. “Analog characteristics vary from transistor to transistor and from chip to chip on a production line, which is why most of the industry uses digital circuits that are less susceptible to this variation.”
EE Times asked both the Linley Group and IBM about the status of commercial AI chips, such as Mythic’s, that use in-memory computing. Gwennap said, “Mythic seems to be the closest to bringing this technology to production, but even they are at least a year away.”
IBM acknowledged, “Mythic has an interesting approach focused on using in-memory computing.” However, Big Blue noted that Mythic’s chip is “ONLY for inference applications.”
According to an IBM spokesperson, the IBM difference is: “We believe that a complete AI solution requires accelerating both inference and training. We are developing and maturing non-volatile memory elements that can be used for BOTH inference and training.”
— Junko Yoshida, Global Co-Editor-In-Chief, AspenCore Media, Chief International Correspondent, EE Times