Processing Bends Toward AI

Article By : Brian Santo

From ISSCC: Google revealed an AI that excels at a key element of ASIC design. ISSCC plenary speakers describe how profoundly AI is affecting the entire semiconductor industry.

SAN FRANCISCO – Google is experimenting with machine learning (ML) to perform place-and-route in IC design and is getting great results. The revelation, announced last week at ISSCC, held here, is as important for artificial intelligence (AI) as it is for circuit design.

AI has been the most massive thing in the electronics sector for years, pulling an extraordinary amount of semiconductor research in its direction (along with venture capital and headlines). Acknowledging the obvious, the theme of this year’s International Solid-State Circuits Conference (ISSCC) was “Integrated Circuits Powering the AI Era,” and the opening plenary session was constructed to map the extent to which AI has warped semiconductor space.

The four plenary speakers explained how the requirements of AI are, for example, driving a new category of processors architected specifically for AI applications (alongside CPUs and GPUs); spurring innovations in structure (e.g., chiplets, multichip packages, interposers); and are even influencing the development of quantum computing.

The plenary session’s first speaker was Jeff Dean, the lead at Google AI. Dean delivered an update of the overview of machine learning (ML) that he’s been presenting in one form or another for more than a year to lead into the discussion of the ML place-and-route tool.

Results of a human expert at placing and routing an ASIC design versus the results from a low-power ML accelerator chip. Google deliberately obscured parts of the images. (Source: Google Research / ISSCC)

He started with a quick overview of the history of AI and ML, starting with machines that learned how to play backgammon in 1995, and running through machines that learned to excel at chess, and then at go, and can now negotiate complex video games such as StarCraft “with remarkable success.” ML is also being used in medical imaging, robotics, computer vision, self-driving vehicles, neuroscience (analyzing microscopy of brain scans), agriculture, weather forecasting, and more.

The basic idea that drove computing for decades is that the bigger the problem, the more processing power you throw at it, and the more processing power you have, the bigger the problems you can solve. For a while, that applied to problem-solving with AI.

Where that broke down was when problem spaces got so mind-bogglingly vast it was simply not possible to amass enough CPUs (and/or GPUs) to solve them.

It turned out that AI/ML doesn’t need typical CPU/GPU power, however. The math required can be simpler and requires much less precision. That realization had practical ramifications: processors dedicated to AI/ML don’t have to be as complex as CPUs/GPUs.

That’s one of the fundamental insights that led to specialized processors designed for inference, such as Google’s own Tensor Processing Units (TPUs), now in their third generation. Google, by the way, is commonly expected to come out with a fourth-generation TPU one of these days, but if anyone had hoped Google would reveal anything about it at ISSCC, those hopes were dashed.

The realization that less precision is necessary for inference was followed by the realization that less precision is needed for training as well, which is a relatively new finding. EE Times editor Sally Ward-Foxton explained the concept in her recent blog, “Artificial Intelligence Gets Its Own System of Numbers.”
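The reduced-precision idea can be illustrated with a minimal Python sketch. It is not taken from Dean's talk or from any Google design; the function name `quantize_int8`, the vector length, and the random test data are all assumptions made for illustration. The sketch squeezes two floating-point vectors into 8-bit integers with one shared scale each, does the multiply-accumulate in integers, and compares the result with the full-precision dot product.

```python
import random

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale.

    Hypothetical helper for illustration: simple symmetric quantization.
    """
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 127.0
    return [round(v / scale) for v in values], scale

def dot(a, b):
    """Plain multiply-accumulate over two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

random.seed(0)  # fixed seed so repeated runs give the same numbers
weights = [random.uniform(-1, 1) for _ in range(256)]
inputs = [random.uniform(-1, 1) for _ in range(256)]

q_weights, w_scale = quantize_int8(weights)
q_inputs, x_scale = quantize_int8(inputs)

exact = dot(weights, inputs)
# Integer multiply-accumulate, rescaled to real units once at the end.
approx = dot(q_weights, q_inputs) * w_scale * x_scale

print(f"full precision: {exact:.4f}")
print(f"int8 estimate:  {approx:.4f}")
```

The two printed values should be close, which is the practical point: hardware that only multiplies and adds small integers can be much simpler than a full floating-point unit.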

AI/ML processors can be relatively simple, and therefore relatively cheap, and we now have AI/ML processors that are powerful enough to train pretty rapidly, even on enormous data sets. All of that is making it easier to push machine learning farther out into the network edge, Dean explained. A specific example is speech recognition; Dean said that as of 2019 Google has had a pretty compact model that works on smartphones.

Each AI application — autonomous driving, medical imaging, playing go — results from tweaking a dedicated AI/ML system to learn each. We have basically one AI per application. The next question was: is it possible to take an AI that learned one thing, and then see if it can apply what it’s learned to some other task that is similar?

“I brought this up because we began thinking about using this for place-and-route in ASIC design,” Dean said. “The game of place-and-route is far bigger than the game of go. The problem size is larger, though there isn’t as clear a goal as there is with go.”

Google created a learning model for place-and-route, and then set out to find if the tool could generalize. Could it take what it learned on one design and apply it to a new design it had never seen before? The answer was an unambiguous “yes.”

Furthermore, Dean said, “We’ve gotten super-human results on all the blocks we’ve tried so far. It does a little bit better, and sometimes significantly better than humans.”

Google measured the performance of an AI that used machine learning (ML) to teach itself to place and route components of an ASIC. The test circuits were several different blocks, including an Ariane RISC-V CPU. Google compared the performance of the same ML model after progressive intervals of additional tuning, all against the performance of a commercial tool. (Source: Google Research / ISSCC)

“Better” includes performing place-and-route in far less time. It might take a human expert weeks and weeks to accomplish the task. An ML placer typically does the same job in 24 hours, and its layouts typically have shorter wirelengths, Dean reported. The ML placer also did well against automated place-and-route tools. (Read more about ML and place-and-route in “Machine learning in EDA accelerates the design cycle,” written by Cadence’s Rod Metcalfe, in EE Times’ sister publication EDN.)
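For readers unfamiliar with wirelength as a figure of merit, here is a rough Python sketch. The article does not say how Google measured wirelength; the sketch assumes half-perimeter wirelength (HPWL), a common estimate in placement work, and the cell names, nets, and coordinates are invented for illustration.

```python
def hpwl(net, placement):
    """Half-perimeter wirelength of one net.

    Half the perimeter of the smallest rectangle that encloses
    every cell connected by the net.
    """
    xs = [placement[cell][0] for cell in net]
    ys = [placement[cell][1] for cell in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(nets, placement):
    """Sum of HPWL over all nets; lower generally means a tighter layout."""
    return sum(hpwl(net, placement) for net in nets)

# Hypothetical netlist: each tuple lists the cells one wire must connect.
nets = [("a", "b"), ("b", "c", "d"), ("a", "d")]

# Two candidate placements of the same four cells, as (x, y) grid positions.
spread = {"a": (0, 0), "b": (9, 0), "c": (0, 9), "d": (9, 9)}
compact = {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (1, 1)}

print(total_wirelength(nets, spread))   # 45
print(total_wirelength(nets, compact))  # 5
```

A placer, human or ML, is in effect searching over dictionaries like `spread` and `compact` for one that scores well on this kind of metric while also respecting congestion, timing, and density constraints that this sketch ignores.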

ML might also be extended to other parts of the IC design process, Dean said, including using ML to help generate test cases that more fully exercise state space in ASIC design verification, and perhaps also using ML to improve high-level synthesis to get to more optimized designs from high-level descriptions.

What all this means for ML, however, is as important as what it means for accelerating IC design schedules. If an ML model can generalize within a category, can it generalize to perform tasks in other categories?

“What might future ML models look like?” Dean asked. “Can we train one model to generalize to similar tasks? Ideally we’d want one model that can learn to do thousands or millions of tasks.”

The artificial intelligence Internet of things (AIoT)

Kou-Hung “Lawrence” Loh, the senior vice president and chief strategy officer at MediaTek, spoke of how AI is transforming just about everything connected to the Internet. He said the AI Internet of things (or AIoT) will rapidly expand from the tens of billions of devices today to encompass an estimated 350 billion devices worldwide by 2030.

AI is moving toward the edge in part because it can (as Dean mentioned earlier in the session) and in part because, in many cases, it has to. Local processing alleviates the growing processing burden on data centers and minimizes traffic on networks, and some applications require local processing or will work best with it.

Local processing will have to be fast, it will have to be designed specifically for AI computation, and it will have to be extremely energy-efficient.

These are by nature a new category of processor. Loh called them AI processor units (APU). Others have referred to them variously as neural processing units (NPU), brain processing units (BPU), and other names. An APU might be less flexible than a CPU, for example, but by virtue of being purpose-built, APUs can be as much as 20 times faster at 55 times less power, he said.

Loh said that APU developers are working on devices that will reach 1 TOPS at 3 TOPS/W. He said he believes 10 TOPS at 10 TOPS/W is achievable. It might eventually be possible to get to 100 TOPS at 30 TOPS/W, he said.
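Those pairs of figures imply a power budget: dividing throughput (TOPS) by efficiency (TOPS/W) gives watts. The short Python sketch below works through the arithmetic for the three targets Loh cited. The function name `power_watts` and the labels are assumptions added here, and the results ignore memory, I/O, and everything else on the chip.

```python
def power_watts(tops, tops_per_watt):
    """Power implied by a throughput target at a given efficiency."""
    return tops / tops_per_watt

# (label, TOPS, TOPS/W) for the three targets quoted from Loh's talk.
targets = [
    ("in development", 1, 3),
    ("believed achievable", 10, 10),
    ("possible eventually", 100, 30),
]

for label, tops, efficiency in targets:
    watts = power_watts(tops, efficiency)
    print(f"{label}: {tops} TOPS at {efficiency} TOPS/W -> {watts:.2f} W")
```

The implied budgets come to roughly a third of a watt, one watt, and a little over three watts, which shows that throughput in these projections grows faster than efficiency.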

Not coincidentally, MediaTek researchers presented at ISSCC a separate paper proposing a “3.4 to 13.3TOPS/W 3.6 TOPS Dual-Core Deep Learning Accelerator for Versatile AI Applications in a 7nm 5G Smartphone SoC.”

That’s at 7nm. Performance improvements will be gained by racing along the curve of Moore’s Law to smaller process nodes for at least one more step, from the present 7nm to 5nm. Moore’s Law still applies, Loh said.

Not without caveats, however. Transistor counts are increasing with integration, continuing to follow the classic Moore’s Law curve, “but the cost per transistor is not following,” Loh said. Furthermore, due to the complexity of chip design, and because process steps are getting more complicated, costs for leading-edge devices are soaring, prohibiting smaller companies from using the technology. There are also yield issues.

A common solution to many of these problems is splitting the die, Loh said. As a practical matter, that might mean using approaches such as chiplet technology. “It can lead to doing better than Moore’s Law,” he said. Whether it’s chiplets or some other architectural approach, it all means more challenges in interconnect.

System technology “co-optimization”

Nadine Collaert, program director at Imec, took the plenary theme a step further, going over the need to separate die and figure out alternative structures and architectures for integrated circuits in the future. She called it system technology co-optimization, or STCO.

Moore’s Law is likely to pertain for years to come, but scaling CMOS is getting more challenging, she said. She illustrated the point with a series of examples of ever-more-complicated device structures, including (but hardly limited to) FinFETs, nanosheets, and forksheets, that can indeed be used to achieve further CMOS scaling at the chip level.

Imec demonstrated the ability to grow an unspecified III-V material on a silicon-on-insulator (SOI) substrate in a nano-ridge formation. (Source: Imec / ISSCC)

Something eventually has to give, however, she explained. A new approach is needed and “we believe 3D technologies are the best way. That includes multi-die packages, using bonding or, even at the device-level, fine-grade connections with other standard cells.”

Figuring out which technology to use will require matching system requirements against the properties of the options available. “That’s going to be a complicated exercise,” Collaert said. That is going to put pressure on EDA vendors to provide tools that will enable designers to weigh their options.

Front-end modules for wireless communications systems are going to be a particular challenge. “Generally these are the most diverse systems — they have many different components with different technologies, and that complexity will increase with more antennas, more PAs, more filters…”

The industry is moving to higher frequencies and higher efficiency. One option is combining III-V materials such as GaN (or other compound semiconductors such as SiC) with CMOS to get the benefits of both. That can be done with 3D integration, she said, showing several examples including an image of a 3D nano-ridge with a III-V material grown on a silicon-on-insulator (SOI) substrate, “but there’s a lot of work that needs to go into enabling this.”

As for memories? “New apps like AI and ML are driving the roadmap,” Collaert said. They need fast-access memories. “There’s a push to look at compute in memory, and as you bring logic and memory closer, 3D packaging is of course very important.”

Moving forward, using flash in advanced applications will be about stacking more tiers, she said. There’s also a desire to improve channel current in these memories. “To do that, we have to look at channel mobility, and again, that means looking at III-V materials.” And by extension looking at 3D architectures that stack a layer of silicon with a layer of a III-V material.

Meanwhile, in DRAMs, capacitors are growing from squat cylinders to pillars — yet another shift in the third dimension. Other memory options include magnetic memory for cache replacement, and 3D storage class memory — Collaert noted that Imec has demonstrated a vertical FeFET (ferroelectric field effect transistor) that still needs more research.

The development of all of these memories, she said, “is all in the context of machine learning. AI is booming. A lot of this is in the cloud, but for various reasons we want to move it to the edge, where there will be constraints on energy.”

Imec is more optimistic than MediaTek, in that it believes it might be possible to get to 10,000 TOPS/W.

“Scaling continues. The party is not over!” she concluded. “New memories might not make it into the roadmap, but they may have applications in machine learning.”

Quantum computing

Dario Gil, director of IBM Research, wrapped up the plenary by addressing “what’s next,” which he said is generalized AI that will almost certainly be achieved on quantum computers. That said, the key thrust of his talk was that the greatest benefits will probably be derived from the complementary use of bits (digital processing), neurons (AI), and qubits (quantum computing).

He noted that IBM opened access to its first quantum computer through the cloud in 2016, and that it now has 15 quantum computers available, including its latest 53-qubit model.
