Makes more room for larger ARM, DSP, I/O and inference blocks
SAN JOSE, Calif. — Xilinx released the first details of its next-generation Everest architecture, now called Versal. It shows the microprocessor landscape is blurring as CPUs, GPUs and FPGAs morph into increasingly similar SoC-like devices.
Versal shrinks the size of a central FPGA block to make room for more ARM, DSP, inference and I/O blocks. It comes as Intel and AMD make room for beefier GPUs in their x86 chips and Nvidia adds specialty cores for jobs like deep learning on its GPUs.
Xilinx positioned Versal as the start of a broad new family of standard products. They aim to outperform CPUs and GPUs on a wide range of data center, telecom, automotive and edge applications and increasingly support programming in high-level languages such as C and Python.
At a time when Moore’s law is slowing, Versal is Xilinx’s effort to “step up its game to be a peer of Intel and Nvidia,” giants three to 20+ times its size, said analyst Kevin Krewell of Tirias Research.
Under the covers, Versal sports a network-on-chip (NoC) based on hardened AXI blocks and a management controller, both intended to deliver new levels of programmability and ease of use. It also supports a new homegrown inference accelerator.
Xilinx released a handful of benchmarks based on simulations showing initial 7nm chips can beat 16-12nm CPUs and GPUs. First chips will tape out later this year, and initial Versal products will arrive in the second half of 2019.
Like the silicon, the software is still a work in progress. On Tuesday, the company will announce new C-language capabilities for data center applications on Versal and its existing 16nm FPGAs. A unified software environment for high-level languages is about a year away.
Versal’s NoC uses hardened AXI blocks to let programmers define, in software, memory hierarchies that are mapped to hardware registers. It also lets them check programming paths and define NoC master/slave connections for security.
The look-up table (LUT) block at the core of Versal can be dynamically configured at 8x the speed of today’s FPGAs. However, its underlying architecture is similar to that of today’s FPGAs, with a 10-20% speedup due to the 7nm process and support for a broader range of LUT/memory ratios.
Xilinx designed a new VLIW/SIMD vector processor to accelerate AI jobs in Versal. It runs at more than 1 GHz in arrays of tens to hundreds of cores. The largest initial arrays include 400 cores, each with 32 KBytes of memory; counting the blocks on its nearest neighbors, a core can directly address a total of 128 KBytes.
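As a rough back-of-the-envelope check on those memory figures, the sketch below totals the memory in a 400-core array and works out how many 32-KByte blocks make up the 128 KBytes a core can address. It assumes 1 KByte = 1,024 bytes and that every block is the same size; the function and constant names are illustrative and not taken from Xilinx.

```python
# Back-of-the-envelope arithmetic for the largest initial AI array.
# The figures come from the article; the 1 MByte = 1,024 KBytes convention
# and the equal-block-size assumption are ours, not Xilinx's.

CORES = 400            # cores in the largest initial array
LOCAL_KB = 32          # KBytes of memory per core
ADDRESSABLE_KB = 128   # KBytes a core can address directly


def total_array_memory_kb(cores: int = CORES, local_kb: int = LOCAL_KB) -> int:
    """Total memory across all cores in the array, in KBytes."""
    return cores * local_kb


def addressable_blocks(addressable_kb: int = ADDRESSABLE_KB,
                       local_kb: int = LOCAL_KB) -> int:
    """Number of equal-size memory blocks within a core's direct reach."""
    return addressable_kb // local_kb


if __name__ == "__main__":
    total_kb = total_array_memory_kb()
    print(f"Total array memory: {total_kb} KBytes ({total_kb / 1024:.1f} MBytes)")
    print(f"Blocks directly addressable per core: {addressable_blocks()}")
```

Under these assumptions, the full array would hold 12.5 MBytes, and the 128 KBytes would correspond to a core's own block plus three neighboring blocks. Xilinx has not confirmed that breakdown.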
The arrays will not support Bfloat-16, a hot precision level promoted by data center giants such as Google and Microsoft. However, such work could be handled by the traditional FPGA block.
Initial 7nm Versal products include:
- Dual-core Cortex-A72
- Dual-core Cortex-R5
- Multi-terabit/s NoC
- Management controller to handle boot up
- DSPs and/or vector arrays
They support interfaces such as:
- DDR4-3200 and LPDDR4-4266
- 112G serdes in Premium versions in 2020
- High bandwidth memory in late 2021
- PCIe Gen 4 x16
- CCIX and AXI-DMA
- MIPI D-PHY
Xilinx defined six families of Versal chips. Two of them, Prime and AI Core, will ship late next year. Three more (AI Edge, AI RF and Premium) will follow in 2020, with a version supporting HBM expected in late 2021.
The Prime family comes in nine configurations and has the broadest range of target markets. The AI Core version has five members, and Xilinx detailed all 14 variants on its Web site.
The chips span 5-150W in power consumption. They will support 0.7, 0.78 and 0.88V. Prices will align with traditional Xilinx FPGAs, although they are likely to get more competitive at the low end and for hot use cases such as self-driving cars.
“In ADAS and small power envelopes we will go up against the likes of [Intel] Mobileye in performance/dollar,” said a Xilinx representative.
In a networking simulation, a Prime series chip delivered 150 million packets/second on an open virtual switch application versus 8.68 million for a traditional CPU. On convolutional neural nets at high batch sizes, the AI Core version delivered 43x the inference performance of a CPU and 2x that of a GPU.
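The networking figures above are quoted as raw packet rates rather than as a ratio. A minimal sketch of the implied speedup follows; the constant and function names are illustrative, and the inputs are Xilinx's simulated numbers, not measurements.

```python
# Ratio implied by the simulated networking figures quoted in the article.
# Names are illustrative; the inputs are Xilinx's claims, not measured data.

VERSAL_PRIME_MPPS = 150.0   # million packets/second, Prime series (simulated)
CPU_MPPS = 8.68             # million packets/second, traditional CPU


def speedup(accelerated: float, baseline: float) -> float:
    """Return how many times faster `accelerated` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return accelerated / baseline


if __name__ == "__main__":
    ratio = speedup(VERSAL_PRIME_MPPS, CPU_MPPS)
    print(f"Implied packet-rate speedup: {ratio:.1f}x")
```

By this arithmetic, the Prime result works out to roughly 17x the CPU's packet rate.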
At sub-two millisecond latencies targeting edge apps, AI Core beat a GPU by 8x in performance simulations and 4x in throughput at 75W, Xilinx claimed. The AI parts are optimized to run a mix of inference and other applications, not just deep learning.
Just how users fare programming the multi-headed Versal remains to be seen. A handful of early customers have been given access to simulation materials. However, it could take more than a year before they have extensive hands-on experience with final developer tools and libraries.
Microsoft was an early adopter of FPGAs in its data center servers, using parts from rival Altera, now part of Intel. At that time, Microsoft engineers were vocal that the RTL programming style of FPGAs was a chief challenge.
“Xilinx has to develop an ecosystem around it, and it takes time for developers to understand how these things will be used,” said Karl Freund, an analyst with Moor Insights & Strategy.