At Hot Chips, Chinese startup Biren has emerged from stealth, detailing a large, general-purpose GPU (GPGPU) chip intended for AI training and inference in the data center. The BR100 is composed of two identical compute chiplets, built on TSMC 7 nm at 537 mm² each, plus four stacks of HBM2e in a CoWoS package.
“We were determined to build larger chips, so we had to be creative with packaging to make BR100’s design economically viable,” said Biren CEO Lingjie Xu. “BR100’s cost can be measured by better architectural efficiency in terms of performance per watt and performance per square millimeter.”
The BR100 can achieve 2 POPS of INT8 performance, 1 PFLOPS of BF16, or 256 TFLOPS of FP32. This is doubled to 512 TFLOPS of 32-bit performance when using Biren’s new TF32+ number format. The GPU also supports other 16- and 32-bit formats but not 64-bit (64-bit is not widely used for AI workloads outside of scientific computing).
Using chiplets for the design meant Biren could break the reticle limit but retain yield advantages that come with smaller die to reduce cost. Xu said that compared with a hypothetical reticle-sized design based on the same GPU architecture, the two-chiplet BR100 achieves 30% more performance (it is 25% larger in compute die area) and 20% better yield.
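The 25% area figure can be cross-checked with simple arithmetic. The sketch below assumes a reticle limit of 26 mm × 33 mm (858 mm²), a commonly cited maximum exposure field that the article itself does not state; the variable names are illustrative only.

```python
# Cross-check of the die-area claim. The reticle dimensions are an
# assumption (a commonly cited field size), not a figure from Biren.
CHIPLET_AREA_MM2 = 537                 # per compute chiplet, from the article
NUM_CHIPLETS = 2                       # BR100 uses two identical chiplets
RETICLE_W_MM, RETICLE_H_MM = 26, 33    # assumed maximum exposure field

total_compute_area = CHIPLET_AREA_MM2 * NUM_CHIPLETS
reticle_area = RETICLE_W_MM * RETICLE_H_MM
area_ratio = total_compute_area / reticle_area

print(f"Total compute area:    {total_compute_area} mm^2")
print(f"Assumed reticle limit: {reticle_area} mm^2")
print(f"Area ratio:            {area_ratio:.2f}x")
```

Under that assumption, the two chiplets together come to roughly 1.25 times the area of a single reticle-sized die, which is consistent with the 25% figure Xu quoted.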
Another advantage of the chiplet design is that the same tapeout can be used to make multiple products. Biren also has the single-chiplet BR104 on its roadmap.
High-speed serial links between the chiplets offer 896-GB/s bidirectional bandwidth, which allows the two compute tiles to operate like one SoC, said Biren CTO Mike Hong.
In addition to its GPU architecture, Biren has developed a dedicated 412-GB/s chip-to-chip (BR100 to BR100) interconnect called BLink, with eight BLink ports per chip. This is used to connect to other BR100s in a server node.
Each compute tile has 16 streaming processor clusters (SPCs), connected by a 2D mesh-like network on chip (NOC). The NOC has multi-tasking capability for data-parallel or model-parallel operation.
Each SPC has 16 execution units (EUs), which can be split into compute units (CUs) of four, eight, or 16 EUs.
Each EU has 16 streaming processing cores (V-cores) and one tensor core (T-core). The V-cores are general-purpose SIMT processors with a full-set ISA for general-purpose computing; they handle data preprocessing and operations like Batch Norm and ReLU, and they manage the T-core. The T-core accelerates matrix multiplication and addition, plus convolution, the operations that make up the bulk of a typical deep-learning workload.
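Multiplying out the hierarchy described above gives the total unit counts for a two-tile BR100. The sketch below uses only the per-level counts quoted in the article; the constant names are illustrative, and the assumption that an SPC is divided evenly into CUs of a single size is ours, not Biren's.

```python
# Unit counts taken from the article; names are illustrative.
TILES_PER_BR100 = 2
SPCS_PER_TILE = 16
EUS_PER_SPC = 16
V_CORES_PER_EU = 16
T_CORES_PER_EU = 1

total_spcs = TILES_PER_BR100 * SPCS_PER_TILE
total_eus = total_spcs * EUS_PER_SPC
total_v_cores = total_eus * V_CORES_PER_EU
total_t_cores = total_eus * T_CORES_PER_EU

# CUs per SPC for each supported CU size, assuming a uniform split
# (the article does not say whether sizes can be mixed within an SPC).
cus_per_spc = {size: EUS_PER_SPC // size for size in (4, 8, 16)}

print(f"SPCs: {total_spcs}, EUs: {total_eus}")
print(f"V-cores: {total_v_cores}, T-cores: {total_t_cores}")
print(f"CUs per SPC by CU size: {cus_per_spc}")
```

On these numbers, a full BR100 would contain 32 SPCs, 512 EUs, 8,192 V-cores, and 512 T-cores.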
Biren has also invented its own number format, E8M15 (eight exponent bits and 15 mantissa bits), which it calls TF32+. This format is intended for AI training; it has the same-sized exponent (same dynamic range) as Nvidia’s TF32 format but with five extra bits of mantissa (in other words, it is five bits more precise). This means the BF16 multiplier can be reused for TF32+, simplifying the design of the T-core.
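To make the precision difference concrete, the sketch below emulates reduced-mantissa formats in software by rounding an FP32 value to a given number of explicit mantissa bits while leaving the 8-bit exponent untouched. This is an illustration only: the helper `quantize_mantissa` is hypothetical, and round-to-nearest-even is an assumed rounding mode, since the article does not describe how the BR100 hardware rounds.

```python
import struct

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round an FP32 value to `mantissa_bits` explicit mantissa bits.

    The sign and 8-bit exponent are kept as in FP32. Rounding is
    round-to-nearest-even, which is an assumption for illustration.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if (bits >> 23) & 0xFF == 0xFF:      # leave infinities and NaNs alone
        return x
    drop = 23 - mantissa_bits            # low mantissa bits to discard
    if drop > 0:
        lsb = (bits >> drop) & 1         # lowest bit that will be kept
        bits += (1 << (drop - 1)) - 1 + lsb
        bits &= ~((1 << drop) - 1) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Explicit mantissa widths: BF16 = 7, TF32 = 10, TF32+ (E8M15) = 15.
FORMATS = {"BF16": 7, "TF32": 10, "TF32+": 15}

x = 1 + 2**-15                           # needs 15 mantissa bits
for name, m in FORMATS.items():
    print(f"{name:6s} -> {quantize_mantissa(x, m)!r}")
```

With this input, the 15-bit TF32+ mantissa preserves the value exactly, while the assumed rounding collapses it to 1.0 in both TF32 and BF16. One plausible reading of the multiplier reuse, which the article does not spell out, is that a TF32+ significand is 16 bits wide once the implicit leading one is included, exactly twice the 8-bit significand of BF16.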
Xu said the company has already submitted results for the next round of MLPerf inference benchmarks, which should be published in the next few weeks.
This article was originally published on EE Times.
Sally Ward-Foxton covers AI technology and related issues for EETimes.com and all aspects of the European industry for EETimes Europe magazine. Sally has spent more than 15 years writing about the electronics industry from London, UK. She has written for Electronic Design, ECN, Electronic Specifier: Design, Components in Electronics, and many more. She holds a master’s degree in Electrical and Electronic Engineering from the University of Cambridge.