Specialized Accelerators Enable Vector Processing on RISC-V

Article by: Charlie Cheng

Special-purpose, domain-specific accelerators can benefit from pairing the RISC-V vector extension with a high-bandwidth memory system.

When the RISC-V market first emerged, the initial rush was to cost-reduce designs that would otherwise have used proprietary CPU instruction set architectures (ISAs) in deeply embedded applications. When these systems-on-chip (SoCs) began to be fabricated in FinFET process technology, mask costs grew so high that many finite state machines were replaced with programmable micro-sequencers based on the RISC-V instruction set. Together, these trends created the initial excitement around simple RISC-V cores, and later their commoditization, from 2014 to 2018.

As the RISC-V architecture matured and SoC designers became familiar with the ISA, it found adoption in real-time applications that demanded high performance, in particular as a front end to highly specialized acceleration engines for applications such as artificial intelligence. One key reason for this adoption is that RISC-V is an open architecture to which users can add instructions, so RISC-V processors do not have to treat accelerators as memory-mapped I/O devices, as traditional architectures do. Instead, they can drive the accelerator as a low-latency co-processor.

The availability of RISC-V processors with the vector extension lets specialized accelerators hand off the processing of the layers that sit between the inner loops of the kernel in applications such as artificial intelligence (AI), augmented reality/virtual reality (AR/VR), and computer vision. But this is not possible without purpose-built extensions, such as a custom load instruction that brings data from an external accelerator into the processor's internal vector registers.

Driving this shift is the programming model demanded by these applications. The special-purpose accelerator, which is essentially one large array of multipliers, is highly efficient but rather inflexible, both in the operations it performs and in how it moves data. Contrast this with a general-purpose processor such as an x86, which gives the programmer the ultimate flexibility to program without regard to the constraints of the compute engine, provided the design has 100 W of power to burn, which most don't.

The standard vector extension in RISC-V augmented with specialized custom instructions is an ideal companion to the accelerator (Image: Andes Technology) 

The obvious solution is to combine the flexibility of a general-purpose CPU with an accelerator that can handle a very specific task (see figure above). In RISC-V, the maturing standard vector extension augmented with specialized custom instructions is an ideal companion to the accelerator, and this adoption has become apparent in the past 18 months as domain-specific acceleration (DSA) solutions converge onto RISC-V platforms.

To make this vision possible, we have observed that the accelerator must be able to execute its own command set using its own resources, including memory. To streamline the accelerator's execution, the RISC-V processor should also be able to flatten the microcode out as wide as necessary and pack all the required control information into a single command to the accelerator. In addition, this accelerator command set should be aware of the RISC-V processor's scalar and vector registers as well as the accelerator's own resources, such as control register files and memory.

When the accelerator needs help reordering or manipulating data in special ways, Andes addresses this with a vector processing unit (VPU) that handles the complicated work of data permutation: shifting, gathering, compressing and expanding. Some of the kernels that run in between layers are complicated, and here the VPU provides the flexibility to handle them. In these sockets, the accelerator and the VPU both perform a huge number of parallel computations, so we added hardware that significantly raises the bandwidth of the memory subsystem to match the computation demand, including but not limited to prefetch and non-blocking transactions with out-of-order return.
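To make the four permutation categories concrete, the sketch below models them on plain Python lists. It illustrates the general semantics of such operations only; the function names are chosen for this example and are not NX27V instructions or an Andes API.

```python
# Illustrative models of common vector permutations on Python lists.
# Function names are hypothetical and chosen for this example only.

def vslidedown(src, offset, fill=0):
    """Shift: move elements toward index 0 by `offset`, padding the tail."""
    return src[offset:] + [fill] * min(offset, len(src))

def vgather(src, idx):
    """Gather: result[i] = src[idx[i]]; out-of-range indices produce 0."""
    return [src[j] if 0 <= j < len(src) else 0 for j in idx]

def vcompress(src, mask):
    """Compress: pack the elements whose mask bit is set to the front."""
    return [x for x, m in zip(src, mask) if m]

def vexpand(packed, mask, fill=0):
    """Expand: place packed values at the set mask positions, fill elsewhere."""
    it = iter(packed)
    return [next(it, fill) if m else fill for m in mask]

src = [10, 20, 30, 40]
mask = [1, 0, 1, 0]
print(vslidedown(src, 1))                  # shifted by one element
print(vgather(src, [3, 0, 0, 7]))          # arbitrary reordering
print(vcompress(src, mask))                # keep only selected elements
print(vexpand(vcompress(src, mask), mask)) # restore original positions
```

Compress followed by expand with the same mask returns the selected elements to their original positions, which is the kind of sparse-data handling these operations are typically used for.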

The NX27V, Andes Technology's first RISC-V vector processor, supports the latest version of the V extension, 0.8. It performs computations on 8-bit, 16-bit and 32-bit integers as well as 16-bit and 32-bit floating-point values. It also supports the Bfloat16 and Int4 formats to reduce the storage and transfer bandwidth needed for the weight values of machine learning algorithms. The RISC-V vector spec is highly flexible, allowing designers to configure key design parameters such as the vector length (the number of bits in each vector register) and the SIMD width (the number of bits processed by the vector engine each cycle).
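The difference between the two parameters can be shown with a little arithmetic. The sketch below is a simplified model, not a description of the NX27V pipeline: it assumes that a full-register operation takes one pass through the datapath per SIMD-width slice, and the function names are invented for this example.

```python
# Simplified model of vector length (VLEN) versus SIMD width.
# Assumption: one datapath pass per SIMD-width slice of the register.

def elements_per_register(vlen_bits, element_bits):
    """Number of elements of a given width that fit in one vector register."""
    if vlen_bits % element_bits:
        raise ValueError("element width must divide the vector length")
    return vlen_bits // element_bits

def passes_per_instruction(vlen_bits, simd_bits):
    """Datapath passes needed to cover one full register (ceiling division)."""
    return -(-vlen_bits // simd_bits)

for width in (8, 16, 32):
    print(width, "bit elements per 512-bit register:",
          elements_per_register(512, width))

print("passes at 512-bit SIMD width:", passes_per_instruction(512, 512))
print("passes at 256-bit SIMD width:", passes_per_instruction(512, 256))
```

Under this model, a narrower SIMD width trades throughput for area: the register still holds the same number of elements, but each instruction occupies the datapath for more passes.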

The NX27V supports a vector length of up to 512 bits, expandable to 4096 bits by combining up to eight vector registers. With multiple functional units operating in parallel pipelines, it can sustain the computation throughput needed in diverse applications. In an implementation configured with a 512-bit vector length and the same SIMD width, it reaches 1 GHz in 7nm under worst-case conditions within an area of 0.3 mm². For software development, in addition to the compiler, debugger, vector libraries and cycle simulator, a visualization tool for the NX27V pipeline, Clarity, helps analyze and optimize the performance of critical loops. This solution has already begun shipping in our early access program.
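As a rough illustration of what register grouping buys, the sketch below computes how many elements one instruction can cover when registers are combined, and how a loop over a longer array would be split into chunks (strip-mining). It is a simplified model under stated assumptions: each chunk takes the smaller of the remaining element count and the group capacity, which is one permitted choice rather than the only one, and the function names are hypothetical.

```python
# Simplified model of register grouping and strip-mining.
# Assumption: each chunk takes min(remaining elements, group capacity).

def group_capacity(vlen_bits, element_bits, group=1):
    """Elements covered by one instruction when `group` registers are combined."""
    return vlen_bits * group // element_bits

def strip_mine(n, vlen_bits, element_bits, group=1):
    """Yield (start, count) chunks that together cover n elements."""
    cap = group_capacity(vlen_bits, element_bits, group)
    start = 0
    while start < n:
        count = min(n - start, cap)
        yield start, count
        start += count

# Eight 512-bit registers form one 4096-bit group.
print(group_capacity(512, 8, group=8))             # 8-bit elements per group
print(list(strip_mine(1000, 512, 8, group=8)))     # chunks with grouping
print(len(list(strip_mine(1000, 512, 8, group=1))))  # chunk count without
```

In this model, a 1000-element array of 8-bit values needs two chunks with eight-register grouping versus sixteen with a single register, so fewer instructions are issued for the same work.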

In the past 15 months, we have seen strong demand for higher performance achieved by adding a powerful RISC-V vector extension, matching it with a high-bandwidth memory subsystem, and bringing the accelerator closer to the CPU. This is the type of computing requirement we believe will drive the demand for RISC-V and vector processing.

— Charlie Cheng is board of directors’ advisor for Andes Technology
