Los Gatos, California-based startup CacheQ Systems has launched a high-level language (HLL) development environment for heterogeneous computing, aimed at software developers with limited knowledge of hardware design who need to deliver high-performance processor and field-programmable gate array (FPGA) compute architectures.

The company said its newly released QCC Acceleration Platform simplifies processor and FPGA application development and enables applications to be implemented in days compared to a typical nine- to 12-month schedule.

CacheQ was co-founded by CEO Clay Johnson and CTO Dave Bennett, who were both at Xilinx for 16 years. Its QCC Acceleration Platform provides an HLL software development platform for heterogeneous compute architectures. Based on the company’s proprietary CacheQ Virtual Machine (CQVM), a complete application representation can be analyzed, partitioned, optimized and targeted to a variety of compute engines, including x86, Arm and RISC-V processors and FPGAs. The input is HLL code, which is compiled into a CQVM representation, then optimized and partitioned. Compute executables are then generated from the partitioned CQVM.

The CQVM enables extensive analysis and optimization prior to compute executable generation. Software developers can run performance simulations, profile the complete virtual machine, view the CQVM to examine partitioning results and hot spots, and analyze compute resource utilization.

“Demand for hardware acceleration beyond x86 is tremendous. Our goal is to simplify high-performance data center and edge-computing application development. The QCC Acceleration Platform meets that goal and will enable new solutions across a variety of applications, including life sciences, financial trading, government, oil and gas exploration and industrial IoT.”

According to CacheQ, existing FPGA solutions have evolved over the last 30 years and focus solely on hardware designers, not the needs of software developers, so its acceleration platform is meant for software developers with limited knowledge of hardware architecture. While existing technologies require a variety of tasks to be done by hardware designers, the platform’s fully pipelined implementations are complemented by a custom many-port pooled memory architecture.

Johnson told EE Times that the company’s virtual machine is a complete representation of the algorithm, and that no one else does this. The platform then integrates this representation with the memory system. He added, “Partitioning is important, and we do this fully automated or guided. We also automatically pipeline everything so that the developer doesn’t have to.” He said the other key feature is code unrolling.

A bad partition can negatively impact performance, and the partition has to change as an application evolves to optimize performance and limit data traffic. Requiring the user to develop compute engine-specific code will yield inferior results. CacheQ said software developers write one application using its platform because it automatically partitions an application across compute elements that can be combinations of processors and FPGAs.

The CacheQ development flow (Image: CacheQ Systems)

To achieve performance in FPGAs, loops need to be pipelined; other methods of acceleration without pipelining are limited. Processor loop execution time is (N*C)/(clock rate), where N is the number of iterations and C is the number of cycles per iteration. For FPGAs, fully pipelined execution time is (N+C)/(clock rate). The QCC Acceleration Platform automatically pipelines all loops. For further acceleration, pipelined loops can be unrolled through a simple command-line option with no code modification.
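The two timing formulas can be compared with a few lines of Python. This is only a rough sketch of the arithmetic, not CacheQ's performance model: the function names, the clock rates and the iteration counts are illustrative assumptions, and the treatment of unrolling (U parallel copies of the loop body reduce the number of pipeline slots to roughly N/U) is a simplification that does not come from the company.

```python
from math import ceil


def processor_loop_time(n_iters, cycles_per_iter, clock_hz):
    """Sequential model: every iteration pays its full cycle count, (N*C)/clock."""
    return (n_iters * cycles_per_iter) / clock_hz


def pipelined_loop_time(n_iters, cycles_per_iter, clock_hz, unroll=1):
    """Fully pipelined model: (N+C)/clock.

    `unroll` is an assumption of this sketch: with U copies of the loop
    body, ceil(N/U) iteration groups enter the pipeline, one per cycle.
    """
    groups = ceil(n_iters / unroll)
    return (groups + cycles_per_iter) / clock_hz


if __name__ == "__main__":
    # Hypothetical numbers: a 3 GHz processor versus a 300 MHz FPGA design.
    N, C = 1_000_000, 20
    cpu = processor_loop_time(N, C, 3e9)
    fpga = pipelined_loop_time(N, C, 300e6)
    fpga_x4 = pipelined_loop_time(N, C, 300e6, unroll=4)
    print(f"processor:          {cpu * 1e3:.3f} ms")
    print(f"pipelined:          {fpga * 1e3:.3f} ms")
    print(f"pipelined, 4x unroll: {fpga_x4 * 1e3:.3f} ms")
```

With these assumed figures the pipelined loop finishes sooner despite the tenfold slower clock, because its cost grows with N+C rather than N*C.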

Pipelining and loop unrolling create simultaneous accesses to memory, and parallel execution stalls without sufficient memory access and bandwidth. Traditional FPGA development requires users to rewrite their code and guarantee predictable memory access, an insurmountable challenge for most applications. The answer is tight integration between the memory subsystem and the application code to deliver performance and reduce development time. CacheQ said its platform’s proprietary multi-port arbitrated cached memory subsystem integrates with CQVM to deliver up to 100 memory ports and terabytes of memory bandwidth. Dynamic memory allocation with C’s malloc and complex pointer references are also fully supported.
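Why the number of memory ports matters can be illustrated with a small extension of the pipelined timing formula. The sketch below is a simplified textbook-style model, not a description of CacheQ's memory subsystem: it assumes each loop iteration issues a fixed number of memory accesses, that each port serves one access per cycle, and that a shortage of ports simply stretches the interval between successive iteration groups. All names and numbers are hypothetical.

```python
from math import ceil


def memory_bound_loop_time(n_iters, cycles_per_iter, clock_hz,
                           unroll, accesses_per_iter, memory_ports):
    """Pipelined loop time when memory ports limit how often work can start.

    Assumption of this sketch: an unrolled group needs
    unroll * accesses_per_iter accesses, and only `memory_ports` of them
    can be issued per cycle, so a new group starts every `interval` cycles.
    """
    demand = unroll * accesses_per_iter
    interval = max(1, ceil(demand / memory_ports))
    groups = ceil(n_iters / unroll)
    return (groups * interval + cycles_per_iter) / clock_hz


if __name__ == "__main__":
    # Hypothetical loop: 1,000 iterations, 10-cycle body, 2 accesses each,
    # unrolled 8x, timed in cycles (clock_hz=1.0).
    for ports in (4, 16):
        cycles = memory_bound_loop_time(1000, 10, 1.0, unroll=8,
                                        accesses_per_iter=2,
                                        memory_ports=ports)
        print(f"{ports:>2} ports: {cycles:.0f} cycles")
```

In this model, unrolling by eight with only four ports leaves the pipeline waiting on memory most of the time, while enough ports lets each group start on every cycle.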

The QCC Acceleration Platform is shipping now in limited volume. The initial release supports FPGA accelerator boards from Alpha Data, Bittware and Xilinx. Support for processor and FPGA system-on-chip (SoC) boards will be available later in the year. Johnson told EE Times that the company’s initial customers are using the platform for applications in weather simulation, embedded manufacturing and government.

While the company’s immediate plans are to engage with key customers to target heterogeneous compute using processors and FPGAs, longer term it hopes to provide a development and orchestration platform for distributed heterogeneous computing. Distributed in this context, Johnson said, means in the data center or at the edge.