High-Performance SoC Design

  • CPU design with homegrown architecture and ideal PPA
  • CPU hardening for off-the-shelf CPUs to achieve target PPA
  • Expertise in superscalar, out-of-order processors
  • Decades of expertise in multi-core SoC design
  • High-performance SoC integration with performance-critical IPs (MCU, DDR, PCIe, Ethernet, SATA)

Apex’s Mantra for High-Performance Design


The Apex Semiconductor team has delivered numerous microprocessor cores that are both high-performance (in terms of instructions per cycle, or IPC) and high-frequency (3 GHz+). A given Instruction Set Architecture (ISA) can be micro-architected in many different ways, each producing a different power, performance, and area (PPA) profile targeted at a specific market segment. Developing these high-performance chips requires a carefully crafted design, validation, and implementation methodology. At Apex, we have built a homegrown recipe for high-performance microprocessor design that has proven successful across many ISAs, technology nodes, and end applications.

In the conventional ASIC chip design process, the RTL is written first and handed to the back-end team, which implements it on a chosen floorplan and performs the place-and-route steps. Critical timing paths in these designs can be managed by adding flip-flops where needed, and the performance degradation from those added flip-flops is not a primary concern. In such designs, the micro-architecture can be decoupled from the implementation by bounding the design units with flip-flops, making these units largely independent in terms of design closure.

High-performance microprocessor design needs a fundamentally different approach. An aggressive yet balanced micro-architecture must be carefully designed before detailed front-end or back-end implementation work even begins. Early validation and feasibility studies of the micro-architecture are an essential part of this process. These studies include developing an early performance model to validate the design choices made for cache sizes, execution slices, branch prediction schemes, scheduler structures, buffer sizes, and more. In addition to the performance model, an early implementation feasibility study is essential to validate the floorplan and the pipeline stages in the micro-architecture.
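As a rough illustration of what such an early performance model does, the toy first-order CPI model below compares two candidate cache configurations before any RTL exists. The function name, parameters, and all numbers are hypothetical assumptions for illustration only, not Apex's actual model.

```python
# Toy first-order performance model (illustrative only): estimate CPI
# from a few micro-architectural parameters. All names and numbers
# below are hypothetical assumptions, not a real core's figures.

def estimate_cpi(base_cpi, l1_miss_rate, l1_miss_penalty,
                 branch_freq, mispredict_rate, mispredict_penalty):
    """Additive CPI model: base pipeline CPI plus stall contributions
    from cache misses and branch mispredictions (cycles per instr)."""
    cache_stalls = l1_miss_rate * l1_miss_penalty
    branch_stalls = branch_freq * mispredict_rate * mispredict_penalty
    return base_cpi + cache_stalls + branch_stalls

# Compare two hypothetical cache-size choices.
small_cache = estimate_cpi(0.5, 0.05, 20, 0.2, 0.04, 14)
large_cache = estimate_cpi(0.5, 0.02, 20, 0.2, 0.04, 14)
print(f"small cache: CPI {small_cache:.3f}, IPC {1/small_cache:.2f}")
print(f"large cache: CPI {large_cache:.3f}, IPC {1/large_cache:.2f}")
```

Even this crude model makes the trade-off quantitative: the architect can see how much IPC a larger cache buys before committing the choice to RTL.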

In Apex’s design flow, multiple feasibility studies are performed before a single line of RTL is written. This early analysis allows designers to discover performance pitfalls and critical timing paths that must be addressed early in the design. The RTL is then coded with a strict budget of normalized gates (NGs) for every pipeline stage. Wire delays must be accounted for in every pipeline stage and included in the NG budget for the cycle. Our homegrown high-performance chip design flow is tailored to the unique challenges encountered in high-performance microprocessor design.
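A per-stage NG budget check can be sketched as follows; the function, the picosecond-per-NG conversion factor, and the example numbers are all illustrative assumptions, meant only to show how wire delay is folded into the same budget as logic depth.

```python
# Hypothetical per-stage normalized-gate (NG) budget check. The
# ps-per-NG conversion and all numbers are illustrative assumptions.

def stage_fits(logic_ng, wire_delay_ps, ng_budget, ps_per_ng=12.0):
    """Return (fits, total_ng): logic depth plus wire delay expressed
    in equivalent NG units, compared against the cycle's NG budget."""
    wire_ng = wire_delay_ps / ps_per_ng  # wire delay in NG units
    total = logic_ng + wire_ng
    return total <= ng_budget, total

fits, total = stage_fits(logic_ng=14, wire_delay_ps=60, ng_budget=20)
print(f"stage depth = {total:.1f} NG, fits budget: {fits}")
```

The point of the check is that a stage can blow its budget on wire delay alone, which is exactly what the floorplan iterations in the feasibility studies are meant to catch early.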

Cores with Different ISAs

Extensive experience in core design for various ISAs:

  • x86
  • ARM
  • RISC-V
  • PowerPC
  • UltraSPARC

Expertise in CPU implementation for both homegrown architectures and hardening of off-the-shelf CPUs.

Design Feasibility: Case 1


Goal: Load Store Unit (LSU) – Achieve a Load-to-Use Latency of 5 Cycles

Operations to be performed:

  • Create the memory address from operand flops using high-speed adders/subtractors
  • Extend the memory address as needed and multiplex it with constants
  • Send the lower bits of the address to pick the correct way from the multi-way set-associative cache
  • Set up the index bits to the memory macro
  • Perform the memory read, format the data with rotators, and multiplex it with other data
  • Return the load data, bypass it to the operand flops, and write it into the register file
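One way to make the 5-cycle target concrete during feasibility is to record a candidate stage assignment for the operations above and check it against the latency budget. The mapping below is a purely hypothetical assignment for illustration, not the actual design.

```python
# Hypothetical mapping of the LSU operations onto a 5-cycle
# load-to-use pipeline. The stage assignment is an illustrative
# assumption made for the feasibility sketch, not a real design.

LOAD_PIPELINE = {
    1: "address generation (fast adder), extension, mux with constants",
    2: "way select from lower address bits; set up index to memory macro",
    3: "memory macro read",
    4: "data formatting with rotators; mux with other data sources",
    5: "load return: bypass to operand flops, register-file write",
}

# The feasibility check: every operation fits, and no stage is unused.
assert len(LOAD_PIPELINE) == 5, "load-to-use latency target of 5 cycles"
for cycle, work in sorted(LOAD_PIPELINE.items()):
    print(f"cycle {cycle}: {work}")
```

In practice each of these stages would then be validated against its NG budget, including the wire delay to and from the macros.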

Feasibility considerations:

  • Build a Kogge-Stone Ling adder for fast addition
  • Perform floorplan iterations to estimate and minimize the distance travelled to the way-select macros
  • Perform floorplan iterations to estimate and minimize the distance travelled to the data cache macros
  • Build 64-bit fast muxes with built-in priority for timing-critical inputs
  • Perform floorplan iterations to estimate and minimize the distance travelled from the data cache macros back to the register file and the operand flops
  • Develop a pipeline proposal to ensure that all of the above completes while running at the target frequency
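The first consideration, a fast parallel-prefix adder, can be modeled at the bit level. The sketch below implements a plain Kogge-Stone carry tree in Python (the Ling-carry refinement used in hardware is omitted for brevity) to show the generate/propagate structure that gives the adder its log2(width) carry depth.

```python
# Bit-level model of a Kogge-Stone parallel-prefix adder (illustrative
# sketch; the hardware Ling-carry variant is omitted). Shows why the
# carry chain resolves in log2(width) prefix levels.

def kogge_stone_add(a, b, width=64):
    g = [(a >> i & 1) & (b >> i & 1) for i in range(width)]  # generate
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]  # propagate
    dist = 1
    while dist < width:                    # log2(width) prefix levels
        g_new, p_new = g[:], p[:]
        for i in range(dist, width):       # combine with node dist away
            g_new[i] = g[i] | (p[i] & g[i - dist])
            p_new[i] = p[i] & p[i - dist]
        g, p = g_new, p_new
        dist *= 2
    carry = [0] + g[:width - 1]            # carry into bit i is G[i-1]
    s = 0
    for i in range(width):
        s |= (((a >> i & 1) ^ (b >> i & 1) ^ carry[i]) & 1) << i
    return s                               # sum modulo 2**width

print(kogge_stone_add(123456789, 987654321))  # → 1111111110
```

In silicon, each `while` iteration corresponds to one level of the prefix tree, so a 64-bit add needs only six levels of carry logic, which is what makes the adder attractive for a tight address-generation cycle.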