

### March 20-21, 2018 SUMMIT, San Jose, CA



# Accelerating Flash Memory with the High Performance, Low Latency, OpenCAPI Interface

Allan Cantle, CTO & Founder, Nallatech/Molex

Marcy Byers, Processor Development, IBM





### Nallatech at a Glance

Server qualified accelerator cards featuring FPGAs, network I/O and an open architecture software/firmware framework.

Design Services/Application Optimisation

- Nallatech, a Molex company
- 25 years of FPGA heritage
- Energy-Efficient High Performance Heterogeneous Computing
- Real-time, low latency network and I/O processing
- Intel PSG (Altera) OpenCL partner
- Xilinx Alliance partner
- Server partners: Cray, DELL, HPE, IBM, Lenovo
- Application porting & optimization services
- Successfully deployed high volumes of FPGA accelerators









### Hyperconvergence vs Disaggregation – An unavoidable Oxymoron?

### **Hyperconverged Architectures**

- CPU Centric Playbook
- Best Single Threaded Performance
- Tightest of CPU/Accelerator Coupling
  - Holy Grail = $\infty$ Bandwidth & 0 Latency
- Easier acceleration of Legacy Code
- PCIe is today's convergence bus
  - E.g. NVMe SSDs



Can we hyperconverge & disaggregate Flash Memory at the same time?

### **Disaggregated Architectures**

- Data Centric Playbook
- Heterogeneous & Distributed compute
- Prioritize Application Dataflow needs
  - Can put congestion back into the Network
- Latency managed, compute => data
- Ethernet is today's disaggregation fabric
  - E.g. NVMe-oF



### Hyperconverged acceleration's quest for Bandwidth & 0 Latency

- Moving a single thread of data from the CPU to an accelerator can negate the acceleration benefit
  - Therefore Acceleration > overhead of data movement to/from accelerator
- Partition code for minimum data movement and maximum acceleration
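The break-even condition above (acceleration must outweigh the cost of moving data to and from the accelerator) can be checked with simple arithmetic. The following is a minimal sketch under a simple model: offloaded time is accelerated compute time plus transfer time. The function name, the 10x acceleration factor, the 20 GB/s link and the payload sizes are illustrative assumptions, not figures from the presentation.

```python
def offload_speedup(cpu_time_s, accel_factor, bytes_moved, link_bytes_per_s, link_latency_s=0.0):
    """Net speedup of offloading one task to an accelerator.

    Model assumed by this sketch: the offloaded run costs the accelerated
    compute time plus the time to move the data over the link.
    """
    accel_time_s = cpu_time_s / accel_factor
    transfer_time_s = link_latency_s + bytes_moved / link_bytes_per_s
    return cpu_time_s / (accel_time_s + transfer_time_s)


# Hypothetical kernel: 1 ms on the CPU, 10x faster on the accelerator, 20 GB/s link.
small_move = offload_speedup(1e-3, 10, bytes_moved=1e6, link_bytes_per_s=20e9)
large_move = offload_speedup(1e-3, 10, bytes_moved=40e6, link_bytes_per_s=20e9)

print(f"1 MB moved:  {small_move:.2f}x")   # movement is a small part of the total
print(f"40 MB moved: {large_move:.2f}x")   # below 1.0x: movement negates the acceleration
```

With these assumed numbers the same 10x kernel is a net win when 1 MB is moved and a net loss when 40 MB is moved, which is why partitioning for minimum data movement matters as much as raw acceleration.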



### Hyperconverged tightly Coupled FPGA Acceleration is not new

Intel, Xilinx, Nallatech & ISI collaborated on FSB & QPI attached Accelerators

- Started circa 2007
- Despite a decade of ongoing efforts, commercial reality has been elusive





### Accelerators – Open Attach Strategy



1. Open Ubiquitous Standards Based Approach: PCIe\* Gen1, PCIe\* Gen2, and Geneseo (Extend PCI Express\* Gen 2, Joint Intel/IBM Proposal in PCI-SIG)

2. Enable third party FSB-FPGA Modules – targeted for FSI, Oil and Gas, Life Sciences, Digital Health, etc

FSB-FPGA Modules Targeted 4Q07/1Q08



3. Intel® QuickAssist Technology Accelerator Abstraction Layer that seamlessly allows the SW to access acceleration across various technologies.

**Open Standards Based Attach Strategy** 

SOURCE : http://rssi.ncsa.uiuc.edu/2007/docs/industry/Intel\_presentation.pdf



### Why are Tightly Coupled FPGA Accelerators so challenging?

- Tied to complex proprietary coherent busses
  - Rapid cadence of bus standards
  - Limited interface documentation
  - Onerous licensing terms
- Coherent busses not natively designed with FPGAs in mind
  - Pushes limits of FPGA's capabilities
- Heavy burden on FPGA resources for interface IP
  - Impacts performance, in particular latency
  - Reduces resources available for acceleration
- Can drag down the performance of native CPUs using the same bus

# **OpenCAPI** addresses these issues



## Open Coherent Accelerator Processor Interface (OpenCAPI)

# **OpenCAPI**<sup>TM</sup> **Overview**

Open Compute Project 2018





### **OpenCAPI Key Attributes**

### **Standard System Memory**



1. Architecture agnostic bus: applicable with any system/microprocessor architecture
2. Optimized for High Bandwidth and Low Latency
3. High performance 25G interface design with zero 'overhead'
4. Coherency: attached devices operate natively within application's user space and coherently with host microprocessor
5. Virtual addressing enables low overhead with no Kernel, hypervisor or firmware involvement
6. Wide range of Use Cases and access semantics
7. CPU coherent device memory (Home Agent Memory)
8. Architected for both Classic Memory and emerging Advanced Storage Class Memory

# **Buffered System Memory**







### **Comparison of Memory Paradigms**









**OpenCAPI 3.1 Architecture:** Ultra Low Latency ASIC buffer chip adding +5ns on top of native DDR direct connect!!

Storage Class Memories have the potential to be the next disruptive technology. Examples include ReRAM, MRAM, Z-NAND. All are racing to become the de facto

Storage Class Memory tiered with traditional DDR Memory, all built upon OpenCAPI 3.1 & 3.0 architecture. Still have the ability to use Load/Store Semantics.

### **Acceleration Paradigms with Great Performance**





Examples: Machine or Deep Learning such as Natural Language processing, sentiment analysis or other Actionable Intelligence using OpenCAPI attached memory

Examples: Encryption, Compression, Erasure prior to delivering data to the network or storage

Examples: Database searches, joins, intersections, merges. Only the Needles are sent to the processor.

**OpenCAPI WINS due to Bandwidth to/from accelerators, best of breed latency, and flexibility of an Open architecture**

Examples: Video Analytics, Network Security, Deep Packet Inspection, Data Plane Accelerator, Video Encoding (H.265), High Frequency Trading, etc

Examples: NoSQL such as Neo4J with Graph Node Traversals, etc
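The database search paradigm above, where only the "needles" are sent to the processor, can be illustrated in software. This is a minimal sketch, not an OpenCAPI API: the function `near_data_search`, the 64-byte record layout and the 1-in-1,000 match rate are hypothetical, chosen only to show how much less data crosses the link when filtering happens next to the data.

```python
def near_data_search(records, predicate):
    """Filter records where they are stored and return only the matches.

    Also returns the bytes a host-side scan would have moved and the bytes
    moved when only the matches cross the link.
    """
    matches = [r for r in records if predicate(r)]
    bytes_full_scan = sum(len(r) for r in records)
    bytes_needles = sum(len(r) for r in matches)
    return matches, bytes_full_scan, bytes_needles


# Hypothetical dataset: 10,000 records of 64 bytes, 1 in 1,000 is a match.
records = [(b"needle" if i % 1000 == 0 else b"hay").ljust(64, b".") for i in range(10_000)]
matches, full_scan, needles = near_data_search(records, lambda r: r.startswith(b"needle"))

print(f"matches: {len(matches)}")
print(f"bytes moved by a host-side scan: {full_scan}")
print(f"bytes moved when filtering near the data: {needles}")
```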

### TLx and DLx Reference Designs in an FPGA

- TLx and DLx will be provided as reference designs, along with RTL, to OpenCAPI consortium members
- TLx and DLx are not symmetric with OTL and ODL that are on the host processor
- Designed to operate at 400MHz

Xilinx Vivado 2017.1 TLx and DLx statistics on VU3P device:

| VU3P Resources | CLB FlipFlops        | LUT as Logic         | LUT Memory          | Block RAM Tile |
|----------------|----------------------|----------------------|---------------------|----------------|
| DLx            | 9392/788160 (1.19%)  | 19026/394080 (4.82%) | 0/197280 (0%)       | 7.5/720 (1.0%) |
| TLx            | 13806/788160 (1.75%) | 8463/394080 (2.14%)  | 2156/197280 (1.09%) | 0/720 (0%)     |

| FPGA  | Total kLUTs | Fabric Utilization |
|-------|-------------|--------------------|
| VU3P  | 394         | 8.1%               |
| KU15P | 523         | 6.1%               |
| VU9P  | 1182        | 2.7%               |
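The utilization figures in the two tables can be cross-checked with a short calculation. The sketch below copies the used/available counts and the kLUT totals from the tables; the helper name `utilization_pct` is hypothetical, and the "implied kLUTs" figure is an inference from the second table rather than a number stated in the presentation.

```python
# (used, available) counts copied from the VU3P table.
VU3P = {
    "DLx": {"CLB FlipFlops": (9392, 788160), "LUT as Logic": (19026, 394080),
            "LUT Memory": (0, 197280), "Block RAM Tile": (7.5, 720)},
    "TLx": {"CLB FlipFlops": (13806, 788160), "LUT as Logic": (8463, 394080),
            "LUT Memory": (2156, 197280), "Block RAM Tile": (0, 720)},
}

# (total kLUTs, fabric utilization in percent) copied from the second table.
DEVICES = {"VU3P": (394, 8.1), "KU15P": (523, 6.1), "VU9P": (1182, 2.7)}


def utilization_pct(used, available):
    """Percentage of a resource consumed."""
    return 100.0 * used / available


for block, resources in VU3P.items():
    for name, (used, available) in resources.items():
        print(f"{block} {name:15s} {utilization_pct(used, available):5.2f}%")

# Combined LUT-as-logic use of DLx and TLx on the VU3P.
lut_logic_total = sum(VU3P[block]["LUT as Logic"][0] for block in VU3P)
print(f"DLx + TLx LUT as Logic: {lut_logic_total}")

# Implied footprint behind the second table: total kLUTs times utilization.
implied_kluts = {dev: total * pct / 100.0 for dev, (total, pct) in DEVICES.items()}
for dev, kluts in implied_kluts.items():
    print(f"{dev}: about {kluts:.1f} kLUTs")
```

The implied footprint comes out at roughly the same number of kLUTs on all three devices, which is consistent with a fixed-size interface design occupying a smaller fraction of larger parts.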



Associated reference design specifications for TLx and DLx will also be delivered




### **CAPI and OpenCAPI Performance**

|                | CAPI 1.0<br>PCIe Gen3 x8<br>Measured Bandwidth @8Gb/s | CAPI 2.0<br>PCIe Gen4 x8<br>Measured Bandwidth @16Gb/s | OpenCAPI 3.0<br>25 Gb/s x8<br>Measured Bandwidth @25Gb/s |
|----------------|-------------------------------------------------------|--------------------------------------------------------|----------------------------------------------------------|
| 128B DMA Read  | 3.81 GB/s                                             | 12.57 GB/s                                             | 22.1 GB/s                                                |
| 128B DMA Write | 4.16 GB/s                                             | 11.85 GB/s                                             | 21.6 GB/s                                                |
| 256B DMA Read  | N/A                                                   | 13.94 GB/s                                             | 22.1 GB/s                                                |
| 256B DMA Write | N/A                                                   | 14.04 GB/s                                             | 22.0 GB/s                                                |
|                | POWER8<br>Introduction in 2013                        | **POWER9**<br>2<sup>nd</sup> Generation                | **POWER9**<br>Open Architecture with a Clean Slate Focused on Bandwidth and Latency |
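To put the measured bandwidths in context, they can be compared with the raw signalling rate of each x8 link. This is a rough sketch: it assumes 8 lanes per link as stated in the column headings, takes the best measured figure for each interface, and deliberately ignores line encoding and protocol overhead, so the percentages are indicative only. The function and variable names are hypothetical.

```python
def raw_link_gbytes_per_s(lanes, gbits_per_lane):
    """Raw one-direction signalling rate in GB/s (no encoding or protocol overhead)."""
    return lanes * gbits_per_lane / 8.0


# (interface, lanes, Gb/s per lane, best measured GB/s from the performance figures)
LINKS = [
    ("CAPI 1.0, PCIe Gen3 x8", 8, 8.0, 4.16),
    ("CAPI 2.0, PCIe Gen4 x8", 8, 16.0, 14.04),
    ("OpenCAPI 3.0, 25 Gb/s x8", 8, 25.0, 22.1),
]

efficiency = {}
for name, lanes, rate, measured in LINKS:
    raw = raw_link_gbytes_per_s(lanes, rate)
    efficiency[name] = 100.0 * measured / raw
    print(f"{name}: {measured} of {raw:.0f} GB/s raw = {efficiency[name]:.1f}%")
```

Under these assumptions CAPI 1.0 reaches about half of the raw link rate, while CAPI 2.0 and OpenCAPI 3.0 both reach close to 88%.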



### Latency Ping-Pong Test

Simple workload created to simulate communication between system and attached FPGA

Bus traffic recorded with protocol analyzer and **PowerBus traces** 

Response times and statistics calculated

Host side:

1. Copy 512B from cache to FPGA
2. Poll on the 128B written back by the FPGA
3. Reset poll location
4. Repeat

FPGA side:

1. Poll on 512B received from host
2. Reset poll location
3. DMA write 128B for cache injection
4. Repeat
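The ping-pong pattern described above can be mimicked in software to show the control flow. This is only an analogue: two Python threads stand in for the host and the FPGA, a hypothetical `PollLocation` class stands in for the polled memory location, and the timings reflect Python thread scheduling rather than bus latency. The "spread" statistic (max minus min) is one possible way to express jitter; the presentation does not define how its jitter figures were computed.

```python
import statistics
import threading
import time

PING_BYTES = 512  # host -> FPGA payload size, from the slide
PONG_BYTES = 128  # FPGA -> host payload size, from the slide


class PollLocation:
    """One-slot buffer that the receiver polls and then resets (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._payload = None

    def write(self, payload):
        with self._lock:
            self._payload = payload

    def poll_and_reset(self):
        while True:
            with self._lock:
                if self._payload is not None:
                    payload, self._payload = self._payload, None
                    return payload
            time.sleep(0)  # yield so the other thread can run


def fpga_side(from_host, to_host, iterations):
    for _ in range(iterations):
        from_host.poll_and_reset()         # poll on 512B from host, reset poll location
        to_host.write(bytes(PONG_BYTES))   # stand-in for the 128B DMA write


def host_side(to_fpga, from_fpga, iterations):
    round_trips_ns = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        to_fpga.write(bytes(PING_BYTES))   # stand-in for the 512B copy to the FPGA
        from_fpga.poll_and_reset()         # poll for the 128B reply, reset poll location
        round_trips_ns.append(time.perf_counter_ns() - start)
    return round_trips_ns


def run_ping_pong(iterations=100):
    to_fpga, to_host = PollLocation(), PollLocation()
    fpga = threading.Thread(target=fpga_side, args=(to_fpga, to_host, iterations))
    fpga.start()
    samples = host_side(to_fpga, to_host, iterations)
    fpga.join()
    return samples


def summarize(samples_ns):
    """Round-trip statistics in nanoseconds."""
    return {
        "min": min(samples_ns),
        "mean": statistics.mean(samples_ns),
        "max": max(samples_ns),
        "spread": max(samples_ns) - min(samples_ns),
    }


print(summarize(run_ping_pong(100)))
```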









### Latency Test Results

Figure: latency stacks for two PCIe Gen3 hosts, P9 PCIe Gen3 (3.9GHz Core, 2.4GHz Nest) and Kaby Lake PCIe Gen3\*. Each path runs through the PCIe Stack, the PCIe G3 Link and the Altera PCIe HIP (400ns<sup>¶</sup>) into an Altera Stratix V FPGA. Values shown in the figure:

- **776ns<sup>§</sup> Total Latency**, 376ns, **7ns Jitter**
- 737ns<sup>§</sup> Total Latency, 337ns, 31ns Jitter

\* Intel Core i7 7700 Quad-Core 3.6GHz (4.2GHz Turbo Boost)

† Derived from round-trip time minus simulated FPGA app time

‡ Derived from round-trip time minus simulated FPGA app time and simulated FPGA TLx/DLx/PHYx time
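The totals in the latency figure are consistent with the 400ns Altera PCIe HIP figure being added to the smaller number on each path (376 + 400 = 776 and 337 + 400 = 737). The sketch below encodes that reading, together with the subtraction described in the footnotes. Treating the smaller number as the host-side portion is an assumption of this sketch, and the function names are hypothetical.

```python
ALTERA_PCIE_HIP_NS = 400  # Altera PCIe HIP latency quoted in the figure


def total_latency_ns(host_side_ns, device_ip_ns=ALTERA_PCIE_HIP_NS):
    """Total latency as host-side portion plus device interface IP (assumed split)."""
    return host_side_ns + device_ip_ns


def derived_latency_ns(round_trip_ns, simulated_fpga_app_ns, simulated_fpga_link_ns=0):
    """Footnote derivation: round-trip time minus simulated FPGA time.

    simulated_fpga_link_ns covers the TLx/DLx/PHYx time that the second
    footnote also subtracts; it defaults to 0 for the first footnote.
    """
    return round_trip_ns - simulated_fpga_app_ns - simulated_fpga_link_ns


for host_side in (376, 337):
    print(f"{host_side}ns + {ALTERA_PCIE_HIP_NS}ns = {total_latency_ns(host_side)}ns")
```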



### Nallatech / Molex ASG – CAPI Flash Acceleration Product Timeline

| Product | Interface / Platform | Hardware | Introduced |
|---------|----------------------|----------|------------|
| CAPI 1.0 Bridge to IBM's Flash Drawer | Power8, PCIe Gen 3 | Altera 5SGXA7 FPGA to Fiber Channel Interface | 2014 |
| CAPI 1.0 FlashGT, Storage Acceleration | Power8, PCIe Gen 3 | Xilinx KU060 FPGA + 2x 1TByte M.2 NVMe SSDs | 2016 |
| CAPI 2.0 FlashGT+, Storage Acceleration | Power9, PCIe Gen 4 | Xilinx KU15P FPGA + 4x 1TByte M.2 NVMe SSDs or 4x cabled U.2 NVMe SSDs | 2017 |
| OpenCAPI Hyperconverged & Disaggregatable Flash Storage Accelerator | Zaius/Barreleye-G2 OCP Power9 platforms | Xilinx ZU19P MPSoC FPGA + 8x 2TByte M.2 NVMe SSDs + 50GB/s Dataplane Fabric IO | 2018 |
| 250-SoC, Generic Storage Acceleration | OpenCAPI, CAPI 2.0, PCIe | Xilinx ZU19P MPSoC FPGA + 4x PCIe G4x8 cabled Storage IO + 50GB/s Dataplane Fabric IO | 2018 |




# Leveraging OpenCAPI as a Bridge to a Data Centric World



### **Data Centric Architectures - Fundamental Principles**

1. Consume Zero Power when Data is Idle
2. Don't Move the Data unless you absolutely have to
3. When Data has to Move, Move it as efficiently as possible

Which translates to:

1. Use Non Volatile Memory where possible
2. Move the compute to the data
3. Leverage independent power efficient Dataplanes





### Data Center Architectures, blending evolutionary with revolutionary







### Molex ASG HyperConverged & Disaggregatable Server

- Leverage Google & Rackspace's OCP Zaius/Barreleye G2 platform
- Reconfigurable FPGA Fabric with Balanced Bandwidth to CPU, Storage & Data Plane Network
- OpenCAPI provides Low Latency & coherent Accelerator / Processor Interface
- GenZ Memory-Semantic Fabric provides Addressable shared memory up to 32 Zettabytes





### Molex ASG Flash Storage Accelerator, FSA, in Barreleye-G2 OCP Server

- Xilinx Zynq US+ 0.5OU High Storage Accelerator Blade
- 4 FSAs in 2OU Barreleye-G2 OCP Storage drawer deliver:
  - 152 GByte/s PFD\* Bandwidth to 1TB of DDR4 Memory
  - 256 GByte/s PFD\* Bandwidth to 64TB of Flash
  - 200 GByte/s PFD\* Bandwidth through the OpenCAPI channels
  - 200 GByte/s PFD\* Bandwidth through the GenZ Fabric IO
- Open Architecture software/firmware framework
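The drawer-level figures above can be broken down per FSA with simple division. The sketch below assumes the four FSAs contribute equally and uses decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes). The per-blade values and the full-scan time are derived here for illustration and are not stated in the presentation.

```python
FSAS_PER_DRAWER = 4

# Aggregate figures for the 4-FSA drawer, copied from the bullets above.
DRAWER_BANDWIDTH_GBPS = {"DDR4": 152, "Flash": 256, "OpenCAPI": 200, "GenZ": 200}
DRAWER_CAPACITY_TB = {"DDR4": 1, "Flash": 64}

# Assumed equal split across the four blades.
per_fsa_bandwidth = {k: v / FSAS_PER_DRAWER for k, v in DRAWER_BANDWIDTH_GBPS.items()}
per_fsa_capacity = {k: v / FSAS_PER_DRAWER for k, v in DRAWER_CAPACITY_TB.items()}

# Time to stream the entire flash capacity once at the quoted peak bandwidth.
flash_bytes = DRAWER_CAPACITY_TB["Flash"] * 10**12
flash_bytes_per_s = DRAWER_BANDWIDTH_GBPS["Flash"] * 10**9
full_flash_scan_s = flash_bytes / flash_bytes_per_s

for name, gbps in per_fsa_bandwidth.items():
    print(f"{name}: {gbps:g} GByte/s per FSA")
print(f"Flash per FSA: {per_fsa_capacity['Flash']:g} TB")
print(f"Full flash scan at peak bandwidth: {full_flash_scan_s:g} s")
```

Under the equal-split assumption, each FSA holds 16 TB of flash and has 50 GByte/s through both its OpenCAPI channels and its GenZ fabric IO.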









# Summary

- The OpenCAPI interface standard is a perfect complement to the OCP Initiative, bringing best in class features including:
  - Coherency
  - Lowest Latency
  - Highest Bandwidth
  - Open Standard
- Perfect Bridge to blend CPU Centric & Data Centric Architectures
- Simultaneous Hyperconverged & Disaggregatable Flash Memory solutions can be built without performance compromise
- OCP, OpenCAPI & FPGA Acceleration are now bringing highly optimized Data Centric server architectures closer to reality







# JOIN TODAY! www.opencapi.org

# Come see us in the Expo Hall OpenCAPI Booth C5

# OPEN. FOR BUSINESS.



