

# **Towards Scalable Quantum Simulation on Wafer-Scale Engines**

### **Experimental Setup**

#### **Analysis Platform**

- Cerebras' WSE simulator used to run and profile our method
  - Counts clock cycles required for the entire GEMV operation
- Simulates a variable number of PEs
- Cannot simulate large (WSE-3 scale) PE grids
- We demonstrate scalability by varying PE grid size
- Compare results for 1 PE up to 128x82 PEs (1.3% of WSE-3)
  - WSE-3 aspect ratio of 1.555:1 maintained
- Results compared against Qiskit Aer simulator
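
The grid sizes used in this work are consistent with fixing one side of the grid and rounding the other to preserve the 1.555:1 ratio. A minimal sketch, assuming the sizes were obtained by simple rounding (the poster does not say how they were chosen); `grid_for_long_side` is a hypothetical helper name:

```python
def grid_for_long_side(long_side: int, aspect: float = 1.555) -> tuple[int, int]:
    """Return (long_side, short_side), rounding the short side so that
    long_side / short_side stays near the target aspect ratio."""
    short_side = max(1, round(long_side / aspect))
    return long_side, short_side


for long_side in (48, 64, 96, 128):
    a, b = grid_for_long_side(long_side)
    print(f"{a}x{b} PEs, ratio {a / b:.3f}, {a * b} PEs in total")
```

Under this assumption the helper reproduces the 48x31, 64x41, 96x62, and 128x82 grids, each within about 0.5% of the target ratio.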

#### Input Data

- Input matrix and vector created with random values
- QHT circuits created for a realistic application (Qiskit [11])
- Gray-scale images of size 8x8 to 64x64 pixels were used
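
The image sizes map onto qubit counts if each pixel occupies one amplitude of the state vector. The sketch below assumes that amplitude-style encoding, which the poster does not spell out; under it, an 8x8 image needs 6 qubits and a 64x64 image needs 12. `qubits_for_image` and `encode_image` are hypothetical helper names.

```python
import math
import random


def qubits_for_image(side: int) -> int:
    """Qubits needed to hold side*side pixels, one pixel per amplitude."""
    pixels = side * side
    if pixels < 1:
        raise ValueError("image must contain at least one pixel")
    n = pixels.bit_length() - 1
    if (1 << n) != pixels:
        raise ValueError("pixel count must be a power of two")
    return n


def encode_image(pixels: list[list[int]]) -> list[float]:
    """Flatten a gray-scale image row by row and L2-normalize it."""
    flat = [float(v) for row in pixels for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    if norm == 0.0:
        raise ValueError("an all-black image cannot be normalized")
    return [v / norm for v in flat]


random.seed(0)  # random stand-in for a real gray-scale image
image = [[random.randrange(256) for _ in range(8)] for _ in range(8)]
state = encode_image(image)
print(qubits_for_image(8), qubits_for_image(64), len(state))
```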

#### Relevant Variables

- PE grid size - demonstrates scaling and distribution
- Memory availability - restricted to match smaller WSE size
- $M_p$ & $N_p$ - control job size and aspect ratio
- Input channels - always 16 used, except 1 for 1x1 PE
  - PE grid height is always a multiple of 16 to minimize IO irregularities influencing performance
  - More than 100 actually available on WSE-3
- Algorithm - matrix buffering / vector buffering differences
- Qubit count (n) - number of qubits simulated

#### Performance Metrics

- Throughput (GB/s) - calculated from clock cycle count
- Speedup - parallel advantage over a serial processor
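
Both performance metrics follow directly from cycle counts. A minimal sketch, assuming throughput is bytes processed divided by elapsed time, and elapsed time is the cycle count divided by a clock frequency. The clock value, element size, and cycle counts below are illustrative placeholders, not figures from this work.

```python
def throughput_gb_per_s(bytes_processed: int, cycles: int, clock_hz: float) -> float:
    """Throughput in GB/s from a cycle count and an assumed clock frequency."""
    if cycles <= 0 or clock_hz <= 0:
        raise ValueError("cycles and clock_hz must be positive")
    seconds = cycles / clock_hz
    return bytes_processed / seconds / 1e9


def speedup(serial_cycles: int, parallel_cycles: int) -> float:
    """Parallel advantage: serial cycle count over parallel cycle count."""
    if parallel_cycles <= 0:
        raise ValueError("parallel_cycles must be positive")
    return serial_cycles / parallel_cycles


CLOCK_HZ = 1.0e9   # hypothetical clock frequency, not a WSE specification
ELEM_BYTES = 8     # assumed element size (for example, a single-precision complex value)
n = 10             # qubit count; the matrix is 2^n x 2^n
matrix_bytes = (2 ** n) * (2 ** n) * ELEM_BYTES

print(throughput_gb_per_s(matrix_bytes, cycles=4_000_000, clock_hz=CLOCK_HZ))
print(speedup(serial_cycles=8_000_000, parallel_cycles=4_000_000))
```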

[Figures: performance plots for 1x1, 48x31, 64x41, 96x62, and 128x82 PE grids, with extrapolated series marked.]

#### PE Grid Size

- Data for 1x1 PE was extrapolated for n > 8
  - Simulator SDK struggled to handle qubit counts larger than 8
  - Throughput stayed constant and close to 98 MB/s
- At 12 qubits, the 128x82 PE grid has a 671x speed advantage
- Continuous performance improvement with more qubits
  - Better parallel advantage with larger circuits
- Improvements with more PEs
  - Significant given constant input channels
  - Larger variance with more qubits

#### Memory Usage

- Larger allocation improves performance
  - Larger buffers reduce job count and associated costs
  - Less noticeable at the high end
- 48 kB is suitable for methodology on WSE-3
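
One way to see why larger buffers reduce job count: if each job covers one block of the matrix and the block must fit in the buffer, the number of jobs falls as the buffer grows. The sketch below assumes 8-byte elements, square blocks, and a buffer that holds exactly one block. None of these details come from the poster, and the 12 kB and 24 kB sizes are hypothetical comparison points.

```python
import math


def largest_square_block(buffer_bytes: int, elem_bytes: int = 8) -> int:
    """Side length of the largest square block that fits in the buffer."""
    return math.isqrt(buffer_bytes // elem_bytes)


def job_count(rows: int, cols: int, m_p: int, n_p: int) -> int:
    """Number of m_p x n_p blocks needed to cover a rows x cols matrix."""
    return math.ceil(rows / m_p) * math.ceil(cols / n_p)


dim = 2 ** 12  # matrix dimension for n = 12 qubits
for kilobytes in (12, 24, 48):
    side = largest_square_block(kilobytes * 1024)
    jobs = job_count(dim, dim, side, side)
    print(f"{kilobytes} kB buffer: {side}x{side} blocks, {jobs} jobs")
```

Under these assumptions, quadrupling the buffer roughly quarters the job count, so the fixed per-job cost is paid far less often.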

#### M<sub>p</sub>/N<sub>p</sub> Ratio (Matrix Buffering)

- 1:1 (square) and 1:4 perform best
- Much higher N<sub>p</sub> performs poorly
- Larger M<sub>p</sub> potentially scales better than 1:1
  - Slower at low qubit counts, but just as fast at n = 12
  - Deserves a more detailed future investigation
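
To make the role of $M_p$ and $N_p$ concrete, the sketch below computes a GEMV block by block, treating each $M_p \times N_p$ block as one job. It is a generic serial tiled GEMV in plain Python, not the WSE implementation; the name `blocked_gemv` and the loop order are assumptions for illustration.

```python
import random


def blocked_gemv(matrix, vector, m_p, n_p):
    """Compute y = A x by accumulating one m_p x n_p block of A at a time.

    Returns the result vector and the number of blocks (jobs) processed.
    """
    rows, cols = len(matrix), len(vector)
    y = [0j] * rows
    jobs = 0
    for r0 in range(0, rows, m_p):          # block row
        for c0 in range(0, cols, n_p):      # block column
            jobs += 1
            x_block = vector[c0:c0 + n_p]
            for r in range(r0, min(r0 + m_p, rows)):
                a_row = matrix[r][c0:c0 + n_p]
                y[r] += sum(a * x for a, x in zip(a_row, x_block))
    return y, jobs


random.seed(1)
dim = 2 ** 4  # 4-qubit example: 16 x 16 matrix with random complex entries
A = [[complex(random.random(), random.random()) for _ in range(dim)] for _ in range(dim)]
x = [complex(random.random(), random.random()) for _ in range(dim)]

y_square, jobs_square = blocked_gemv(A, x, m_p=4, n_p=4)  # 1:1 blocks
y_wide, jobs_wide = blocked_gemv(A, x, m_p=2, n_p=8)      # 1:4 blocks
reference = [sum(a * b for a, b in zip(row, x)) for row in A]
print(jobs_square, jobs_wide)
```

Both block shapes give the same result and the same job count in this example. Any performance difference between ratios therefore has to come from data movement and buffer re-use rather than from the arithmetic.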

#### M<sub>p</sub>/N<sub>p</sub> Ratio (Vector Buffering)

- Performance is strictly better with larger $N_p$
- Benefits most from buffer re-use
- Likely related to slow output on WSE
- More extreme ratios should be explored

#### Algorithm Type

- Matrix buffering shows clear advantage at n = 10
- Matrix buffering benefits from input/output overlap
  - Output is less optimized, hurting vector buffering more
- Advantage diminishes by n = 12
  - Vector buffering may scale better with n > 12
- Methods are too similar overall to conclude that one is better than the other

#### QHT Image Fidelity

- Compared against Qiskit Aer and a classically performed GEMV for correctness
- Tested at 100%, verifying no loss from the simulated
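
A fidelity check of this kind can be sketched as the squared overlap between a reference state vector and the simulated one. The poster does not state the exact formula used, so the pure-state fidelity below is an assumption, and `state_fidelity` is a hypothetical helper name.

```python
def state_fidelity(reference, simulated) -> float:
    """Pure-state fidelity |<ref|sim>|^2, normalizing both vectors first."""
    if len(reference) != len(simulated):
        raise ValueError("state vectors must have the same length")
    ref_sq = sum(abs(a) ** 2 for a in reference)
    sim_sq = sum(abs(b) ** 2 for b in simulated)
    if ref_sq == 0 or sim_sq == 0:
        raise ValueError("state vectors must be non-zero")
    overlap = sum(complex(a).conjugate() * b for a, b in zip(reference, simulated))
    return abs(overlap) ** 2 / (ref_sq * sim_sq)


reference = [0.5, 0.5j, -0.5, 0.5]    # stand-in for a reference simulator result
simulated = list(reference)           # stand-in for the GEMV-based result
print(state_fidelity(reference, simulated))
print(state_fidelity([1, 0], [0, 1]))  # orthogonal states
```

A value of 1.0 corresponds to the 100% fidelity reported above, and orthogonal states give 0.0.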

**Original Image** 







### **Conclusion and Future Work**

#### In This Work

- We proposed a method for quantum simulation using the General Matrix-Vector product (GEMV) operation on Cerebras' Wafer-Scale Engines (WSEs)
- We demonstrated scalability and parallel advantage using WSE simulations
  - Results indicate increasing parallel performance benefits with larger PE grids and qubit counts
  - Method scales well to the size of a physical WSE
- We investigated optimal configurations for multiple variables of our method
  - Block aspect ratio & buffering scheme

#### Future Work

- Investigate optimizations to proposed techniques and implementations
- Specialize to specific quantum circuit types, accelerating the method to compete with other hardware platforms
  - Sparse matrix optimization
  - Reduced matrix value range (integers, 0s & 1s, etc.)
  - Tensor contraction & circuit pipelining
- Include quantum error correction (QEC) techniques
- Port to Cerebras WSE hardware

Acknowledgements: This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.



KU School of Engineering

KU Institute for Information Sciences

Oak Ridge National Laboratory







### **Results and Analysis**