ModSRAM: Algorithm-Hardware Co-Design for Large Number Modular Multiplication in SRAM

Abstract

Elliptic curve cryptography (ECC) is widely used in security applications such as public key cryptography (PKC) and zero-knowledge proofs (ZKP). ECC is built on modular arithmetic, in which modular multiplication takes most of the processing time. The computational complexity and memory constraints of ECC limit its performance, so hardware acceleration of ECC is an active field of research. Processing-in-memory (PIM) is a promising approach to this problem. In this work, we design ModSRAM, the first 8T SRAM PIM architecture to compute large-number modular multiplication efficiently. In addition, we propose R4CSA-LUT, a new algorithm that reduces the cycle count of the interleaved algorithm and eliminates carry propagation in addition by using look-up tables (LUT). ModSRAM is co-designed with R4CSA-LUT to support modular multiplication and data reuse in memory, achieving a 52% cycle reduction compared to prior works with only 32% area overhead.

 

Methodology
[Image: modsram_algo]

Algorithm Design: A 5-bit illustration of the first iteration in the R4CSA-LUT dataflow. R4CSA-LUT needs half as many iterations as a conventional interleaved algorithm and uses carry-save addition to avoid carry propagation. It is co-designed with the ModSRAM architecture so that its operations are hardware-friendly and data can be reused through the LUT.
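To make the iteration count concrete, here is a minimal Python sketch of a radix-4 interleaved modular multiplication. It is not the R4CSA-LUT algorithm itself: the function name `interleaved_modmul_radix4`, the unsigned 2-bit digit encoding, and the four-entry `lut` are assumptions for illustration, and the sketch uses Python's `%` operator for reduction where the hardware would rely on carry-save addition and look-up tables.

```python
def interleaved_modmul_radix4(a, b, p, n_bits):
    """Return ((a * b) mod p, iterations), scanning `a` two bits at a time, MSB first."""
    if not 0 <= a < (1 << n_bits):
        raise ValueError("a must fit in n_bits")

    # Number of radix-4 digits; an odd bit width is implicitly zero-padded.
    n_digits = (n_bits + 1) // 2

    # Hypothetical look-up table: the four possible partial products,
    # pre-reduced mod p and reused in every iteration.
    lut = [(d * b) % p for d in range(4)]

    acc = 0
    for k in reversed(range(n_digits)):
        digit = (a >> (2 * k)) & 0b11      # next two multiplier bits
        acc = (4 * acc + lut[digit]) % p   # shift left by two, add, reduce
    return acc, n_digits


if __name__ == "__main__":
    result, iterations = interleaved_modmul_radix4(19, 27, 31, n_bits=5)
    print(result, iterations)
```

With 5-bit operands the loop runs 3 times instead of 5, and with 256-bit operands it runs 128 times instead of 256, which is where the halved iteration count of a radix-4 scheme comes from.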

[Image: modsram_arch]
The overall architecture of ModSRAM

Hardware Design: The figure above illustrates the overall architecture of ModSRAM. It is an SRAM PIM design with custom in-memory and near-memory computing circuits that execute the R4CSA-LUT algorithm, targeting efficient 256-bit modular multiplication. ModSRAM consists of a 64x256 8T SRAM array with one read port and one write port. The in-memory computing (IMC) circuit is the logic-SA module, which implements the XOR3 and MAJ bitwise logic functions for carry-save addition. The remaining peripheral circuits are the read wordline (RWL) and write wordline (WWL) decoders and the near-memory computing (NMC) circuits: a radix-4 encoder, combinational logic for overflow, three D flip-flops (DFF) for the sum, carry, and multiplicand, and a controller (Ctrl.).
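The role of XOR3 and MAJ in carry-save addition can be sketched in a few lines of Python. This is a software illustration only, under the assumption that operands are non-negative Python integers of unbounded width, so it ignores the fixed 256-bit row width and the overflow logic of the actual design. The function names are hypothetical.

```python
def xor3(x, y, z):
    """Bitwise three-input XOR: the per-bit sum of three operands."""
    return x ^ y ^ z


def maj(x, y, z):
    """Bitwise majority: the per-bit carry-out of three operands."""
    return (x & y) | (x & z) | (y & z)


def carry_save_add(x, y, z):
    """Compress three operands into a (sum, carry) pair with no carry chain.

    Every bit position is computed independently; the carry word is
    shifted left by one because each carry belongs to the next bit.
    """
    sum_word = xor3(x, y, z)
    carry_word = maj(x, y, z) << 1
    return sum_word, carry_word


if __name__ == "__main__":
    x, y, z = 0b10110, 0b01101, 0b11011
    s, c = carry_save_add(x, y, z)
    # The pair (s, c) represents the same value as x + y + z.
    assert s + c == x + y + z
    print(bin(s), bin(c))
```

Because the result stays in (sum, carry) form between iterations, a full carry-propagating addition is needed only when the final value is read out.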

Results
[Image: modsram_area]
Area breakdown of ModSRAM and full-custom layout of the SRAM array and in-memory circuit.

We evaluate ModSRAM using the TSMC 65 nm technology PDK. Full-custom circuits, including the SRAM array and IMC modules, are designed in Cadence Virtuoso. Digital circuits, including the WL decoders, NMC modules, and controller, are written in Verilog and synthesized with Synopsys Design Compiler. Experimental results are obtained from both HSPICE simulations and a Verilog testbench. The design area is derived from the full-custom layout together with the synthesis results. The area breakdown and full-custom layout are shown in the figure above.

[Image: modsram_compare]

The table above lists the number of clock cycles required for one modular multiplication. A 256-bit modular multiplication completes in 767 cycles at a clock frequency of 420 MHz. The R4CSA-LUT algorithm has O(n) complexity, so its cycle count scales linearly with bitwidth. The result is produced in direct form, so no extra conversion cost is incurred. The reported area is small because the design only demonstrates the operation of a single modular multiplication. The area breakdown shows that the memory array occupies two-thirds of the whole design. SAs account for most of the in-memory circuit area; the MUX, at only two transistors, is negligible. Because our design computes in memory, the near-memory circuits are compact and the WL decoders are very small. ModSRAM incurs only 32% area overhead from the near-memory circuits and two SAs.
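The cycle count and clock frequency above translate directly into a latency figure. The short calculation below uses only the two numbers stated in the text (767 cycles, 420 MHz). The throughput line is an assumption: it supposes multiplications run back to back on a single array with no overlap or setup cost, which the source does not state.

```python
CYCLES_PER_MODMUL = 767   # 256-bit modular multiplication, from the text
CLOCK_HZ = 420e6          # 420 MHz, from the text
BITWIDTH = 256

# Latency of one modular multiplication in microseconds.
latency_us = CYCLES_PER_MODMUL / CLOCK_HZ * 1e6

# Assumed back-to-back operation on one array (not stated in the source).
modmuls_per_second = CLOCK_HZ / CYCLES_PER_MODMUL

# Average cycles spent per multiplier bit.
cycles_per_bit = CYCLES_PER_MODMUL / BITWIDTH

if __name__ == "__main__":
    print(f"latency: {latency_us:.3f} us")
    print(f"throughput: {modmuls_per_second:,.0f} modmul/s")
    print(f"cycles per bit: {cycles_per_bit:.3f}")
```

This works out to roughly 1.83 microseconds per 256-bit modular multiplication, or about three cycles per multiplier bit.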

Citation
@inproceedings{ku2024modsram,  
   title={ModSRAM: Algorithm-Hardware Co-Design for Large Number Modular Multiplication in SRAM},  
   author={Ku, Jonathan Hao-Cheng and Zhang, Junyao and Shan, Haoxuan and Samudrala, Saichand and Wu, Jiawen and Zheng, Qilin and Li, Ziru and Rajendran, Jeyavijayan and Chen, Yiran},  
   booktitle={Proceedings of the 61st ACM/IEEE Design Automation Conference},  
   pages={1--6},  
   year={2024}
}