IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization

Title: IVQ: In-Memory Acceleration of DNN Inference Exploiting Varied Quantization
Publication Type: Journal Article
Year of Publication: 2022
Authors: F. Liu, W. Zhao, Z. Wang, Y. Zhao, T. Yang, Y. Chen, and L. Jiang
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Date Published: 01/2022
Abstract

Weight quantization is widely used to cope with the ever-growing complexity of deep neural network (DNN) models. Diversified quantization schemes produce weights with diverse bit-widths and formats, which in turn require different hardware implementations. This variety prevents a general NPU from leveraging different quantization schemes to gain performance and energy efficiency. More importantly, a trend toward quantization diversity is emerging, in which multiple quantization schemes are applied to different fine-grained structures (e.g., a layer or a channel of weights) of a DNN. A general architecture is therefore desired to exploit varied quantization schemes. The crossbar-based processing-in-memory (PIM) architecture, a promising DNN accelerator, is well known for its highly efficient matrix-vector multiplication. However, PIM suffers from an inflexible intra-crossbar data path, because the weights are stationary on the crossbar and bound to the “add” operation along the bit-line. As a result, many non-uniform quantization methods must roll back the quantization before mapping the weights onto the crossbar. Counter-intuitively, this paper discovers a unique opportunity for the PIM architecture to exploit varied quantization schemes. We first transform the quantization-diversity problem into a consistency problem by aligning bits of the same magnitude along the same bit-line of the crossbar. Such naive weight mapping, however, leaves many square hollows of idle PIM cells. We then propose a novel spatial mapping that exempts these “hollow” crossbars from the inter-crossbar data path. To further squeeze the weights onto fewer crossbars, we decouple the intra-crossbar data path from the hardware bit-line through a novel temporal scheduling, so that bits of different magnitudes can be placed on cells along the same bit-line. Finally, the proposed IVQ includes a temporal pipeline that avoids the introduced stall cycles, and a data flow with delicate control mechanisms for the new intra- and inter-crossbar data paths. Putting it all together, IVQ achieves 19.7×, 10.7×, 4.7×∼63.4×, and 91.7× speedup, and 17.7×, 5.1×, 5.7×∼68.1×, and 541× energy savings over two PIM accelerators (ISAAC and CASCADE), two customized quantization accelerators (based on an ASIC and an FPGA), and an NVIDIA RTX 2080 GPU, respectively.
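To make the magnitude-alignment idea concrete, below is a minimal Python sketch, not from the paper: all names, the group sizes, the bit-widths, and the fixed-point (shared-scale) alignment assumption are illustrative. It shows how placing every bit on the column matching its magnitude lets weight groups with different bit-widths share bit-lines, while narrower groups leave rectangular blocks of idle ("hollow") cells under the high-magnitude columns, which is the waste the paper's spatial mapping and temporal scheduling then remove.

```python
import numpy as np

# Hypothetical per-group bit-widths: three weight groups (e.g., channels)
# quantized to 8, 4, and 2 bits under a shared fixed-point scale.
group_bits = [8, 4, 2]
rows_per_group = 4          # crossbar rows occupied by each group
max_bits = max(group_bits)  # bit-lines needed once bits are magnitude-aligned

# Occupancy map of one logical crossbar: column j holds bits of magnitude
# 2^(max_bits-1-j), so bits of equal magnitude land on the same bit-line
# regardless of how many bits their group was quantized to.
total_rows = rows_per_group * len(group_bits)
occupied = np.zeros((total_rows, max_bits), dtype=bool)
for g, bits in enumerate(group_bits):
    r0 = g * rows_per_group
    # A group with fewer bits covers only the low-magnitude columns,
    # leaving a square hollow of idle cells under the high-magnitude ones.
    occupied[r0:r0 + rows_per_group, max_bits - bits:] = True

idle = (~occupied).sum()
print(f"idle (hollow) cells: {idle} of {occupied.size}")
```

Running this toy example reports 24 of 96 cells idle; in the paper's terms, those hollows are what motivate exempting all-idle crossbars from the inter-crossbar data path and re-packing bits via temporal scheduling.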

DOI: 10.1109/TCAD.2022.3156017
Short Title: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems