Learning to Train CNNs on Faulty ReRAM-based Manycore Accelerators

TitleLearning to Train CNNs on Faulty ReRAM-based Manycore Accelerators
Publication TypeJournal Article
Year of Publication2021
AuthorsBK Joardar,, H Li, K Chakrabarty, and PP Pande
JournalAcm Transactions on Embedded Computing Systems
Start Page1
Pagination1 - 23
Date Published10/2021

The growing popularity of convolutional neural networks (CNNs) has led to the search for efficient computational platforms to accelerate CNN training. Resistive random-access memory (ReRAM)-based manycore architectures offer a promising alternative to commonly used GPU-based platforms for training CNNs. However, due to the immature fabrication process and limited write endurance, ReRAMs suffer from different types of faults. This makes training of CNNs challenging as weights are misrepresented when they are mapped to faulty ReRAM cells. This results in unstable training, leading to unacceptably low accuracy for the trained model. Due to the distributed nature of the mapping of the individual bits of a weight to different ReRAM cells, faulty weights often lead to exploding gradients. This in turn introduces a positive feedback in the training loop, resulting in extremely large and unstable weights. In this paper, we propose a lightweight and reliable CNN training methodology using weight clipping to prevent this phenomenon and enable training even in the presence of many faults. Weight clipping prevents large weights from destabilizing CNN training and provides the backpropagation algorithm with the opportunity to compensate for the weights mapped to faulty cells. The proposed methodology achieves near-GPU accuracy without introducing significant area or performance overheads. Experimental evaluation indicates that weight clipping enables the successful training of CNNs in the presence of faults, while also reducing training time by 4
on average compared to a conventional GPU platform. Moreover, we also demonstrate that weight clipping outperforms a recently proposed error correction code (ECC)-based method when training is carried out using faulty ReRAMs.

Short TitleAcm Transactions on Embedded Computing Systems