|Title||STT-RAM cache hierarchy with multiretention MTJ designs|
|Publication Type||Journal Article|
|Year of Publication||2014|
|Authors||Z Sun, X Bi, H Li, WF Wong, and X Zhu|
|Journal||IEEE Transactions on Very Large Scale Integration (VLSI) Systems|
|Pagination||1281 - 1293|
Spin-transfer torque random access memory (STT-RAM) is the most promising candidate for universal memory due to its good scalability, zero standby power, and radiation hardness. Because an STT-RAM cell occupies only 1/9 to 1/3 the area of an SRAM cell, it enables a much larger cache within the same die footprint. This reduction in cell size can significantly shrink the cache array area, leading to notable improvements in overall system performance and power consumption, especially in the multicore era, where locality is crucial. However, deploying STT-RAM in L1 caches is challenging because STT-RAM write operations are slow and power-hungry. In this paper, we propose a range of cache hierarchy designs implemented entirely in STT-RAM that deliver optimal power savings and performance. In particular, our designs use STT-RAM cells with varying data retention times and write performance, made possible by novel magnetic tunneling junction (MTJ) designs. For L1 caches, where speed is of utmost importance, we propose a scheme that uses fast STT-RAM cells with reduced data retention time, coupled with a dynamic refresh scheme. In the dynamic refresh scheme, another emerging technology, the memristor, serves as the counter that monitors the data retention of the low-retention STT-RAM, achieving higher array area efficiency than an SRAM-based counter. For lower-level caches with relatively larger capacities, we propose a design with partitions of different retention characteristics and a data migration scheme that moves data between these partitions. Experiments show that, on average, our proposed multiretention-level STT-RAM cache reduces total energy by 30%-74.2% compared to previous single-retention-level STT-RAM caches, while improving instructions-per-cycle performance for both two-level and three-level cache hierarchies. © 2013 IEEE.
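The dynamic refresh idea described in the abstract can be illustrated with a minimal software sketch. This is a hypothetical model, not code from the paper: class and parameter names (`LowRetentionCache`, `retention_ticks`) are invented for illustration, and the per-line counter is modeled as a plain integer, whereas the paper's design realizes it with a memristor rather than SRAM.

```python
# Hypothetical sketch of retention-aware refresh for a low-retention STT-RAM
# cache: each line carries a counter tracking how long its data has been
# stored; once the counter reaches the retention limit, the line is
# refreshed (rewritten) before the stored value decays. Ordinary writes
# also reset the counter, since they rewrite the cell.

class LowRetentionCache:
    def __init__(self, num_lines, retention_ticks):
        self.retention_ticks = retention_ticks  # ticks before data decays
        self.counters = [0] * num_lines         # age of each line's data
        self.refreshes = 0                      # refresh writes performed

    def write(self, line):
        # A normal write rewrites the cell, so its age resets to zero.
        self.counters[line] = 0

    def tick(self):
        # Advance time by one interval; refresh any line about to expire.
        for line, age in enumerate(self.counters):
            if age + 1 >= self.retention_ticks:
                self.refreshes += 1             # rewrite before decay
                self.counters[line] = 0
            else:
                self.counters[line] = age + 1

cache = LowRetentionCache(num_lines=4, retention_ticks=3)
for _ in range(6):
    cache.tick()
# With no intervening writes, each of the 4 lines is refreshed every
# 3 ticks, so 6 ticks produce 8 refreshes.
print(cache.refreshes)  # → 8
```

The refresh cost this models is exactly what the paper's multiretention partitioning trades against: frequently rewritten data tolerates a short retention time (few refreshes needed), while long-lived data is better migrated to a high-retention partition.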
|Short Title||IEEE Transactions on Very Large Scale Integration (VLSI) Systems|