VWS: A versatile warp scheduler for exploring diverse cache localities of GPGPU applications

Abstract

Massive multithreading in GPGPUs demands efficient use of caches with limited capacity. In this work, we propose a versatile warp scheduler (VWS) to reduce the cache miss rate in GPGPUs. VWS retains intra-warp cache locality using an efficient per-warp working set estimator and enhances intra- and inter-cooperative thread array (CTA) cache locality by imposing a CTA-aware scheduling policy and a new CTA dispatching mechanism. The significantly improved hit rate of the cache hierarchy enables VWS to achieve, on average, 38.4% and 9.3% IPC improvement across diverse GPGPU applications compared to a widely used warp scheduler and a state-of-the-art warp scheduler, respectively.

DOI
10.1145/2744769.2744931
Year