SDAM: a combined stack distance-analytical modeling approach to estimate memory performance in GPUs



Mohsen Kiani¹ · Amir Rajabzadeh¹

Accepted: 21 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Graphics processing units (GPUs) are powerful in performing data-parallel applications. Such applications most often rely on the GPU's memory hierarchy to deliver high performance. Designing an efficient memory hierarchy for GPUs is a challenging task because of its wide architectural space. To moderate this challenge, this paper proposes a framework, called stack distance-analytic modeling (SDAM), to estimate the memory performance of a GPU in terms of memory cycle counts. Providing the input data to the model is crucial, both in terms of the accuracy of that data and the time spent to obtain it. SDAM employs the stack distance analysis method and analytical modeling to obtain the required input accurately and swiftly. Further, it employs a detailed analytical model to estimate memory cycles. SDAM is validated against real GPU executions and compared with a cycle-accurate simulator. The experimental evaluations, performed on a set of memory-intensive benchmarks, show that SDAM is faster and more accurate than cycle-accurate simulation; thus, it can facilitate GPU cache design-space exploration. For a selection of data-intensive benchmarks, SDAM showed a 32% average error in estimating memory data transfer cycles in a modern GPU, outperforming cycle-accurate simulation while being an order of magnitude faster. Finally, the applicability of SDAM to exploring the cache design space in GPUs is demonstrated through experiments with various cache designs.

Keywords GPU · Design-space exploration · Cache memory · Performance modeling · Stack distance analysis · Miss rate
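The stack distance analysis that SDAM builds on can be illustrated with a minimal sketch (the function name and trace are illustrative, not taken from the paper): the stack distance of a memory reference is the number of distinct addresses touched since the previous reference to the same address, and for a fully associative LRU cache of capacity C, a reference hits exactly when its stack distance is below C.

```python
def stack_distances(trace):
    """Compute the LRU stack distance of each reference in an address trace.

    A reference's stack distance is the number of distinct addresses
    accessed since the last reference to the same address; first-time
    (compulsory-miss) references get None. In a fully associative LRU
    cache with C lines, a reference hits iff its stack distance < C.
    """
    stack = []   # LRU stack: most recently used address at the end
    dists = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Distinct addresses above this one in the LRU stack
            dists.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            dists.append(None)  # cold miss
        stack.append(addr)      # addr becomes most recently used
    return dists

# Example: derive the miss count of a 2-line fully associative LRU cache
trace = ['a', 'b', 'c', 'a', 'b', 'a']
d = stack_distances(trace)  # [None, None, None, 2, 2, 1]
C = 2
misses = sum(1 for x in d if x is None or x >= C)  # 5 misses out of 6
```

One pass over the trace thus yields the miss rate of every LRU cache capacity at once, which is what makes stack distance profiles attractive as model input for design-space exploration.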

* Amir Rajabzadeh, [email protected]



Department of Computer Engineering and Information Technology, Razi University, Kermanshah, Iran






1 Introduction

Graphics processing units (GPUs) have become a top-class processing unit in modern computing systems [38]. GPUs typically embody thousands of processing cores. Further, the achievable thread-level parallelism in GPUs can be orders of magnitude higher than their physical core count, making them suitable for data-parallel applications. Consequently, the large number of threads running simultaneously on a GPU can impose significant pressure on the GPU's memory hierarchy. Thus, for data-parallel and memory-intensive GPU applications that generate excessive memory traffic, the memory system is considered the major performance bottleneck [15]. Currently, two levels of hardware-managed cache memories are employed in GPUs to enhance their memory performance. Therefore, the cache memory hierarchy of GPUs significantly impacts the performance obtained by GPU applications. However, the capacity of cache memories with respect to the number of concurrent thre