Optimizing non-coalesced memory access for irregular applications with GPU computing



Frontiers of Information Technology & Electronic Engineering, 2020 21(9):1285-1301
www.jzus.zju.edu.cn; engineering.cae.cn; www.springerlink.com
ISSN 2095-9184 (print); ISSN 2095-9230 (online); E-mail: [email protected]

Optimizing non-coalesced memory access for irregular applications with GPU computing*

Ran ZHENG‡1,2,3,4, Yuan-dong LIU1,2,3,4, Hai JIN1,2,3,4

1 National Engineering Research Center for Big Data Technology and System, Huazhong University of Science and Technology, Wuhan 430074, China
2 Services Computing Technology and System Lab, Huazhong University of Science and Technology, Wuhan 430074, China
3 Cluster and Grid Computing Lab, Huazhong University of Science and Technology, Wuhan 430074, China
4 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China

E-mail: [email protected]; [email protected]; [email protected]

Received May 24, 2019; Revision accepted May 31, 2020; Crosschecked Aug. 10, 2020

Abstract: General purpose graphics processing units (GPGPUs) can considerably improve computing performance for regular applications. However, many applications exhibit irregular memory access, and the benefits of graphics processing units (GPUs) are less substantial for such irregular applications. In recent years, several studies have presented solutions that remove static irregular memory access; eliminating dynamic irregular memory access in software, however, remains a serious challenge. A pure software solution, requiring neither hardware extensions nor offline profiling, is proposed to eliminate dynamic irregular memory access, especially indirect memory access. Data reordering and index redirection are used to reduce the number of memory transactions and thereby improve the performance of GPU kernels. To make data reordering efficient, the reordering operation is offloaded to the GPU, which reduces the overhead of data transfer. By concurrently executing the compute unified device architecture (CUDA) streams of data reordering and of the data-processing kernel, the overhead of data reordering can be further hidden. After these optimizations, the volume of memory transactions is reduced by 16.7%–50% compared with CUSPARSE-based benchmarks, and the performance of irregular kernels improves by 9.64%–34.9% on an NVIDIA Tesla P4 GPU.

Key words: General purpose graphics processing units; Memory coalescing; Non-coalesced memory access; Data reordering
https://doi.org/10.1631/FITEE.1900262
CLC number: TP319
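The core idea of data reordering plus index redirection can be illustrated with a minimal host-side sketch (a simplification in plain Python; the function and variable names are mine, not the paper's, and the actual work is done on the GPU in the proposed scheme). For an indirect pattern where thread i reads data[idx[i]], a one-time gather pass rearranges the data into thread order, after which the redirected indices are fully sequential, so the per-thread reads of neighboring threads fall into the same memory transaction:

```python
def reorder_for_coalescing(data, idx):
    """Hypothetical illustration of the two steps described in the abstract.

    Data reordering: gather each accessed element once, in thread order.
    Index redirection: thread i afterwards reads data_new[idx_new[i]],
    where idx_new is strictly sequential, i.e., a coalescible pattern.
    """
    data_new = [data[j] for j in idx]      # one-time (non-coalesced) gather
    idx_new = list(range(len(idx)))        # redirected, fully sequential indices
    return data_new, idx_new


# The transformation preserves the values every "thread" observes:
data = [10.0, 20.0, 30.0, 40.0]
idx = [3, 0, 3, 1]                         # scattered indirect accesses
data_new, idx_new = reorder_for_coalescing(data, idx)
assert [data_new[k] for k in idx_new] == [data[j] for j in idx]
assert idx_new == [0, 1, 2, 3]             # consecutive, hence coalescible
```

The gather itself is still irregular, which is why the paper offloads it to the GPU and overlaps it with the processing kernel in a separate CUDA stream: its cost is paid once (and hidden), while every subsequent kernel access is coalesced.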

1 Introduction

In recent years, general purpose graphics processing units (GPGPUs) have become an integral part of modern system architectures. Given their massively parallel architecture, graphics processing units (GPUs) can significantly accelerate many reg-

‡ Corresponding author
* Project supported by the National Key Research and Development Program of China (No. 2018YFB1003500)
ORCID: Ran ZHENG, https://orcid.org/0000-0002-3058-7581
© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2020