Memory-Optimized Wavefront Parallelism on GPUs
Yuanzhe Li · Loren Schwiebert
Computer Science Department, Wayne State University, Detroit, USA
Received: 7 June 2019 / Accepted: 10 March 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Wavefront parallelism is a well-known technique for exploiting the concurrency of applications that execute nested loops with uniform data dependencies. Recent research on such applications, which range from sequence alignment tools to partial differential equation solvers, has used GPUs to benefit from their massively parallel computing resources. To achieve optimal performance, tiling has been introduced as a popular solution for achieving a balanced workload. However, the use of hyperplane tiles increases the cost of synchronization and leads to poor data locality. In this paper, we present a highly optimized implementation of the wavefront parallelism technique that harnesses the GPU architecture. A balanced workload and maximum resource utilization are achieved with an extremely low synchronization overhead. We design the kernel configuration to significantly reduce the minimum number of synchronizations required and also introduce an inter-block lock to minimize the overhead of each synchronization. In addition, shared memory is used in place of the L1 cache. The well-tailored mapping of the operations to the shared memory improves both spatial and temporal locality. We evaluate the performance of our proposed technique on four different applications: sequence alignment, edit distance, summed-area table, and 2D-SOR. The performance results demonstrate that our method achieves speedups of up to six times compared to the previous best-known hyperplane tiling-based GPU implementation.

Keywords Wavefront parallelism · GPU · Nested loop · Shared memory · Hyperplane tile
1 Introduction

Modern GPUs are widely deployed to solve parallel applications that fit the SIMD processing paradigm. Because of their massive number of cores, achieving high memory efficiency is especially important to achieving full
processor occupancy on GPUs, which can be improved through coalesced memory access patterns and data reuse in on-chip memory. However, optimizing memory accesses is challenging for applications that have unaligned or nonconsecutive data access patterns, as wavefront parallelism does. Wavefront parallelism is a technique for exploiting parallelism in nested loops. In a two-dimensional matrix, the computation proceeds along diagonal waves, because each data entry is updated based on already updated neighboring entries. During the computation, the execution of the wave iterations is serialized to ensure that the data entries are updated correctly, while the data entries within each wave iteration can be processed concurrently. Therefore, data dependencies prevent consecutively stored data from being processed in parallel; parallel processing of the data along a wave instead touches noncontiguous memory locations, as the sketch below illustrates.
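To make this baseline execution pattern concrete (this is not the optimized scheme developed in this paper), the following CUDA sketch computes edit distance, one of the four applications evaluated here, with one kernel launch per anti-diagonal wave; the kernel boundary itself serializes consecutive waves. The names `waveKernel` and `editDistanceWavefront`, the launch configuration, and the assumption that the caller pre-initializes the boundary row and column of `d` are all illustrative choices, not details from the paper.

```cuda
// Baseline wavefront sketch: entries (i, j) with i + j == wave are
// mutually independent, so each kernel launch updates one anti-diagonal
// of an (n+1) x (n+1) dynamic-programming matrix in parallel.
#include <cuda_runtime.h>
#include <algorithm>

__global__ void waveKernel(int *d, const char *a, const char *b,
                           int n, int wave) {
    // Map this thread to one entry (i, wave - i) on the current diagonal.
    int i = blockIdx.x * blockDim.x + threadIdx.x + max(1, wave - n);
    int j = wave - i;
    if (i < 1 || i > n || j < 1 || j > n) return;

    int w = n + 1;  // row stride of the matrix
    int sub = (a[i - 1] == b[j - 1]) ? 0 : 1;
    // Each entry depends only on neighbors updated in earlier waves.
    d[i * w + j] = min(min(d[(i - 1) * w + j] + 1,       // deletion
                           d[i * w + (j - 1)] + 1),      // insertion
                       d[(i - 1) * w + (j - 1)] + sub);  // substitution
}

// d, a, b are device pointers; row 0 and column 0 of d are assumed to be
// initialized to 0..n by the caller before the first wave runs.
void editDistanceWavefront(int *d, const char *a, const char *b, int n) {
    for (int wave = 2; wave <= 2 * n; ++wave) {
        int len = std::min(wave - 1, 2 * n - wave + 1);  // entries on this diagonal
        int threads = 256;
        int blocks = (len + threads - 1) / threads;
        // The launch boundary acts as a global synchronization between waves.
        waveKernel<<<blocks, threads>>>(d, a, b, n, wave);
    }
    cudaDeviceSynchronize();
}
```

Note how this baseline exhibits both costs discussed above: consecutive threads on a diagonal read entries that are nearly a full row stride apart, so their loads cannot be coalesced, and the 2n − 1 kernel launches impose one global synchronization per wave.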