Towards efficient tile low-rank GEMM computation on Sunway many-core processors


Qingchang Han1 · Hailong Yang1,2 · Ming Dun3 · Zhongzhi Luan1 · Lin Gan4 · Guangwen Yang4 · Depei Qian1

Accepted: 29 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Tile low-rank general matrix multiplication (TLR GEMM) is a novel method of matrix multiplication on large data-sparse matrices, which can significantly reduce the storage footprint and arithmetic complexity under a given accuracy. To implement high-performance TLR GEMM on the Sunway many-core processor, the following challenges remain to be addressed: (1) designing an efficient parallel scheme; (2) providing an efficient kernel library of the math functions commonly used in TLR GEMM. This paper proposes swTLR GEMM, an efficient implementation of TLR GEMM. We assign LR GEMM computation to a single computing processing element (CPE) and use a grouped task queue to process different data tiles of the TLR matrix. Moreover, we implement an efficient kernel library (swLR Kernels) for low-rank matrix operations. To scale to massive core groups (CGs), we organize the CGs into a CG grid and partition the matrices into blocks accordingly. We also apply Cannon's algorithm to enable efficient communication when processing the matrix blocks across CGs simultaneously. The experiment results show that the DGEMM kernel in swLR Kernels achieves 102× speedup on average. In terms of overall performance, swTLR GEMM-LLD and swTLR GEMM-LLL achieve 91× and 20.1× speedup on average, respectively. In addition, our implementation of swTLR GEMM exhibits good scalability when running on 1,024 CGs of Sunway processors (66,560 cores in total).

Keywords  Tile low-rank GEMM · Performance optimization · Sunway architecture · Scalability

* Hailong Yang
  [email protected]

Extended author information is available on the last page of the article.


1 Introduction

In recent years, as the volume of data continues to increase in scientific computing, the surging demand for computation and storage has hindered further improvement of application performance. H-matrices [1] use data-sparse approximation to represent non-sparse matrices, which can significantly reduce the storage footprint of the matrices and the arithmetic complexity of related operations [2]. Thus, H-matrices have been widely applied to integral equations and elliptic partial differential equations. However, the parallelism of H-matrices on many-core processors is limited by their recursive nature. The tile low-rank (TLR) format [3] is an approximate matrix representation in which the off-diagonal tiles are compressed into low-rank (LR) format. It flattens the recursive trees [4] of the H-matrix, which makes it easier to parallelize. General matrix multiplication (GEMM) [5] is the most important numerical kernel in scientific computing and other fields. Compared to dense GEMM, GEMM in TLR format can reduce the number of calculations by more than an order of magnitude. Therefore, there
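To make the arithmetic saving concrete, the sketch below shows how the product of two LR tiles can be evaluated through their low-rank factors to produce a dense result tile (the flavor that swTLR GEMM-LLD targets). This is a minimal illustration only: it assumes row-major storage, V factors kept in transposed form, and a naive inner GEMM; the function names and interface are hypothetical and do not correspond to the swLR Kernels API described in the paper.

```c
#include <stdlib.h>

/* Naive row-major GEMM: C(m x n) += A(m x k) * B(k x n).
 * A stand-in for an optimized DGEMM kernel; not the swLR Kernels routine. */
static void naive_gemm(int m, int n, int k,
                       const double *A, const double *B, double *C)
{
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

/* Hypothetical LR x LR -> dense tile product.
 * A (m x k) is approximated by Ua (m x ra) and Va (ra x k);
 * B (k x n) by Ub (k x rb) and Vb (rb x n), with the V factors stored
 * already transposed (an assumed layout, not necessarily the paper's).
 * Evaluating Ua * ((Va * Ub) * Vb) costs on the order of
 * ra*rb*k + ra*rb*n + m*ra*n flops instead of m*n*k for the dense
 * product, which is where the saving comes from when ra, rb << m, n, k. */
void lr_lr_dense_gemm(int m, int n, int k, int ra, int rb,
                      const double *Ua, const double *Va,
                      const double *Ub, const double *Vb,
                      double *C /* m x n, accumulated into */)
{
    double *T1 = calloc((size_t)ra * rb, sizeof(double)); /* Va * Ub */
    double *T2 = calloc((size_t)ra * n,  sizeof(double)); /* T1 * Vb */

    naive_gemm(ra, rb, k, Va, Ub, T1);   /* (ra x k)  * (k x rb)  */
    naive_gemm(ra, n, rb, T1, Vb, T2);   /* (ra x rb) * (rb x n)  */
    naive_gemm(m, n, ra, Ua, T2, C);     /* (m x ra)  * (ra x n)  */

    free(T1);
    free(T2);
}
```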