Designing Parallel Sparse Matrix Transposition Algorithm Using CSR for GPUs




Abstract In this chapter, we propose a parallel algorithm for sparse matrix transposition in CSR format on many-core GPUs, exploiting the tremendous computational power and memory bandwidth that the GPU offers through parallel programming in CUDA. Our code is run on a quad-core Intel Xeon64 E5507 CPU platform and an NVIDIA GTX 470 GPU card. We measure the performance of our algorithm on inputs ranging from smaller to larger matrices; the preliminary experimental results scale well up to 512 threads and are promising for larger matrices. Keywords CUDA • Sparse matrix transposition • Parallel programming

1 Introduction Matrix transposition is a basic operation in linear algebra found in software packages such as MATLAB and LINPACK. The most recent work on parallel sparse matrix–transpose–vector multiplication using CSB was proposed by Buluç et al. [2]; it is implemented in Cilk++ [3] in a recursive style and requires no actual matrix transposition. However, porting it to the GPU would take considerable effort. Krishnamoorthy et al. [4] proposed an efficient parallel algorithm for out-of-core matrix transposition, where a large dense matrix is stored on disk. When main memory cannot hold the whole matrix, the transposition is done by reading one part of the matrix at a time, and the goal is to minimize the

T.-H. Weng (*) • H. Pham • K.-C. Li, Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan, e-mail: [email protected]. H. Jiang, Department of Computer Science, Arkansas State University, Jonesboro, AR, USA. J. Juang and Y.-C. Huang (eds.), Intelligent Technologies and Engineering Systems, Lecture Notes in Electrical Engineering 234, DOI 10.1007/978-1-4614-6747-2_31, © Springer Science+Business Media New York 2013


Fig. 1 A matrix example and its CSR sparse format

number of I/O operations. Mateescu et al. [5] developed a dense matrix transpose for the POWER7 by exploiting its cache model and prefetching techniques. Stathis et al. [6] proposed parallel matrix transposition using special hardware mechanisms on a vector architecture. Gustavson et al. [7] proposed two fast serial algorithms for sparse matrix multiplication and permuted transposition using CSR, but they were not parallelized. Among the sparse formats, CSR (compressed sparse row) and CSC (compressed sparse column) are the simplest and the easiest to understand and maintain. The CSR sparse format consists of three arrays, namely, val, idx, and ptr, as shown in Fig. 1. Let A be a sparse matrix of order r by c, and let NNZ be the number of nonzero elements of A. The values of the nonzero elements are stored in a double-precision array val of length NNZ. The column index of each nonzero element is stored in the integer array idx of length NNZ. The pointers to the start of each row within idx and val are stored in the integer array ptr of length r. Our algorithm converts a sparse matr