Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors

  • PDF / 1,737,998 Bytes
  • 16 Pages / 439.37 x 666.142 pts Page_size
  • 109 Downloads / 173 Views

DOWNLOAD

REPORT


Tall‑and‑skinny QR factorization with approximate Householder reflectors on graphics processors Andrés E. Tomás1,2   · Enrique S. Quintana‑Ortí3

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. This approach is very competitive on a hybrid platform equipped with a graphics processor, with a performance advantage over the conventional factorization due to the reduced amount of data transfers between the graphics accelerator and the main memory of the host. Our experiments show that, for tall–skinny matrices, the new approach outperforms the code in MAGMA by a large margin, while it is very competitive for square matrices when the memory transfers and CPU computations are the bottleneck of the Householder QR factorization. Keywords  QR factorization · Tall-and-skinny matrices · GPU · Householder vector · Look-ahead · High performance

1 Introduction There exist relevant applications that require the computation of an orthonormal basis for a relatively small set of very long vectors, which form a “talland-skinny” (TS) matrix. This type of problem appears, among others, in the

* Andrés E. Tomás [email protected] Enrique S. Quintana‑Ortí [email protected] 1

Dept. d’Enginyeria i Ciència dels Computadors, Universitat Jaume I, 12071 Castelló de la Plana, Spain

2

Dept. de Sistemes Informàtics i Computació, Universitat Politècnica de València, 46022 Valencia, Spain

3

Dept. d’Informàtica de Sistemes i Computadors, Universitat Politècnica de València, 46022 Valencia, Spain



13

Vol.:(0123456789)



A. E. Tomás, E. S. Quintana‑Ortí

orthogonalization in Krylov subspace methods  [15]; in the analysis of big data applications characterized with a few descriptors only (e.g. large data sets with a few variables produced by long data acquisitions of several sensors) [4, 12]; and as a preprocessing step when computing the singular value decomposition (SVD) of a matrix [11] with many more rows than columns. The conventional blocked QR factorization based on Householder reflectors  [11], hereafter QRF-H, is not an efficient algorithm for the factorization of tall-and-skinny (TS) matrices on modern parallel processors; see, e.g.  [6]. The reason is that, for matrices with a large number of rows but few columns, the fraction of work of QRF-H that can be cast in terms of kernels from the Level-3 of BLAS (basic linear algebra subprograms  [7]), as part of the highly parallel and efficient trailing update, cannot compensate the overwhelming cost of the panel factorization, which is performed via the much slower Level-1 and Level-2 BLAS. The TSQR-H [6] algorithm follows the ideas of the incremental QRF-H [12] to obtain a method that is especially appealing for the factorization of TS matrices. In this procedure, the panel is split into small “square-like” blocks, and a QRF-H is computed for each block. These small QRF-H are then merged