Integrating software and hardware hierarchies in an autotuning method for parallel routines in heterogeneous clusters

Jesús Cámara1 · Javier Cuenca1 · Domingo Giménez2

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

A hierarchical approach for autotuning linear algebra routines on heterogeneous platforms is presented. Hierarchy helps to alleviate the difficulties of tuning parallel routines for high-performance computing systems. This paper analyzes the application of the hierarchical approach at both the hardware and software levels, using the basic matrix multiplication and the Strassen multiplication as proof of concept on multicore+coprocessor nodes. In this way, the hierarchical approach allows partial delegation of the efficient exploitation of the computing units in the node to the underlying direct autotuned matrix multiplication used in the base case.

Keywords  Autotuning · Hybrid programming · Heterogeneous computing · Multicore · Manycore
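The delegation described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It shows a Strassen multiplication whose base case hands the work to a direct matrix multiplication. The names `direct_matmul`, `strassen` and `base_size` are hypothetical: `direct_matmul` stands in for an autotuned direct routine, and `base_size` for a recursion threshold that an autotuning process might select. Pure Python lists are used only to keep the example self-contained.

```python
# Illustrative sketch only: Strassen multiplication that delegates its base
# case to a direct multiplication. All names here are hypothetical.

def direct_matmul(a, b):
    """Direct O(n^3) product; stands in for an autotuned direct routine."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two matrices of equal shape."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split(a):
    """Split a square matrix of even order into its four quadrants."""
    h = len(a) // 2
    return ([r[:h] for r in a[:h]], [r[h:] for r in a[:h]],
            [r[:h] for r in a[h:]], [r[h:] for r in a[h:]])

def strassen(a, b, base_size=2, multiply=direct_matmul):
    """Multiply two square matrices of the same order.

    Recursion stops when the order is at most base_size (or is odd), and
    the remaining product is delegated to `multiply`.
    """
    n = len(a)
    if n <= base_size or n % 2:
        return multiply(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    def rec(x, y):
        return strassen(x, y, base_size, multiply)

    # The seven Strassen products.
    m1 = rec(add(a11, a22), add(b11, b22))
    m2 = rec(add(a21, a22), b11)
    m3 = rec(a11, sub(b12, b22))
    m4 = rec(a22, sub(b21, b11))
    m5 = rec(add(a11, a12), b22)
    m6 = rec(sub(a21, a11), add(b11, b12))
    m7 = rec(sub(a12, a22), add(b21, b22))

    # Combine the products into the quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

In this sketch the recursive level only decides how deep to recurse, while the choice of how to execute each base-case product is left entirely to the routine passed as `multiply`.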

1 Department of Engineering and Technology of Computers, University of Murcia, Murcia, Spain
2 Department of Computing and Systems, University of Murcia, Murcia, Spain
* Corresponding author: Javier Cuenca

1 Introduction

Today, standard computational nodes include one multicore CPU together with one or more coprocessors, typically GPUs and/or Many Integrated Core (MIC) devices such as the Intel Xeon Phi. The basic computational components of these nodes have different architectures and computational capacities; they can therefore be organized and managed hierarchically, with the basic computing units (CPU, GPU and MIC) having separate memory spaces and communicating through data transfers between the memory associated with the CPU and the memories of the coprocessors. This heterogeneous and hierarchical organization makes it difficult to exploit routines efficiently on these nodes and requires techniques that take advantage of the underlying heterogeneity and hierarchy.

In addition, linear algebra routines are widely used as basic computational kernels in scientific software, and their optimization for today’s standard heterogeneous nodes would lead to important improvements when solving scientific problems based on highly efficient linear algebra libraries such as MKL [16], PLASMA [19], MAGMA [1] and Chameleon [8], whose routines base their optimization on implementations by blocks or tiles in which the basic kernel is a highly optimized matrix multiplication [13]. The matrix multiplication has been widely researched, and there are now many highly efficient implementations for today’s systems [14, 15, 17]. As with computational systems, the optimization of linear algebra routines has traditionally been based on a hierarchical schema [6], with a set of basic linear algebra routines (BLAS) and higher-level routines (LAPACK) developed by blocks or tiles. A hierarchical and decentralized schema can be applied for the automatic optimization of linear algebra sof