Optimization of Collective Communication in Intra-Cell MPI

1 Dept. of Mathematics and Computer Science, Sri Sathya Sai University
2 IBM, Austin
3 Dept. of Computer Science, Florida State University
[email protected]

Abstract. The Cell is a heterogeneous multi-core processor with eight coprocessors, called SPEs. The SPEs can access a common shared main memory through DMA, and each SPE can directly operate on a small distinct local store. An MPI implementation can use each SPE as if it were a node for an MPI process. In this paper, we discuss the efficient implementation of collective communication operations for intra-Cell MPI, both for cores on a single chip and for a Cell blade. While we have implemented all the collective operations, we describe the following in detail: barrier, broadcast, and reduce. The main contributions of this work are (i) describing our implementation, which achieves low latencies and high bandwidths using the unique features of the Cell, and (ii) comparing different algorithms and evaluating the influence of the architectural features of the Cell processor on their effectiveness.

Keywords: Cell processor, MPI, heterogeneous multi-core processor.

1 Introduction

The Cell is a heterogeneous multi-core processor from Sony, Toshiba, and IBM. There has been much interest in using it in high-performance computing due to the high flop rates it provides. However, applications need significant changes to fully exploit the novel architecture. To deal with this programming difficulty, a few different models for the use of MPI on the Cell have been proposed, as explained later. In all of these, it is necessary to implement collective communication operations efficiently within each Cell processor or blade. In this paper, we describe the efficient implementation of a variety of algorithms for a few important collective communication operations, and evaluate their performance. The outline of the rest of the paper is as follows. In §2, we describe the architectural features of the Cell that are relevant to the MPI implementation, and MPI-based programming models for the Cell. We explain features common to our implementations in §3.1. We then describe the implementations and evaluate the performance of MPI_Barrier, MPI_Bcast, and MPI_Reduce in §3.2, §3.3, and §3.4, respectively.


We summarize our conclusions in §4. Further details on this work are available in a technical report [4].
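
For reference, the sketch below shows the three collectives studied in this paper invoked through the standard MPI C bindings. It illustrates only the interface whose Cell-specific implementation is discussed later; the values broadcast and reduced are arbitrary.

    /* Minimal reference usage of the three collectives studied here,
     * using the standard MPI C bindings. This illustrates only the
     * interface, not the intra-Cell implementation. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value, sum;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);       /* synchronize all ranks */

        value = (rank == 0) ? 42 : 0;
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* root 0 to all */

        MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum = %d\n", sum);

        MPI_Finalize();
        return 0;
    }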

2 Cell Architecture and MPI Based Programming Models

Architecture. Figure 1 shows an overview of the Cell processor. It consists of a cache-coherent PowerPC core (PPE), which controls eight SIMD cores called Synergistic Processing Elements (SPEs). All cores run at 3.2 GHz and execute instructions in order. The Cell has a 512 MB to 2 GB external main memory, and an XDR memory controller provides access to it at a rate of 25.6 GB/s. The PPE, SPEs, DRAM controller, and I/O controllers are all connected via four data rings, collectively known as the Element Interconnect Bus (EIB).
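
To make the DMA mechanism concrete, the sketch below shows an SPE pulling a block of main memory into its local store using the MFC intrinsics from the Cell SDK (spu_mfcio.h). The buffer size, tag number, and effective address here are illustrative assumptions, not values from this paper.

    /* Illustrative SPE-side DMA: fetch a block from main memory into
     * the local store and wait for completion. Buffer size, alignment,
     * and tag are assumptions chosen for this sketch. */
    #include <spu_mfcio.h>

    #define CHUNK 16384   /* 16 KB, the maximum size of a single DMA */

    static volatile char buf[CHUNK] __attribute__((aligned(128)));

    void fetch_chunk(unsigned long long ea)  /* effective address in main memory */
    {
        const unsigned int tag = 0;

        /* Issue the DMA get: local-store address, effective address,
         * size, tag, transfer class id, replacement class id. */
        mfc_get(buf, ea, CHUNK, tag, 0, 0);

        /* Block until all DMAs issued with this tag have completed. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }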