Piecewise Execution of Nested Parallel Programs - A Thread-Based Approach

In this paper, a multithreaded implementation technique for piecewise execution of memory-intense nested data parallel programs is presented. The execution model and some experimental results are described.

  • PDF / 135,114 Bytes
  • 4 Pages / 431 x 666 pts Page_size
  • 23 Downloads / 205 Views

DOWNLOAD

REPORT


Abstract. In this paper, a multithreaded implementation technique for piecewise execution of memory-intense nested data parallel programs is presented. The execution model and some experimental results are described.

1

Introduction

The flattening transformation introduced by Blelloch & Sabot allows us to implement nested data parallelism while exploiting all the parallelism contained in nested programs [2]. The resulting programs usually have a high degree of excess parallelism, which can lead to increased memory requirements. E.g., in many programs, operations known as generators produce large vectors, which are reduced to smaller vectors (or scalars) by accumulators almost immediately afterwards. High memory consumption is one of the most serious problems of nested data parallel languages. Palmer, Prins, Chatterjee & Faith introduced an implementation technique known as Piecewise Execution [4]. Here, the vector operations work on vector pieces of constant size only. In this way, low memory bounds for a certain class of programs can be guaranteed. One algorithm in this class is, e.g. the calculation of the maximum force between any two particles in N-body computations [1]. In this paper, we propose a multi-threaded implementation technique of piecewise execution. Constant memory usage can be guaranteed for a certain class of programs. Furthermore, the processor cache is utilized well.

2

Piecewise Execution

The Nesl program presented in Figure 1 is a typical example of a matching generator/accumulator pair. The operation [s:e] enumerates all integers from s to e, plus scan calculates all prefix sums, {x * x : x in b} denotes the elementwise parallel squaring of b, and sum sums up all values of a vector in parallel. Typically, there is a lot of excess parallelism in such expressions that must be partly sequentialized to match the size of the parallel machine. However, if the generator executes in one go, it produces a vector whose size is proportional to the degree of parallelism. In many cases, then, memory consumption is so high that the program cannot execute. P. Amestoy et al. (Eds.): Euro-Par’99, LNCS 1685, pp. 445–448, 1999. c Springer-Verlag Berlin Heidelberg 1999

446

W. Pfannenstiel Absolute Speedups 18 "par_piecewise_fusion_100M" "par_piecewise_library_100M" "par_fusion_100M" "par_library_100M"

16 14 12 Speedup

f unction SumSqScan (s, e) = let a = [s : e]; b = plus scan (a); c = {x ∗ x : x in b} in sum (c);

10 8 6 4 2 0 0

5

10

15

20

25

Processors

Figure 1: Example program

Figure 2: Speedups

In piecewise execution, each operation receives only a piece, of constant size, of its arguments at a time. The operation consumes the piece to produce (a part of) its result. If the piece of input is completely consumed, the next piece of input is requested. Once a full piece of output is produced, it is passed down as an input piece to the next vector operation. Thus, large intermediate vectors never exist in their full length. However, some means of control are needed to keep track of the computat