Micro-benchmarks for Cluster OpenMP Implementations: Memory Consistency Costs



Abstract. The OpenMP memory model allows for a temporary view of shared memory that only needs to be made consistent when barrier or flush directives, including those that are implicit, are encountered. While this relaxed memory consistency model is key to developing cluster OpenMP implementations, it means that the memory performance of any given implementation is greatly affected by which memory is used, when it is used, and by which threads. In this work we propose a micro-benchmark that can be used to measure memory consistency costs and present results from its application to two contrasting cluster OpenMP implementations, comparing these results with data obtained from a hardware-supported OpenMP environment.
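
As an illustration of this model, the following minimal C sketch (a variant of the well-known producer/consumer flush idiom; the variable names are our own) shows how explicit flush directives are required before one thread's writes are guaranteed to appear in another thread's temporary view:

  #include <stdio.h>
  #include <omp.h>

  /* Shared variables: each thread may cache them in its temporary view. */
  int data = 0;
  int flag = 0;

  int main(void)
  {
      #pragma omp parallel num_threads(2)
      {
          if (omp_get_thread_num() == 0) {
              data = 42;
              /* Make the write to data visible before signalling. */
              #pragma omp flush(data, flag)
              flag = 1;
              #pragma omp flush(flag)
          } else {
              int ready = 0;
              while (!ready) {
                  /* Refresh this thread's temporary view of flag. */
                  #pragma omp flush(flag)
                  ready = flag;
              }
              /* Pick up the producer's write to data. */
              #pragma omp flush(data)
              printf("data = %d\n", data);   /* prints 42 */
          }
      }
      return 0;
  }

Between the flushes each thread is free to work from a stale temporary view, and it is exactly this freedom that a cluster implementation can exploit to defer communication; the price is paid whenever consistency must be enforced.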

1 Introduction

Micro-benchmarks are synthetic programs designed to stress and measure the overheads associated with specific aspects of a hardware and/or software system. The information provided by micro-benchmarks is frequently used, for example, to improve the design of the system, to compare the performance of different systems, or to provide input to more complex models that attempt to rationalize the runtime behaviour of a complex application that uses the system.

In the context of the OpenMP (OMP) programming paradigm, significant effort has been devoted to developing various micro-benchmark suites. For instance, shortly after its introduction Bull [3,4] proposed a suite of benchmarks designed to measure the overheads associated with the various synchronization, scheduling and data environment preparation OMP directives. Other notable OMP-related work includes that of Sato et al. [16] and Müller [12]. All existing OMP micro-benchmarks have been developed within the context of an underlying hardware shared memory system. This is understandable given that the vast majority of OMP applications are currently run on hardware shared memory systems, but it is now timely to reassess the applicability of these micro-benchmarks to other OMP implementations.

Specifically, for many years there has been interest in running OMP applications on distributed memory hardware such as clusters [14,8,9,2,7]. Most of these implementations have been experimental and of a research nature, but recently Intel released a commercial product that supports OMP over a cluster – Cluster OpenMP (CLOMP) [6]. This interest, plus the advent of new network technologies that offer exceptional performance and advanced features such as Remote Direct Memory Access (RDMA), stands to make software OMP implementations considerably more common in the future. It is also likely that the division between hardware and software will become increasingly blurred; for example, in recent work Zeffer and Hagersten [19] have proposed a set of simplified hardware primitives for multi-core chips that can be used to support software-implemented inter-node coherence. This paper proposes a micro-benchmark for measuring memory consistency costs and applies it to two contrasting cluster OpenMP implementations.
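
To make the measurement strategy concrete, the following sketch follows the spirit of Bull's overhead benchmarks (the iteration count, the delay routine and the choice of barrier as the directive under test are illustrative assumptions, not code from that suite): the overhead of a directive is estimated by timing a loop containing the directive, subtracting the time for an identical loop without it, and averaging over the iterations.

  #include <stdio.h>
  #include <omp.h>

  #define ITERS 1000   /* directives timed per measurement (illustrative) */

  /* Artificial work between directives; volatile defeats optimization. */
  static void delay(int n)
  {
      volatile double a = 0.0;
      for (int i = 0; i < n; i++)
          a += i;
  }

  int main(void)
  {
      double t0, t_ref, t_test;

      /* Reference loop: the work alone, with no directive. */
      t0 = omp_get_wtime();
      for (int i = 0; i < ITERS; i++)
          delay(100);
      t_ref = omp_get_wtime() - t0;

      /* Test loop: the same work plus the directive under test. */
      t0 = omp_get_wtime();
      #pragma omp parallel
      {
          for (int i = 0; i < ITERS; i++) {
              delay(100);
              #pragma omp barrier
          }
      }
      t_test = omp_get_wtime() - t0;

      /* Per-directive overhead: the difference, averaged over ITERS. */
      printf("barrier overhead: %g us\n", 1e6 * (t_test - t_ref) / ITERS);
      return 0;
  }

On a cluster OMP implementation the same pattern would also capture the memory consistency traffic incurred at each barrier, which is precisely the cost the micro-benchmark proposed here aims to isolate.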