Micro-benchmarks for Cluster OpenMP Implementations: Memory Consistency Costs



Abstract. The OpenMP memory model allows for a temporary view of shared memory that only needs to be made consistent when barrier or flush directives, including those that are implicit, are encountered. While this relaxed memory consistency model is key to developing cluster OpenMP implementations, it means that the memory performance of any given implementation is greatly affected by which memory is used, when it is used, and by which threads. In this work we propose a micro-benchmark that can be used to measure memory consistency costs and present results from its application to two contrasting cluster OpenMP implementations, comparing these results with data obtained from a hardware-supported OpenMP environment.
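
As an illustration of this model, the following minimal C sketch (a variant of the well-known producer/consumer flush idiom; the variable names are our own) shows how explicit flush directives are required before one thread's writes are guaranteed to appear in another thread's temporary view:

  #include <stdio.h>
  #include <omp.h>

  /* Shared variables: each thread may cache them in its temporary view. */
  int data = 0;
  int flag = 0;

  int main(void)
  {
      #pragma omp parallel num_threads(2)
      {
          if (omp_get_thread_num() == 0) {
              data = 42;
              /* Make the write to data visible before signalling. */
              #pragma omp flush(data, flag)
              flag = 1;
              #pragma omp flush(flag)
          } else {
              int ready = 0;
              while (!ready) {
                  /* Refresh this thread's temporary view of flag. */
                  #pragma omp flush(flag)
                  ready = flag;
              }
              /* Pick up the producer's write to data. */
              #pragma omp flush(data)
              printf("data = %d\n", data);   /* prints 42 */
          }
      }
      return 0;
  }

Between the flushes each thread is free to work from a stale temporary view, and it is exactly this freedom that a cluster implementation can exploit to defer communication; the price is paid whenever consistency must be enforced.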

1 Introduction

Micro-benchmarks are synthetic programs designed to stress and measure the overheads associated with specific aspects of a hardware and/or software system. The information provided by micro-benchmarks is frequently used, for example, to improve the design of the system, to compare the performance of different systems, or to provide input to more complex models that attempt to rationalize the runtime behaviour of a complex application that uses the system.

In the context of the OpenMP (OMP) programming paradigm, significant effort has been devoted to developing various micro-benchmark suites. For instance, shortly after its introduction Bull [3,4] proposed a suite of benchmarks designed to measure the overheads associated with the various synchronization, scheduling and data environment preparation OMP directives. Other notable OMP-related work includes that of Sato et al. [16] and Müller [12]. All existing OMP micro-benchmarks have been developed within the context of an underlying hardware shared memory system. This is understandable given that the vast majority of OMP applications are currently run on hardware shared memory systems, but it is now timely to reassess the applicability of these micro-benchmarks to other OMP implementations.

Specifically, for many years there has been interest in running OMP applications on distributed memory hardware such as clusters [14,8,9,2,7]. Most of these implementations have been experimental and of a research nature, but recently Intel released a commercial product that supports OMP over a cluster – Cluster OpenMP (CLOMP) [6]. This interest, plus the advent of new network technologies that offer exceptional performance and advanced features such as Remote Direct Memory Access (RDMA), stands to make software OMP implementations considerably more common in the future. It is also likely that the division between hardware and software will become increasingly blurred; for example, in recent work Zeffer and Hagersten [19] have proposed a set of simplified hardware primitives for multi-core chips that can be used to support software-implemented inter-node coherence. This paper proposes a micro-benchmark for measuring memory consistency costs and applies it to two contrasting cluster OpenMP implementations.
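
To make the measurement strategy concrete, the following sketch follows the spirit of Bull's overhead benchmarks (the iteration count, the delay routine and the choice of barrier as the directive under test are illustrative assumptions, not code from that suite): the overhead of a directive is estimated by timing a loop containing the directive, subtracting the time for an identical loop without it, and averaging over the iterations.

  #include <stdio.h>
  #include <omp.h>

  #define ITERS 1000   /* directives timed per measurement (illustrative) */

  /* Artificial work between directives; volatile defeats optimization. */
  static void delay(int n)
  {
      volatile double a = 0.0;
      for (int i = 0; i < n; i++)
          a += i;
  }

  int main(void)
  {
      double t0, t_ref, t_test;

      /* Reference loop: the work alone, with no directive. */
      t0 = omp_get_wtime();
      for (int i = 0; i < ITERS; i++)
          delay(100);
      t_ref = omp_get_wtime() - t0;

      /* Test loop: the same work plus the directive under test. */
      t0 = omp_get_wtime();
      #pragma omp parallel
      {
          for (int i = 0; i < ITERS; i++) {
              delay(100);
              #pragma omp barrier
          }
      }
      t_test = omp_get_wtime() - t0;

      /* Per-directive overhead: the difference, averaged over ITERS. */
      printf("barrier overhead: %g us\n", 1e6 * (t_test - t_ref) / ITERS);
      return 0;
  }

On a cluster OMP implementation the same pattern would also capture the memory consistency traffic incurred at each barrier, which is precisely the cost the micro-benchmark proposed here aims to isolate.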