Scalability of Partial Differential Equations Preconditioner Resilient to Soft and Hard Faults




Abstract. We present a resilient domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. We discuss an implementation based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Servers are assumed to be “sandboxed”, while no assumption is made on the reliability of the clients. We explore the scalability of the algorithm up to ∼12k cores, build an SST/macro skeleton to extrapolate to ∼50k cores, and show the resilience under simulated hard and soft faults for a 2D linear Poisson equation.
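To make the server-client division of labor concrete, the sketch below simulates the pattern described in the abstract: a server holds all state (the task queue and the accepted samples), while clients are stateless computational units whose failures or corrupted returns the server absorbs by re-enqueueing work. The function name, parameters, and the crude sanity filter are illustrative assumptions; the filter merely stands in for the paper's actual resilient data-manipulation step.

```python
import random

def run_resilient_samples(tasks, solve, fail_prob=0.0,
                          corrupt_prob=0.0, seed=0):
    """Toy server loop: dispatch sampling tasks to stateless clients.

    All state lives on the 'server' side (queue, results). A client that
    suffers a hard fault simply never reports back, so its task is
    re-enqueued; a client that suffers a soft fault returns a corrupted
    value, which a sanity filter rejects before the solution update.
    """
    rng = random.Random(seed)
    queue = list(tasks)   # server-side state: pending sampling tasks
    results = {}          # server-side state: accepted samples
    while queue:
        task = queue.pop(0)
        if rng.random() < fail_prob:      # hard fault: client lost
            queue.append(task)            # server just resubmits
            continue
        value = solve(task)               # client-side computation
        if rng.random() < corrupt_prob:   # soft fault: silent corruption
            value = value + 1e6
        # Crude bound check standing in for the paper's resilient
        # data-manipulation step on the collected samples.
        if abs(value) > 1e3:
            queue.append(task)
            continue
        results[task] = value
    return results
```

Even with 30% client loss and 20% corrupted returns, the server eventually collects one valid sample per task; only the time to completion grows, not the answer.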

1 Introduction

As computing platforms evolve towards exascale, several key challenges are arising related to resiliency, energy consumption, memory access, concurrency, and heterogeneous hardware [1,6,7,10,11]. There is no consensus or clear idea yet of what a “typical” exascale architecture might look like [1]. One of the main concerns is understanding how the hardware will affect future computing systems in terms of reliability, communication, and computational models, and which of these will emerge as the main reference for exascale. Exascale simulations are expected to rely on thousands of CPU cores running up to a billion threads [6,7]. This framework will lead to systems with a large number of components and a large communication cost for data exchange. The presence of many components and the increasing complexity of these systems, e.g. more and smaller transistors and lower voltages, can become a liability in terms of system faults. Exascale systems are expected to suffer from errors and faults more frequently than current petascale systems [6,7]. Current parallel programming models and implementations will require a resilient infrastructure to complete simulations correctly across many cores in reasonable amounts of time.

(K. Morris et al. — the authors are US Government employees and transfer the rights to the extent transferable; Title 17 U.S.C. §105 applies. © Springer International Publishing Switzerland 2016, outside the US. J.M. Kunkel et al. (Eds.): ISC High Performance 2016, LNCS 9697, pp. 469–485, 2016. DOI: 10.1007/978-3-319-41321-1_24)

In general, system faults can be grouped into two main categories, namely hard and soft faults [6,16]. Hard faults can cause partial or full computing nodes to fail, or the network to crash; these faults have an evident impact on the run and the system itself. Soft errors, on the other hand, are more subtle because some of them can go undetected, e.g. in the case of silent data corruption (SDC): their effect is simply to alter information where it is stored, transmitted, or processed. The key feature of silent errors is that, when undetected, the application has no opportunity to directly recover from the fault when it occurs. Currently, application checkpoint-restart is the most
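To illustrate how silent data corruption alters stored information without any error being raised, the toy snippet below flips a single bit in the IEEE-754 representation of a double. The program continues to run normally; only the value has changed, and by how much depends entirely on which bit was hit. The helper name is, of course, just for this illustration.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its IEEE-754 binary64 representation
    flipped -- a toy model of an SDC event in memory."""
    # Reinterpret the double as a 64-bit unsigned integer, flip the
    # requested bit, and reinterpret back as a double.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

# A flip in the low mantissa bits perturbs 1.0 by ~2^-52: practically
# invisible. A flip in the exponent (bit 52) halves the value outright.
tiny_change = flip_bit(1.0, 0)
big_change = flip_bit(1.0, 52)   # 1.0 -> 0.5, silently
```

In neither case does the runtime report anything: without an application-level detection mechanism, the corrupted value simply propagates through the rest of the computation.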