Fault Recovery Methods for Asynchronous Linear Solvers


Evan Coleman1,2 · Erik J. Jensen1 · Masha Sosonkina1

Received: 25 June 2019 / Accepted: 24 August 2020
© This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2020

Abstract
This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. A theoretical investigation into the performance of asynchronous iterative methods is presented and used to motivate several fault recovery methods for asynchronous linear solvers. The numerical experiments use a hybrid-parallel implementation in which the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP. A series of runs is conducted to measure both the impact of soft faults and the effectiveness of the recovery methods. Trials compare two models for simulating the occurrence of a fault, as well as techniques for recovering from the effects of a fault. The results show that the proposed strategies can effectively recover from the impact of a fault, and that the numerical model for simulating soft faults consistently produces fault effects that enable the investigation and tuning of recovery techniques in action.

Keywords Soft fault · Fault tolerance · Fault models · Asynchronous iteration · Linear system solver

1 Introduction

In high performance computing (HPC) environments, algorithms need to be resilient to faults. On future platforms, the rate at which faults occur is expected to increase dramatically [12], which makes the development of fault-resilient algorithms of paramount importance, especially on the road towards exascale.

Evan Coleman [email protected]

1 Computational Modeling and Simulation Engineering, Old Dominion University, Norfolk, VA, USA

2 Strategic and Computing Systems Department, Naval Surface Warfare Center, Dahlgren Division, Dahlgren, VA, USA

International Journal of Parallel Programming

Faults can broadly be divided into two categories: hard faults and soft faults [10]. Hard faults cause immediate program interruption and typically come from negative effects on the physical hardware components of the system or on the operating system itself. Soft faults are all faults that do not directly cause the executing program to stop, although an interruption may still occur as a consequence of their impact. Transient soft faults are typically caused by solitary bit flips, which can result from events such as radiation, hardware malfunction, or an incorrectly set data cache. The most important aspect of recovering from a soft fault is successful fault detection. However, detection is often difficult in the case of a soft fault because, although the fault corrupts data, it does not directly interrupt the flow of the iterative process. Many detection techniques rely on choosing an appropriate tolerance to check
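The effect of a solitary bit flip on an iterative solver, as described above, can be illustrated with a minimal sketch. It assumes a plain synchronous Jacobi sweep in pure Python rather than the asynchronous MPI/OpenMP implementation studied in the paper. The helper names `flip_bit` and `jacobi`, the `fault` tuple, and the 3 x 3 test system are hypothetical choices made only for illustration. The sketch flips the sign bit of one component of the iterate; flipping a high exponent bit can instead produce an infinity or NaN, which this sketch does not handle.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of the IEEE 754 double encoding of value (0 = lowest, 63 = sign)."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi(A, b, tol=1e-10, max_iter=10_000, fault=None):
    """Synchronous Jacobi iteration starting from the zero vector.

    fault is an optional (iteration, component, bit) tuple (an assumption of
    this sketch) that injects a single bit flip into the new iterate.
    Returns the final iterate and the number of iterations performed.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        # Simulated transient soft fault: corrupt one entry, then keep iterating.
        if fault is not None and k == fault[0]:
            x_new[fault[1]] = flip_bit(x_new[fault[1]], fault[2])
        # The residual is evaluated on the (possibly corrupted) iterate.
        residual = max(
            abs(b[i] - sum(A[i][j] * x_new[j] for j in range(n))) for i in range(n)
        )
        x = x_new
        if residual < tol:
            return x, k
    return x, max_iter


# Hypothetical diagonally dominant test system with exact solution [1, 2, 3].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

x_clean, it_clean = jacobi(A, b)
x_fault, it_fault = jacobi(A, b, fault=(5, 1, 63))  # sign flip at iteration 5
print("iterations without fault:", it_clean)
print("iterations with fault:   ", it_fault)
```

In this sketch the program is never interrupted: the corrupted iterate is simply carried into the next sweep, and the fault shows up only as additional iterations before the residual tolerance is met.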
