Fault Recovery Methods for Asynchronous Linear Solvers
Evan Coleman (1,2) · Erik J. Jensen (1) · Masha Sosonkina (1)
Received: 25 June 2019 / Accepted: 24 August 2020
© This is a U.S. Government work and not under copyright protection in the US; foreign copyright protection may apply 2020
Abstract
This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. A theoretical investigation into the performance of asynchronous iterative methods is presented and used to motivate several fault recovery methods for asynchronous linear solvers. The numerical experiments use a hybrid-parallel implementation in which the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP. A series of runs is conducted to measure both the impact of soft faults and the effectiveness of the recovery methods. Trials compare two models for simulating the occurrence of a fault, as well as techniques for recovering from its effects. The results show that the proposed strategies can effectively recover from the impact of a fault and that the numerical model for simulating soft faults consistently produces fault effects that enable the investigation and tuning of recovery techniques in action.
Keywords Soft fault · Fault tolerance · Fault models · Asynchronous iteration · Linear system solver
1 Introduction
In high performance computing (HPC) environments, the rate at which faults occur is expected to increase dramatically on future platforms [12]. Developing algorithms that are resilient to faults is therefore of paramount importance, especially on the road towards exascale.
Evan Coleman [email protected]
(1) Computational Modeling and Simulation Engineering, Old Dominion University, Norfolk, VA, USA
(2) Strategic and Computing Systems Department, Naval Surface Warfare Center, Dahlgren Division, Dahlgren, VA, USA
International Journal of Parallel Programming
Faults can broadly be divided into two categories: hard faults and soft faults [10]. Hard faults cause immediate program interruption and typically stem from failures of the physical hardware components of the system or of the operating system itself. Soft faults are all faults that do not directly stop the executing program, although the program may still be interrupted as a consequence of their impact. Transient soft faults are typically caused by solitary bit flips, which occur due to events such as radiation, hardware malfunction, or an incorrectly set data cache. The most important aspect of recovering from a soft fault is successful fault detection. However, detection is often difficult because a soft fault, though it corrupts data, does not directly interrupt the flow of the iterative process. Many detection techniques rely on choosing an appropriate tolerance to check