Analysis of parallel application checkpoint storage for system configuration



Betzabeth León¹ · Daniel Franco¹ · Dolores Rexachs¹ · Emilio Luque¹

Accepted: 30 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

The use of fault tolerance strategies such as checkpointing is essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can impact the performance and scalability of parallel applications that use message passing. In the present work, we study the elements that can impact checkpoint storage and how they can influence the scalability of an application with fault tolerance. We have designed a methodology based on predicting the size of the checkpoint when the number of processes, the application workload, or the mapping varies, using a reduced number of resources. By following this methodology, the system administrator can decide on the appropriate number of processes and nodes, adjusting the process mapping in applications that use checkpoints.

Keywords: Fault tolerance · Checkpoint · Scalability · HPC systems · MPI application
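The prediction step sketched in the abstract, measuring checkpoint storage on a reduced number of resources and extrapolating to a larger configuration, can be illustrated with a minimal example. The linear growth model, the `fit_linear` helper, and all numbers below are illustrative assumptions, not the authors' actual method or measurements:

```python
# Hypothetical sketch (not the paper's method): predict total checkpoint
# storage from a few small-scale measurements, assuming size grows roughly
# linearly with the number of processes for a fixed workload.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Measurements taken on a reduced number of resources (made-up numbers):
procs = [4, 8, 16]          # number of MPI processes
ckpt_gb = [2.1, 4.0, 7.9]   # total checkpoint size per checkpoint, in GB

a, b = fit_linear(procs, ckpt_gb)

# Extrapolate to a target configuration before committing nodes to it.
target = 64
predicted = a + b * target
print(f"predicted checkpoint size for {target} processes: {predicted:.1f} GB")
```

An administrator could use such an estimate to judge whether the I/O system can absorb the checkpoint traffic of a planned configuration before running it at full scale.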

* Betzabeth León
  [email protected]

  Daniel Franco
  [email protected]

  Dolores Rexachs
  [email protected]

  Emilio Luque
  [email protected]

¹ Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, 08193 Bellaterra, Barcelona, Spain





1 Introduction

Systems with long execution times require fault tolerance, and checkpointing is a widely used technique to provide it in such environments. The longer a large-scale system runs, the higher the probability that it will experience failures; accordingly, it must be checkpointed frequently. Parallel message passing applications are used in these distributed memory systems.

In HPC systems, checkpoints must periodically write large volumes of data to capture the current state of the applications, which they compute and control in stages at regular intervals. The checkpointing operation is an I/O-intensive write operation that can be executed on a large number of computing nodes (from now on, we will refer to them as nodes), which would generate thousands of files. This requires continuous interaction with the storage system and consequently occupies a large amount of space, on the order of terabytes of data. The checkpoint can therefore easily collapse the I/O system.

For strategies such as checkpoints to be useful at large scale, the normal execution of the application should be affected as little as possible. Reducing this costly storage in high-performance systems is one way to reduce the overhead caused by these fault tolerance schemes. With respect to the applications and their ability to scale, it is necessary that when increasing the number of resources, the