A Novel Fault-Tolerant Parallel Algorithm
Abstract. The mean time between failures of current high-performance computer systems is much shorter than the running times of many computational applications, yet those applications are the main workload for those systems. Checkpoint/restart is currently the most common scheme for such applications to tolerate hardware failures, but its performance degrades as the number of processors grows large. In this paper, we propose a novel fault-tolerant parallel algorithm, FPAPR. First, we introduce the basic idea of FPAPR. Second, we explain how to implement an FPAPR program, using two NPB kernels as examples. Third, we analyze the overhead of FPAPR theoretically and find that it decreases as the number of processors increases. Finally, experimental results on a 512-CPU cluster show that the overhead introduced by the algorithm is very small.

Keywords: high-performance computing, fault tolerance, parallel algorithm.
1 Introduction
High-performance computing systems are being built from more and more processors. The fastest, IBM Blue Gene/L, has 131,072 processors, and even the smallest system in the Top10 has 9,024 processors. However, as the complexity of a computer system increases, its reliability deteriorates drastically. For example, if the reliability of each individual component is 99.999%, then a system consisting of 100,000 non-redundant components has a reliability of only $(99.999\%)^{100000} = 36.79\%$. Such low reliability is unacceptable for most applications.

A critical issue for machines of this scale is the mean time between failures. Projecting from existing supercomputers, a 100,000-processor supercomputer could see a failure every few minutes, while the applications running on such machines are typically compute-intensive and long-running. The ability to deal with processor failures is therefore essential for systems built from such a large number of processors.
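To make the arithmetic concrete, the following Python sketch reproduces the reliability calculation above (the 99.999% per-component figure and the 100,000-component system are the paper's own example):

```python
# Reliability of a non-redundant system: every component must work,
# so system reliability is the product of the component reliabilities.
component_reliability = 0.99999      # 99.999% per component (paper's figure)
num_components = 100_000             # paper's example system size

system_reliability = component_reliability ** num_components
print(f"system reliability: {system_reliability:.2%}")   # prints ~36.79%
```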
Today, applications typically deal with process failures by writing checkpoints periodically [7,3]: all processes save their computation state to a storage server at intervals during execution. Even when no failure occurs, this overhead is unavoidable. If a fault occurs, all processes are forced to stop and the job is reloaded from the last checkpoint. For a large-scale system, checkpoint/restart may therefore not use resources effectively.

In this paper, we propose a novel Fault-tolerant Parallel Algorithm based on Parallel Recomputing (FPAPR for short). If there is no failure, the overhead of FPAPR is insignificant; if a failure occurs, the overhead of FPAPR decreases as the number of processors increases.

The rest of the paper is organized as follows. Section 2 introduces related work.
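The cost of the checkpoint/restart scheme described above can be made concrete with a small back-of-the-envelope model. The following Python sketch is illustrative only: the cost model (a per-checkpoint write cost plus, per failure, a restart cost and on average half an interval of lost work) and all parameter values are our assumptions, not figures from the paper.

```python
# Illustrative cost model for periodic checkpoint/restart (parameter
# values are assumptions for illustration, not figures from the paper).

def expected_overhead(work, interval, ckpt_cost, restart_cost, mtbf):
    """Expected overhead (seconds) added to `work` seconds of computation.

    A checkpoint is written every `interval` seconds at a cost of
    `ckpt_cost` seconds; each failure costs `restart_cost` seconds plus,
    on average, half an interval of recomputed work.
    """
    n_ckpts = work / interval
    ckpt_overhead = n_ckpts * ckpt_cost          # paid even without failures
    wall_time = work + ckpt_overhead
    n_failures = wall_time / mtbf                # expected failure count
    failure_overhead = n_failures * (restart_cost + (interval + ckpt_cost) / 2)
    return ckpt_overhead + failure_overhead

# Example: 24 h of work, hourly checkpoints costing 5 min each,
# a 10 min restart, and a system-wide MTBF of 12 h.
overhead = expected_overhead(24 * 3600, 3600, 300, 600, 12 * 3600)
print(f"expected overhead: {overhead / 3600:.1f} h")
```

Note that both terms grow with system size: every process writes state even when nothing fails, and every failure rolls the whole job back. This is the limitation that FPAPR targets.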