Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Priti Kumari1 · Parmeet Kaur1

Accepted: 5 November 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Cloud computing offers seemingly unlimited resources and a suitable environment for executing large-scale computing applications. However, it is also susceptible to frequent failures, which can adversely affect both users and service providers. Fault tolerance techniques are therefore necessary for the reliable execution of applications in the cloud. This work presents checkpointing-based fault tolerance protocols for two types of distributed applications. The first type is Bag of Tasks (BoT) applications, in which an application comprises a set of independent tasks that do not communicate with each other during execution. Because the tasks are independent, an uncoordinated checkpointing algorithm is proposed for fault tolerance of BoT applications. The second type is large-scale distributed applications composed of multiple tasks that depend on each other through inter-task message passing. An uncoordinated checkpointing and message logging protocol is presented for this type of application. The proposed protocols use storage at edge switches in a data center to reduce the bandwidth consumed in saving checkpoints and message logs. Simulation results demonstrate that the proposed protocols provide a higher rate of successful recovery from failures and incur lower resource overhead than other contemporary and related schemes.

Keywords: Cloud computing · Fault tolerance · Checkpointing · Message logging · Rollback recovery · BoT application · Distributed application

* Parmeet Kaur · Priti Kumari
1 Department of CSE/IT, Jaypee Institute of Information Technology, Noida, India

1 Introduction
Checkpointing and rollback recovery (CRR) is an established technique for fault tolerance in distributed systems. It involves periodically saving a consistent state of the system as a checkpoint in stable storage. This state can be restored after a process failure so that the system recovers successfully [1]. Checkpointing protocols ensure that a failure does not result in the complete loss of execution already performed: a process can resume execution from the last saved state instead of rolling back to the initial state. CRR-based fault tolerance methods have been frequently employed in distributed systems, in both static and dynamic environments [1, 2]. Recently, research interest in supporting application execution in the cloud with checkpointing-based fault tolerance has increased [3], owing to the frequent failures encountered in cloud computing [1]. Cloud computing provides flexible, on-demand hardware and software as a service to users. This computing paradigm provides a view of unlimited resources and hence can be used for executing larg