Tails in the cloud: a survey and taxonomy of straggler management within large-scale cloud data centres
Sukhpal Singh Gill1 · Xue Ouyang2 · Peter Garraghan3
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Cloud computing systems split compute- and data-intensive jobs into smaller tasks and execute them in parallel across clusters to reduce execution time. However, at increasing scale such systems are exposed to stragglers: abnormally slow tasks within a job that substantially delay job completion. Stragglers are a direct threat to the fast execution of data-intensive jobs in cloud computing. Researchers have proposed an assortment of mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as of proposed management and mitigation techniques, based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions for future work in straggler research.
Keywords Computing · Stragglers · Cloud computing · Straggler management · Distributed systems · Cloud data centres
* Sukhpal Singh Gill [email protected]
Xue Ouyang [email protected]
Peter Garraghan [email protected]
1 School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
2 School of Electronic Sciences, National University of Defense Technology, Changsha, China
3 School of Computing and Communications, Lancaster University, Lancaster, UK
1 Introduction and motivation
Applications spanning domains such as social networks, e-commerce sites, and healthcare now generate vast quantities of data. The growing velocity and volume of this data requires substantial computing capacity to store and process it effectively [1]. The large-scale computing systems that provide this capacity, encompassing data centre clusters, comprise hundreds to thousands of interconnected machines that underpin the applications used by businesses and consumers alike. A combination of increasing application demand and technological innovation has pushed system scale to the region of tens of thousands of servers within an individual cluster [2]. However, such scale has in turn increased the complexity of these systems, manifesting in the form of emergent phenomena whereby system operation exhibits behaviour unforeseen at design time. Such emergent phenomena within large-scale cloud data centres have been observed to negatively impact application performance. One such phenomenon, known as the long-tail problem, is characterized by a