Job scheduler for streaming applications in heterogeneous distributed processing systems


Ali Al‑Sinayyid¹ · Michelle Zhu²

* Ali Al‑Sinayyid
  [email protected]

¹ Southern Illinois University Carbondale, Carbondale, IL, USA
² Montclair State University, Montclair, NJ, USA

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
In this study, we investigated the problem of scheduling streaming applications in a heterogeneous cluster environment and, building on our previous work, developed the maximum throughput scheduler algorithm (MT-Scheduler) for streaming applications. The proposed algorithm uses a dynamic programming technique to efficiently map the application topology onto the heterogeneous distributed system based on computing and data transfer requirements, while also taking into account the capacity of the underlying cluster resources. The proposed approach maximizes system throughput by identifying and minimizing the time incurred at the computing/transfer bottleneck. The MT-Scheduler supports scheduling applications structured as directed acyclic graphs. We conducted experiments using three Storm microbenchmark topologies in both simulated and real Apache Storm environments. For the performance evaluation, we compared the proposed MT-Scheduler with the simulated round robin and the default Storm scheduler algorithms. The results indicated that the MT-Scheduler outperforms the default round robin approach in terms of both average system latency and throughput.

Keywords Apache Storm · Data stream · Distributed systems · Heterogeneous scheduling · DAG scheduling
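The following is a minimal, illustrative sketch (not the authors' MT-Scheduler implementation) of the throughput model implied above: for a candidate mapping of a DAG's operators onto heterogeneous nodes, the achievable throughput is roughly the reciprocal of the slowest computing or transfer step, so a scheduler that minimizes that bottleneck time maximizes throughput. All class, field, and value names here (Task, Edge, workPerTuple, nodeSpeed, and the toy topology) are assumptions made for illustration.

import java.util.List;

/**
 * Minimal illustrative sketch (not the authors' implementation): estimate the
 * throughput of one candidate mapping of a streaming DAG onto heterogeneous
 * nodes. The achievable throughput is roughly the reciprocal of the slowest
 * computing or transfer step, so minimizing that bottleneck maximizes throughput.
 */
public class BottleneckEstimate {

    // Hypothetical operator: compute load per tuple and the speed of its assigned node.
    record Task(String name, double workPerTuple, double nodeSpeed) {
        double computeTime() { return workPerTuple / nodeSpeed; }
    }

    // Hypothetical stream edge: bytes shipped per tuple over a link of given bandwidth.
    record Edge(String from, String to, double bytesPerTuple, double bandwidth) {
        double transferTime() { return bytesPerTuple / bandwidth; }
    }

    /** Throughput estimate = 1 / (largest per-tuple computing or transfer time). */
    static double estimateThroughput(List<Task> tasks, List<Edge> edges) {
        double bottleneck = 0.0;
        for (Task t : tasks) bottleneck = Math.max(bottleneck, t.computeTime());
        for (Edge e : edges) bottleneck = Math.max(bottleneck, e.transferTime());
        return bottleneck > 0 ? 1.0 / bottleneck : Double.POSITIVE_INFINITY;
    }

    public static void main(String[] args) {
        // Toy three-operator pipeline (spout -> parse -> aggregate) on two nodes.
        List<Task> tasks = List.of(
                new Task("spout", 1.0, 4.0),
                new Task("parse", 3.0, 2.0),          // slowest compute step: 1.5 time units
                new Task("aggregate", 1.5, 4.0));
        List<Edge> edges = List.of(
                new Edge("spout", "parse", 800, 1000),
                new Edge("parse", "aggregate", 400, 1000));
        System.out.printf("Estimated throughput: %.2f tuples per unit time%n",
                estimateThroughput(tasks, edges));
    }
}

Under these assumptions, the dynamic programming search described in the abstract can be read as exploring candidate mappings and retaining the one whose bottleneck time is smallest.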

1 Introduction

At present, we live in the big data era, in which a variety of applications such as stock trading, banking systems, healthcare databases, IoT sensors, and social media networks [1] generate colossal amounts of real-time data.

Distributed data stream processing systems (DDSPSs) typically compute such unbounded streams of data in real time, and they are dynamic in terms of their resource capacities [2, 3]. To keep pace with this continuous data generation from streaming applications, the underlying distributed processing systems must perform prompt yet efficient management and analysis, especially in the case of heterogeneous systems [4, 5]. One of the key objectives when scheduling streaming applications is to maximize the frame rate, which corresponds to the number of dataset instances that can be processed per unit time. To achieve this goal, the scheduling algorithm must consider data locality, resource heterogeneity, and communication and computation latencies. Data locality and location-awareness factors arise due to the high data transfer latency in cases where the data sources reside in distant DDSPSs [6], which can negatively impact system performance [7]. Researchers have addressed this problem by performing the computation as close as possible to the data source [8]. An efficient mapping strategy should