Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment

ORIGINAL RESEARCH

Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment

Rathinaraja Jeyaraj¹ · V. S. Ananthanarayana¹ · Anand Paul²

Received: 15 April 2019 / Accepted: 7 January 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

Big data has overwhelmed industries and research sectors alike. Reliable decision making is a persistent challenge that requires cost-effective big data processing tools. Hadoop MapReduce is widely used to store and process huge volumes of data in a distributed environment. However, because of the large capital investment and the expertise required to set up an on-premise Hadoop cluster, big data users increasingly seek cloud-based MapReduce services over the Internet. Typically, MapReduce on a cluster of virtual machines (VMs) is offered as a service on a pay-per-use basis. The VMs of a MapReduce virtual cluster reside on different physical machines and are co-located with other, non-MapReduce VMs. They therefore share I/O resources such as disk and network bandwidth, which leads to congestion, since most MapReduce jobs are disk- and network-intensive. In particular, the shuffle phase of the MapReduce execution sequence consumes a large amount of network bandwidth in a multi-tenant environment, increasing both job latency and bandwidth cost. It is therefore essential to minimize the amount of intermediate data in the shuffle phase rather than to provision more network bandwidth, which raises the service cost. With this objective, we extend the multi-level per-node combiner to a batch of MapReduce jobs to improve makespan. We observed that makespan improves by up to 32.4% when the amount of intermediate data in the shuffle phase is minimized, compared to classical schedulers with default combiners.

Keywords MapReduce job scheduling · Combiner · Bandwidth minimization
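The role a combiner plays in shrinking shuffle traffic can be illustrated with a minimal word-count sketch in plain Python. This is not the paper's multi-level per-node combiner, only an illustration of the underlying idea: pre-aggregating map output locally before the shuffle reduces the number of intermediate records that must cross the network. The function names (`map_phase`, `combine`) and the sample input are illustrative assumptions.

```python
from collections import Counter

def map_phase(lines):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Per-node combiner: pre-aggregate counts locally before the shuffle."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return list(counts.items())

lines = ["big data big cluster", "big data shuffle"]
raw = map_phase(lines)       # 7 intermediate records would cross the network
combined = combine(raw)      # 4 records remain after local aggregation
print(len(raw), len(combined))  # prints "7 4"
```

Because word count's reduce function is associative and commutative, applying it early at each node changes only the volume of intermediate data, not the final result; the paper's contribution is to exploit this across a batch of jobs rather than within a single job.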

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s12652-020-01707-7) contains supplementary material, which is available to authorized users.

* Anand Paul [email protected]
Rathinaraja Jeyaraj [email protected]
V. S. Ananthanarayana [email protected]

1 Department of IT, National Institute of Technology Karnataka, Mangalore, India
2 School of Computer Science and Engineering, Kyungpook National University, 80-Daehakro, Daegu, South Korea

1 Introduction

Today, it is not possible to make reliable decisions without past data, which is growing ever larger in size and is called big data. Regardless of the size of industries and research sectors, big data has affected them tremendously and demands increased processing capabilities. Many big data processing tools are emerging, among which Hadoop MapReduce (Dean and Ghemawat 2004) is a cost-effective batch processing tool that is widely available online as open source. However, it is not feasible for small-scale businesses and other entities to set up an on-premise Hadoop cluster because of the huge capital expenditure. Therefore, Hadoop MapReduce is offered as-a-service (Guo et al. 2017a, b