Performance Analysis of Scheduling Algorithms in Apache Hadoop

Applications involving big data need enormous memory space to load the data and high processing power to execute them. Individually, traditional computing systems are not sufficient to execute these big data applications, but cumulatively they can be.



1 Introduction

A wide variety of applications generate enormous amounts of data every day, far beyond our imagination. Knowingly or unknowingly, we all generate or work with big data. Moreover, data these days is no longer static in nature; it is highly dynamic. The challenge therefore lies not only in processing big data but also in storing, transmitting, and securing it. Big data applications thus open a new door for technology and for the betterment of humanity.

The term big data refers to colossal and complex sets of data that cannot be processed in a traditional way [1]. Apache Hadoop [2] is the most suitable open-source ecosystem for processing big data in a distributed manner, and Google's MapReduce [3] is the programming framework proposed for big data processing under the umbrella of Hadoop. Hadoop is not just software; it is a framework of tools for processing and analyzing big data. Big data not only demands faster processing, but it also demands better analysis, security, authenticity, scalability, and more. One of the important aspects of any processing tool is how fast it can process data. MapReduce satisfies most big data processing demands, such as scalability, fault tolerance, faster processing, and optimization. However, MapReduce has some limitations with respect to performance and efficiency on big data, and many researchers and industries have worked on overcoming these limitations of the MapReduce model.

The goal of this study is to measure the performance of various scheduling algorithms on different big data applications. The paper discusses big data processing performed using the


Hadoop/MapReduce model. The study aims to identify the better scheduling model depending upon the big data application that needs to be processed.
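To make the MapReduce programming model concrete, the sketch below shows the canonical word-count map and reduce functions written against the standard org.apache.hadoop.mapreduce API. The example is illustrative only and is not taken from the paper: the map phase emits a (word, 1) pair per token, and the reduce phase sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Canonical word-count example: map emits (word, 1) for every token,
// reduce sums the counts for each word.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);     // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();             // accumulate counts for this word
      }
      result.set(sum);
      context.write(key, result);     // emit (word, total)
    }
  }
}
```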

2 Hadoop Ecosystem

Hadoop is open-source software comprising a framework of tools. These tools provide support for executing big data applications. Hadoop has a very simple architecture. Hadoop 2.0 primarily consists of three components, as shown in Fig. 1:

1. HDFS (Hadoop Distributed File System): provides distributed storage of data over the Hadoop environment. It stores data and metadata separately.
2. YARN (Yet Another Resource Negotiator): responsible for managing the resources of the Hadoop cluster.
3. MapReduce: the programming model, running on top of YARN, responsible for processing data in the Hadoop environment. It performs the computation.

A sketch of how these three components cooperate from a client's point of view follows this list.
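The driver sketch below illustrates how a client touches all three components when submitting a job: the input and output live in HDFS, YARN allocates the containers, and MapReduce performs the computation. The class name JobDriver and the HDFS paths are hypothetical, and the mapper and reducer are assumed to be the word-count classes sketched above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // YARN: the job is submitted to the ResourceManager, which
    // allocates containers on the cluster's NodeManagers.
    conf.set("mapreduce.framework.name", "yarn");

    Job job = Job.getInstance(conf, "scheduling-demo");
    job.setJarByClass(JobDriver.class);

    // MapReduce: the computation itself (the word-count classes
    // sketched above are assumed to be on the classpath).
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // HDFS: input splits are read from, and results written back to,
    // the distributed file system (paths are placeholders).
    FileInputFormat.addInputPath(job, new Path("hdfs:///user/demo/input"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/demo/output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

How YARN arbitrates cluster resources among such jobs is precisely what the scheduling algorithms compared in this study control.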

2.1 HDFS

Hadoop HDFS has a master/slave architecture. The master node has two components, called the resource manager and the namenode. Each slave node of the cluster has a node manager and a datanode.
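A minimal client-side sketch of this division of labor, assuming a hypothetical namenode address and file paths: metadata requests are answered by the namenode alone, while file contents are streamed from the datanodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS points at the namenode, which holds only metadata;
    // the actual file blocks are served by the datanodes.
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
    FileSystem fs = FileSystem.get(conf);

    // Metadata operation: answered entirely by the namenode.
    for (FileStatus status : fs.listStatus(new Path("/data"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }

    // Read operation: the namenode returns block locations, then the
    // client streams the block contents directly from the datanodes.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
    fs.close();
  }
}
```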