Improving the Map and Shuffle Phases in Hadoop MapReduce
Abstract Massive amounts of data need to be processed, and their analysis has become a challenging issue for network-centric applications in data management. Advanced tools are required for processing and analyzing such data sets. As a proficient parallel computing programming model, MapReduce on Hadoop is employed for large-scale data analysis applications. However, MapReduce still suffers from performance problems, and it relies on the shuffle phase as a central element of its logical I/O strategy. The map phase requires an improvement in its performance because its output acts as the input to the next phase and its result determines overall efficiency; the map phase therefore needs intermediate checkpoints that regularly monitor all the splits generated by intermediate phases. The MapReduce model is designed so that a job must wait until all maps have accomplished their given tasks, which acts as a barrier to effective resource utilization. This paper implements shuffle as a service component to decrease the overall execution time of jobs, monitors the map phase through skew handling, and increases resource utilization in a cluster.

Keywords MapReduce · Data analytics · HDFS · Hadoop · Shuffle · Big data
1 Introduction

The objective of data analytics is inspecting, cleansing, transforming, and modeling data in order to extract useful information, suggest conclusions, and support decision making [1]. Data analysis has many facets and approaches, appearing under diverse names in business, science, and social science fields [2]. Big data is a particular branch of data analysis that focuses on analyzing huge data sets which arise from various fields of data-intensive informatics data
centers [3]. Big data typically comprises data sets of such massive volume that they exceed the ability of traditional software tools to analyze, handle, and process the data [4]. Programs written in the MapReduce functional style are automatically parallelized and executed on a large cluster of commodity machines [5, 6]. During program execution, the runtime system takes care of partitioning the input into splits and scheduling the many resulting operations, such as executing tasks across a set of machines, managing failures, and handling inter-machine communication [7]. The crucial drawback is exhibited in Hadoop's performance, which affects the whole cluster. The significant limitations of Hadoop are outlined below, and a sketch of a standard job follows the list to make the phase boundaries concrete:

(1) Distinct phases are merged into a single task: the execution of the reduce function is CPU intensive and memory intensive, as it must merge the map task data and produce the final outcome.
(2) Random I/O requests affect the shuffle phase: the task tracker receives plenty of I/O read requests. Each request w
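To make the map, shuffle, and reduce boundaries concrete, here is a minimal sketch of a standard Hadoop word-count job in Java using the org.apache.hadoop.mapreduce API. The class names are illustrative and the job is not taken from the paper; it only shows where the phases discussed above sit. Every intermediate (word, 1) pair emitted by the map phase must be fetched, partitioned, and merge-sorted by the shuffle before any reduce call can run, which is where the task tracker's I/O read requests in point (2) originate.

```java
// Minimal Hadoop word-count sketch (illustrative names; Hadoop 2.x+ mapreduce API).
// Map emits (word, 1); the framework shuffles and sorts all pairs by key;
// reduce then merges the values for each word.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Map phase: split each input line and emit (word, 1).
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE); // intermediate output, shuffled to reducers
        }
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Reduce phase: runs only after the shuffle has fetched and
      // merge-sorted every map output for this key.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // shrinks data the shuffle must move
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The combiner line illustrates one standard lever on the problem described in point (2): pre-aggregating map output locally reduces the volume of intermediate data that the shuffle phase must read and transfer.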
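The skew handling mentioned in the abstract can be illustrated in the same API. The following is a hedged, hypothetical sketch, not the paper's implementation: a custom Partitioner that spreads an assumed hot key over a small band of reducers so that a single reduce task does not receive a disproportionate share of the shuffled data. The class name SkewAwarePartitioner, the hot key, and the fan-out constant are all assumptions made for illustration.

```java
// Hypothetical skew-mitigation sketch, not the paper's implementation:
// scatter one known hot key across several reducers to even out partitions.
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
  private static final String HOT_KEY = "the"; // assumed hot key, for illustration
  private static final int HOT_KEY_FANOUT = 4; // spread the hot key over 4 reducers
  private final Random random = new Random();

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (numPartitions > HOT_KEY_FANOUT && key.toString().equals(HOT_KEY)) {
      // Scatter the hot key over a small band of reducers; a lightweight
      // second aggregation pass would recombine these partial results.
      return random.nextInt(HOT_KEY_FANOUT);
    }
    // Default hash partitioning for all other keys.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```

A job would install this with job.setPartitionerClass(SkewAwarePartitioner.class); because the hot key's partial counts then land on several reducers, correctness requires a follow-up step that merges those partial results.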