A Performance Analysis of High-Level MapReduce Query Languages in Big Data
Abstract
The current era is one of big data analytics. One challenge of big data is mining relevant data from huge volumes of data stored in a variety of formats. MapReduce provides a viable way to analyze this type of data, but it has limitations and weaknesses. High-level query languages have therefore evolved for querying massive amounts of data over MapReduce. In this paper, we analyze the performance of three prominent high-level query languages, Pig Latin, HiveQL, and JAQL, based on query processing time. We first stored data in the Hadoop Distributed File System, then processed it with the wordcount and web log processing benchmarks, and analyzed the results. The experimental analysis of the three languages was performed on unstructured data by doubling the size of the dataset.

Keywords: High-level query languages, Pig, Hive, JAQL, Hadoop, Big data
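For readers unfamiliar with the wordcount benchmark named in the abstract, the following is a minimal single-process Python sketch of the map, shuffle, and reduce steps that such a job performs. It is an illustration only: it is not the authors' code, it is not written in Pig Latin, HiveQL, or JAQL, and it does not run on Hadoop. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a MapReduce framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts to obtain one total per word."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory input."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input standing in for unstructured text data.
    sample = ["big data needs big tools", "hadoop processes big data"]
    print(word_count(sample))
```

A high-level query language expresses the same grouping and counting declaratively and leaves the generation of the map and reduce stages to the system, which is why the paper compares the languages by query processing time.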
Namrata Singh (✉), Sanjay Agrawal
Department of Computer Engineering and Applications, National Institute of Technical Teachers' Training and Research, Bhopal, India
© Springer Science+Business Media Singapore 2016. S.C. Satapathy et al. (eds.), Proceedings of the International Congress on Information and Communication Technology, Advances in Intelligent Systems and Computing 438, DOI 10.1007/978-981-10-0767-5_57

1 Introduction
The current era is an age of digital revolution. The emerging trend in digital services and technology is to digitize every minute piece of information. With the growth of the internet, global communication and networking have increased. As a result, the need to store, transmit, and access this information has become very significant. Over the past few years, there has been a tremendous increase in the volume of data. This has given rise to the term big data. Big data has been widely used to describe the exponential growth of data with respect to variety, volume, and velocity, and has thus become one of the major areas of research and analytics nowadays. The key contributors to this growth are the internet, social media, sensors, smartphones, etc. This data needs to be stored and processed. Traditional storage and processing mechanisms such as relational database management systems have failed to process this large amount of data. The big data problem is now being handled by technologies such as NoSQL databases [1] and Hadoop [2]. These technologies provide an effective platform for dealing with the enormous amount of data, which needs to be effectively gathered, processed, and analyzed. Among them, Hadoop can be used to deal with various types of data. Since the data originates from various domains, analytics has become a great challenge for big data. This data is very valuable and acts as a crucial component in analysis as the