An improved query optimization process in big data using ACO-GA algorithm and HDFS map reduce technique
Deepak Kumar1 · Vijay Kumar Jha1
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Storing and retrieving data within a specific time frame is fundamental to any application today, so an efficiently designed query allows the user to obtain results in the desired time and builds credibility for the corresponding application. To overcome the difficulties of query optimization, this paper proposes an improved query optimization process for big data (BD) using the ACO-GA algorithm and HDFS map-reduce. The proposed methodology consists of two phases, namely a BD arrangement phase and a query optimization phase. In the first phase, the input data is pre-processed by computing a hash value (HV) with the SHA-512 algorithm and removing repeated data with the HDFS map-reduce function. Then, features such as closed frequent patterns, support, and confidence are extracted. Next, the support and confidence are managed using an entropy calculation. Based on the entropy calculation, the related information is grouped using the Normalized K-Means (NKM) algorithm. In the second phase, the BD queries are collected and the same features are extracted. Next, the optimized query is found using the ACO-GA algorithm. Finally, the similarity assessment process is performed. The experimental outcomes illustrate that the proposed algorithm outperforms other existing algorithms.

Keywords Secure Hash Algorithm (SHA-512) · Hadoop Distributed File System (HDFS) · Normalized K-Means (NKM) algorithm · Ant Colony Optimization-Genetic Algorithm (ACO-GA)
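As an illustration of the pre-processing step summarised above, the sketch below shows how repeated records can be detected through SHA-512 hash values in a map/reduce style. It is a simplified, in-memory Python sketch under assumed names (map_phase, reduce_phase, records), not the authors' HDFS map-reduce implementation, which runs distributed on a Hadoop cluster.

```python
# Minimal sketch: SHA-512 hash value (HV) computation and duplicate removal,
# mimicking a map step (emit hash/record pairs) and a reduce step (group by
# hash and keep one record per group). Illustrative only, not the paper's code.
import hashlib
from collections import defaultdict

def map_phase(records):
    """Map step: emit (SHA-512 hash value, record) pairs."""
    for record in records:
        hv = hashlib.sha512(record.encode("utf-8")).hexdigest()
        yield hv, record

def reduce_phase(mapped):
    """Reduce step: group by hash value and keep one record per group,
    so exact duplicates are removed."""
    groups = defaultdict(list)
    for hv, record in mapped:
        groups[hv].append(record)
    return [recs[0] for recs in groups.values()]

if __name__ == "__main__":
    records = ["alpha,1", "beta,2", "alpha,1", "gamma,3"]
    print(reduce_phase(map_phase(records)))  # ['alpha,1', 'beta,2', 'gamma,3']
```

In a real deployment the grouping by hash value would be performed by the Hadoop shuffle between mappers and reducers rather than by an in-memory dictionary.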
* Deepak Kumar [email protected]
Vijay Kumar Jha [email protected]
1 Department of Computer Science and Engineering, Birla Institute of Technology Mesra, Ranchi, India
1 Introduction

The analysis of large collections of data is a routine activity in numerous commercial and academic organizations. Internet companies, for example, collect massive quantities of data, such as content produced by service logs, web crawlers, and click-streams [1], and some of the storage systems they use are BD, cloud computing, etc. Data that exceed the storage space of a server and its processing power are called BD [2, 3]. In this era, software platforms are needed to solve dynamic multi-objective BD optimization problems [4]. A BD processing platform is, by definition, a computing platform for processing BD [5]. Current academic research and industrial practice on databases emphasize performance more than energy efficiency [6]. Such data cannot be managed by conventional RDBMS [7, 8] or standard statistical tools. Scrutinizing these data sets might require processing tens or hundreds of terabytes of data. To perform this task, many companies rely on highly distributed software systems running on large clusters of commodity hardware.