Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering



Abstract

The time taken to execute a query and return its results increases exponentially as the data size increases, leading to longer waiting times for the user. Hadoop, with its distributed processing capability, can be considered an efficient solution for processing such large data. Hadoop’s default data placement strategy (HDDPS) places data blocks randomly across the cluster nodes without considering any execution parameters. It is also commonly observed that most data-intensive applications show grouping semantics: during any query execution, only a part of the big data set is used. Because the default placement does not consider this grouping behavior, it performs poorly, leading to increased execution time and query latency. Hence, an optimal data placement strategy based on grouping semantics is proposed. First, the access pattern is identified by analyzing the user history log and is depicted as an execution graph. The grouping pattern of the data is then identified by applying the Markov clustering algorithm. Finally, an optimal data placement algorithm based on statistical measures is proposed, which reorganizes the default data layouts in HDFS. This in turn increases parallel execution, resulting in improved data locality and reduced query execution time compared to HDDPS. The experimental results support the proposed algorithm and show it to be more efficient for big data sets processed in a heterogeneous distributed environment.

Keywords Big data ⋅ Hadoop clustering ⋅ Data placement ⋅ Interest locality ⋅ Grouping semantics ⋅ Graph
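The pipeline outlined in the abstract (history log, execution graph, Markov clustering, placement) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the sample log, the block and node identifiers, and the round-robin placement step are all hypothetical, and the paper's statistical placement measures are not reproduced here. The sketch only assumes that each log entry lists the blocks one query touched, and it uses the standard expansion and inflation steps of Markov clustering.

```python
from itertools import combinations


def build_access_graph(query_log, blocks):
    """Adjacency matrix where [i][j] counts queries that touched both blocks."""
    index = {b: i for i, b in enumerate(blocks)}
    adj = [[0.0] * len(blocks) for _ in blocks]
    for accessed in query_log:
        for a, b in combinations(sorted(set(accessed)), 2):
            adj[index[a]][index[b]] += 1.0
            adj[index[b]][index[a]] += 1.0
    return adj


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adj, inflation=2.0, iterations=30, eps=1e-6):
    """Markov clustering: alternate expansion and inflation, then read groups."""
    n = len(adj)
    # Self-loops are a common MCL convention (an assumption here).
    m = normalize_columns([[adj[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = normalize_columns([[v ** inflation for v in row]
                               for row in m])              # inflation
    # Blocks that still share flow in the final matrix form one group.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if m[i][j] > eps:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def spread_groups(groups, blocks, nodes):
    """Hypothetical placement: round-robin each group's blocks over the nodes."""
    placement = {}
    for group in groups:
        for offset, idx in enumerate(group):
            placement[blocks[idx]] = nodes[offset % len(nodes)]
    return placement


# Hypothetical history log: each entry lists the blocks one query accessed.
sample_blocks = ["b1", "b2", "b3", "b4", "b5", "b6"]
sample_log = [["b1", "b2", "b3"]] * 5 + [["b4", "b5", "b6"]] * 5 + [["b3", "b4"]]
sample_groups = markov_cluster(build_access_graph(sample_log, sample_blocks))
print(spread_groups(sample_groups, sample_blocks, ["node1", "node2", "node3"]))
```

Spreading the blocks of one group over different nodes is what lets a query that needs the whole group read them in parallel. A production version would also have to respect HDFS replication and node capacity, which this sketch ignores.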

S. Vengadeswaran (✉) ⋅ S.R. Balasundaram
National Institute of Technology, Tiruchirappalli 620015, Tamil Nadu, India
e-mail: [email protected]
URL: http://www.nitt.edu

S.R. Balasundaram
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2018
S.K. Bhatia et al. (eds.), Advances in Computer and Computational Sciences, Advances in Intelligent Systems and Computing 554, https://doi.org/10.1007/978-981-10-3773-3_3


1 Introduction

Large volumes of data are generated every day in a variety of domains such as social networks, health care, finance, telecom, and government sectors. The data these domains generate are voluminous (GB, TB, and PB), varied (structured, semi-structured, or unstructured), and growing at an unprecedented pace. Big data is thus the term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time [1]. This deluge of data has led to the use of Hadoop to analyze and gain insights from the data. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models [2–4]. HDFS is a filesystem designed for storing very large files reliably and streaming data with high bandwidth [5]. By optimizing the storage and processing of HDFS, the queries can b
