Parallel Bat Algorithm-Based Clustering Using MapReduce


Abstract In the era of big data, data volumes are growing so rapidly that traditional clustering methods fail on such massive data sets. If the size of the data exceeds the storage capacity or memory of the system, clustering becomes more complex and time intensive. To overcome this problem, this paper proposes a fast and efficient parallel bat algorithm (PBA) for data clustering using the map-reduce architecture. It is efficient because it uses an evolutionary approach to clustering rather than a traditional algorithm such as k-means, and fast because it is parallelized using Hadoop and the map-reduce architecture. The PBA algorithm works by dividing the large data set into small blocks and clustering these smaller data blocks in parallel. The proposed algorithm inherits the features of the bat algorithm to cluster the data set. It is validated on five benchmark data sets against particle swarm optimization with different numbers of nodes. Experimental results show that PBA gives competitive results compared with particle swarm optimization and provides significant speedup as the number of nodes increases.

Keywords Bat algorithm ⋅ Parallel bat algorithm ⋅ Map-reduce ⋅ Hadoop

T. Ashish (✉) ⋅ S. Kapil ⋅ B. Manju
Jaypee Institute of Information Technology Noida, Delhi Technological University Delhi, IP College of Women Delhi, New Delhi, India

© Springer Nature Singapore Pte Ltd. 2018
G.M. Perez et al. (eds.), Networking Communication and Data Knowledge Engineering, Lecture Notes on Data Engineering and Communications Technologies 4, https://doi.org/10.1007/978-981-10-4600-1_7


1 Introduction

Clustering is a popular analysis technique in data science, used in many applications and disciplines. It identifies homogeneous groups of objects based on the values of their attributes. Clustering can be of two types: hierarchical and partitioning. Hierarchical clustering works by two techniques, division and agglomeration of data clusters. Division breaks large clusters into smaller ones, and agglomeration merges small clusters into the nearest cluster. In partition-based clustering, the center of each cluster is used to compute an objective function, and the value of this function is optimized by updating the cluster centers, called centroids. Clustering is widely applied in data mining, data compression, pattern recognition, and machine learning.

K-means is a clustering algorithm that works on the greedy principle. It partitions n data samples into k clusters so as to minimize the sum of the Euclidean distances of all data samples from their cluster centers. However, the major drawbacks of this algorithm are as follows:

∙ There is no proper initialization method; initialization is generally done randomly.
∙ Due to h
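The k-means objective described above (the sum of Euclidean distances from each sample to its nearest cluster center) can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' implementation: the function names, the toy data, and the block size are all assumptions made for the example. It also shows why such an objective suits a map-reduce style of evaluation: the total is a plain sum, so it can be computed as partial sums over data blocks (a "map" step) that are then added together (a "reduce" step).

```python
import math


def nearest_distance(point, centroids):
    """Euclidean distance from one sample to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)


def objective(points, centroids):
    """Sum of distances of all samples to their nearest centroids."""
    return sum(nearest_distance(p, centroids) for p in points)


def split_blocks(points, block_size):
    """Split the data set into consecutive blocks (hypothetical helper)."""
    return [points[i:i + block_size] for i in range(0, len(points), block_size)]


def blockwise_objective(points, centroids, block_size):
    """Evaluate the same objective block by block.

    'Map': compute a partial sum for every block independently.
    'Reduce': add the partial sums. Here the blocks are processed
    sequentially; in a real cluster each block could go to a different node.
    """
    partials = [objective(block, centroids)
                for block in split_blocks(points, block_size)]
    return sum(partials)


# Toy data, assumed purely for illustration.
points = [(0.0, 0.0), (0.0, 3.0), (10.0, 0.0), (10.0, 4.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]

total = objective(points, centroids)
blocked = blockwise_objective(points, centroids, block_size=2)
print(total, blocked)
```

Because the objective decomposes over samples, the blockwise value matches the value computed on the whole data set, regardless of how the blocks are chosen.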
