Parallel Bat Algorithm-Based Clustering Using MapReduce



Abstract In the era of big data, data volumes are growing so rapidly that traditional clustering methods fail on massive data sets. When the size of the data exceeds the storage capacity or memory of the system, clustering becomes more complex and time intensive. To overcome this problem, this paper proposes a fast and efficient parallel bat algorithm (PBA) for data clustering using the map-reduce architecture. It is efficient because it uses an evolutionary approach to clustering rather than a traditional algorithm such as k-means, and fast because it is parallelized using Hadoop and the map-reduce architecture. The PBA algorithm works by dividing the large data set into small blocks and clustering these smaller data blocks in parallel. The proposed algorithm inherits the features of the bat algorithm to cluster the data set. It is validated on five benchmark data sets against particle swarm optimization with different numbers of nodes. Experimental results show that the PBA algorithm gives results competitive with particle swarm optimization and provides significant speedup as the number of nodes increases.

Keywords Bat algorithm ⋅ Parallel bat algorithm ⋅ Map-reduce ⋅ Hadoop
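The block-wise evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a candidate solution is a list of centroids and that its fitness is the sum of Euclidean distances from every sample to its nearest centroid. The function names (`split_into_blocks`, `map_block`, `reduce_partials`, `fitness`) are hypothetical, and the blocks are processed sequentially here, whereas on Hadoop each block would be handled by a separate mapper.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def split_into_blocks(data: List[Point], n_blocks: int) -> List[List[Point]]:
    """Split the data set into roughly equal contiguous blocks (hypothetical helper)."""
    size = max(1, math.ceil(len(data) / n_blocks))
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_block(block: List[Point], centroids: List[Point]) -> float:
    """Map step: partial sum of distances from each point in one block
    to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) for p in block)


def reduce_partials(partials: List[float]) -> float:
    """Reduce step: combine the per-block partial sums into one fitness value."""
    return sum(partials)


def fitness(data: List[Point], centroids: List[Point], n_blocks: int = 2) -> float:
    """Evaluate one candidate set of centroids block by block."""
    blocks = split_into_blocks(data, n_blocks)
    # Sequential here; on a cluster each block would go to its own mapper.
    partials = [map_block(b, centroids) for b in blocks]
    return reduce_partials(partials)


# Toy data: two tight groups and one centroid placed in each group.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(fitness(data, centroids, n_blocks=2))  # prints 2.0
```

Because the partial sums are simply added, the fitness value does not depend on how many blocks the data is split into, which is what makes this kind of evaluation straightforward to distribute.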

T. Ashish (✉) ⋅ S. Kapil ⋅ B. Manju
Jaypee Institute of Information Technology Noida, Delhi Technological University Delhi, IP College of Women Delhi, New Delhi, India
e-mail: [email protected]; [email protected]; [email protected]
S. Kapil e-mail: [email protected]
B. Manju e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2018
G.M. Perez et al. (eds.), Networking Communication and Data Knowledge Engineering, Lecture Notes on Data Engineering and Communications Technologies 4, https://doi.org/10.1007/978-981-10-4600-1_7


1 Introduction

Clustering is a popular analysis technique in data science, used in many applications and disciplines. It identifies homogeneous groups of objects based on the values of their attributes. Clustering is of two types: hierarchical and partitioning. Hierarchical clustering works through two techniques, division and agglomeration of data clusters. Division breaks large clusters into smaller ones, and agglomeration merges small clusters into the nearest cluster. In partition-based clustering, the center of each cluster is used to compute an objective function, and the value of this function is optimized by updating the cluster centers, called centroids. Clustering is widely applied in data mining, data compression, pattern recognition, and machine learning.

K-means is a clustering algorithm that works on the greedy principle. It partitions the n data samples into k clusters so as to minimize the sum of the Euclidean distances of all data samples from their cluster centers. However, the major drawbacks of this algorithm are as follows:
∙ There is no proper method of initialization; it is generally done randomly.
∙ Due to h