A Novel MapReduce Based k-Means Clustering
Abstract  Data clustering is inevitable in today's era of data deluge. k-Means is a popular partition-based clustering technique. However, with the increase in the size and complexity of data, it is no longer suitable, and there is an urgent need to shift towards parallel algorithms. We present a MapReduce based k-Means clustering algorithm that is scalable and fault tolerant. The major advantage of the proposed work is that it dynamically determines the number of clusters, unlike k-Means, where the final number of clusters has to be specified in advance. MapReduce jobs are iteration sensitive, as repeated reads from and writes to the file system increase both the cost and the computation time. The proposed algorithm is not iterative: it reads the data from the file system and writes the output back only once. We show that the proposed algorithm performs better than an Improved MapReduce based k-Means clustering algorithm.

Keywords  Davies-Bouldin index ⋅ MapReduce ⋅ Clustering ⋅ k-Means
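Although the abstract stresses that the proposed algorithm is single-pass, it builds on the classic MapReduce decomposition of k-Means: mappers assign each point to its nearest centroid, the shuffle groups points by centroid id, and reducers recompute each centroid as the mean of its cluster. A minimal one-round sketch in plain Python, where an in-memory dictionary stands in for Hadoop's shuffle phase (function names are illustrative, not taken from the paper):

```python
from collections import defaultdict

def mapper(point, centroids):
    # Emit (nearest_centroid_id, point) using squared Euclidean distance
    best = min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return best, point

def reducer(cluster_id, points):
    # Recompute the centroid as the coordinate-wise mean of the cluster
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def one_mapreduce_round(data, centroids):
    groups = defaultdict(list)
    for point in data:
        key, value = mapper(point, centroids)
        groups[key].append(value)          # shuffle: group values by key
    return {k: reducer(k, pts) for k, pts in groups.items()}

data = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = one_mapreduce_round(data, centroids)
# new_centroids → {0: [1.25, 1.5], 1: [8.5, 8.75]}
```

In iterative MapReduce k-Means this round is repeated until the centroids stabilize, incurring a file-system read and write per round; avoiding exactly that repetition is the paper's stated motivation for a single-pass design.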
1 Introduction

There is a remarkable growth in data generation from multiple sources such as social media, business enterprises, sensors and so on [1, 2]. The IDC Digital Universe study estimates that the amount of digital data will grow to 40 zettabytes by 2020, roughly doubling each year [3]. Acquiring the knowledge hidden in this huge amount of data is a big challenge. Clustering is a powerful unsupervised data mining technique which is very useful for this purpose [4]. However, as the size of the data is very large, a single machine is no longer suitable for clustering. The need is to make a gradual shift towards the use of multiple machines with distributed storage and distributed processing [5]. Clustering divides the data into groups called clusters, where objects within one cluster are more similar to each other than to objects belonging to other clusters. One of the most popular and commonly used clustering algorithms is k-Means. It is simple to implement and robust. However, k-Means has some inherent weaknesses. For example, the user has to specify the number of clusters before the algorithm starts, which may lead to the resolution problem. The generated clusters are also very sensitive to the selection of the initial seeds [6]. Therefore, many algorithms [7, 8, 9] have been reported to improve the performance of k-Means. However, all such algorithms have been developed for standalone systems and hence are very slow at processing large data. Therefore, many efforts have been made to implement k-Means on multiple machines [6, 10–13], some of which are

Ankita Sinha (✉) ⋅ P.K. Jana
Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India
e-mail: [email protected]
P.K. Jana, e-mail: [email protected]
© Springer Science+Business Media Singapore 2017. J.K. Mandal et al. (eds.), Proceedings of the First International Conference on Intelligent Computing and Communication, Advances in Intelligent Systems and Computing 458, DOI 10.1007/978-981-10-2035-3_26
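The seed sensitivity noted above is easy to reproduce: running Lloyd's k-Means on the same data with two different initializations can converge to two different partitions. A small one-dimensional illustration (this helper is a hypothetical sketch, not the paper's code):

```python
def kmeans_1d(data, seeds, iters=20):
    """Plain 1-D Lloyd's k-Means: alternate assign-to-nearest and recompute-mean."""
    centroids = list(seeds)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [0, 4, 5, 9]
run_a = kmeans_1d(data, seeds=[0, 9])   # converges to centroids [2.0, 7.0]
run_b = kmeans_1d(data, seeds=[0, 4])   # converges to centroids [0.0, 6.0]
```

Both runs reach a fixed point of the algorithm, yet they partition the data differently ({0, 4} vs. {5, 9} in the first run, {0} vs. {4, 5, 9} in the second), which is precisely the local-optimum behaviour that motivates seed-selection and cluster-count improvements to k-Means.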