A Novel MapReduce Based k-Means Clustering
Abstract  Data clustering is inevitable in today's era of data deluge. k-Means is a popular partition-based clustering technique. However, with the increase in the size and complexity of data, it is no longer suitable, and there is an urgent need to shift towards parallel algorithms. We present a MapReduce based k-Means clustering algorithm that is scalable and fault tolerant. The major advantage of the proposed work is that it dynamically determines the number of clusters, unlike k-Means, where the final number of clusters has to be specified in advance. MapReduce jobs are iteration sensitive, as repeated reads from and writes to the file system increase both the cost and the computation time. The proposed algorithm is not iterative: it reads the data from the file system and writes the output back only once. We show that the proposed algorithm performs better than an Improved MapReduce based k-Means clustering algorithm.

Keywords  Davies-Bouldin index ⋅ MapReduce ⋅ Clustering ⋅ k-Means
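Although the abstract stresses that the proposed algorithm is single-pass, it builds on the classic MapReduce decomposition of k-Means: mappers assign each point to its nearest centroid, the shuffle groups points by centroid id, and reducers recompute each centroid as the mean of its cluster. A minimal one-round sketch in plain Python, where an in-memory dictionary stands in for Hadoop's shuffle phase (function names are illustrative, not taken from the paper):

```python
from collections import defaultdict

def mapper(point, centroids):
    # Emit (nearest_centroid_id, point) using squared Euclidean distance
    best = min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return best, point

def reducer(cluster_id, points):
    # Recompute the centroid as the coordinate-wise mean of the cluster
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def one_mapreduce_round(data, centroids):
    groups = defaultdict(list)
    for point in data:
        key, value = mapper(point, centroids)
        groups[key].append(value)          # shuffle: group values by key
    return {k: reducer(k, pts) for k, pts in groups.items()}

data = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
new_centroids = one_mapreduce_round(data, centroids)
# new_centroids → {0: [1.25, 1.5], 1: [8.5, 8.75]}
```

In iterative MapReduce k-Means this round is repeated until the centroids stabilize, incurring a file-system read and write per round; avoiding exactly that repetition is the paper's stated motivation for a single-pass design.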
1 Introduction

There is a remarkable growth in data generation from multiple sources such as social media, business enterprises, sensors and so on [1, 2]. The IDC Digital Universe study estimates that the amount of digital data will grow to 40 zettabytes by 2020, roughly doubling each year [3]. Acquiring the knowledge hidden in this huge amount of data is a big challenge. Clustering is a powerful unsupervised data mining technique which is very useful for this purpose [4]. However, as the size of the data is very large, a single machine is no longer suitable for clustering. The need is to make a gradual shift towards the use of multiple machines with distributed storage and distributed processing [5]. Clustering divides the data into groups called clusters, where objects within one cluster are more similar to each other than to objects belonging to other clusters. One of the most popular and commonly used clustering algorithms is k-Means. It is simple to implement and robust. However, k-Means has some inherent weaknesses. For example, the user has to specify the number of clusters before the algorithm starts, which may lead to the resolution problem. The generated clusters are also very sensitive to the selection of the initial seeds [6]. Therefore, many algorithms [7, 8, 9] have been reported to improve the performance of k-Means. However, all such algorithms have been developed for standalone systems and hence are very slow at processing large data. Therefore, many efforts have been made to implement k-Means on multiple machines [6, 10–13], some of which are

Ankita Sinha (✉) ⋅ P.K. Jana
Department of Computer Science and Engineering, Indian School of Mines, Dhanbad, India
e-mail: [email protected]
P.K. Jana, e-mail: [email protected]
© Springer Science+Business Media Singapore 2017. J.K. Mandal et al. (eds.), Proceedings of the First International Conference on Intelligent Computing and Communication, Advances in Intelligent Systems and Computing 458, DOI 10.1007/978-981-10-2035-3_26
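The seed sensitivity noted above is easy to reproduce: running Lloyd's k-Means on the same data with two different initializations can converge to two different partitions. A small one-dimensional illustration (this helper is a hypothetical sketch, not the paper's code):

```python
def kmeans_1d(data, seeds, iters=20):
    """Plain 1-D Lloyd's k-Means: alternate assign-to-nearest and recompute-mean."""
    centroids = list(seeds)
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Keep the old centroid if a cluster ends up empty
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

data = [0, 4, 5, 9]
run_a = kmeans_1d(data, seeds=[0, 9])   # converges to centroids [2.0, 7.0]
run_b = kmeans_1d(data, seeds=[0, 4])   # converges to centroids [0.0, 6.0]
```

Both runs reach a fixed point of the algorithm, yet they partition the data differently ({0, 4} vs. {5, 9} in the first run, {0} vs. {4, 5, 9} in the second), which is precisely the local-optimum behaviour that motivates seed-selection and cluster-count improvements to k-Means.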