An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm


ORIGINAL CONTRIBUTION

Tanvir Habib Sardar1 · Zahid Ansari2

Received: 8 November 2018 / Accepted: 28 August 2020 / © The Institution of Engineers (India) 2020

Abstract Clustering is considered one of the important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are modified using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm designed for the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when the dataset and the Hadoop cluster are large.

Keywords MapReduce · Hadoop · Parallel K-means · Document clustering · Distributed computing

Zahid Ansari [email protected]
Tanvir Habib Sardar [email protected]

1 School of Computer Science and Engineering, Jain University, Bengaluru, India
2 P.A. College of Engineering, Mangaluru, India

Introduction Data mining is the process of obtaining useful knowledge from raw datasets [1]. Clustering is a well-known data mining technique that groups similar data objects in a dataset according to the similarity among them [2]. Clustering algorithms are applied to text datasets to group similar documents implicitly, based on the words the documents share [3]. The document dataset is first processed to obtain its set of unique words. The more words two text files have in common, the more similar they are, and the more likely they are to belong to the same group [4]. Document clustering has many applications, such as organizing documents into a hierarchy, locating a particular text file within a document dataset spread over multiple folders and text files, and filtering text file information, to name a few [5, 6].

K-Means is the most widely used clustering algorithm due to its simplicity and usefulness [7]. K-Means is a partition-based clustering algorithm that categorizes the objects of the input dataset into a pre-specified number of groups, K. In K-Means, K objects are first chosen at random from the dataset (known as centroids or cluster centres), and then the distance of each object in the dataset to each of these cluster centres is measured. The resulting distances provide the basis of similarity between the objects in the dataset. Each object is assigned to the group of the cluster centre to which it is most similar (least in distance). After this, the mean value of each newly formed group (the group belonging to each cluster centre) is taken, and these K new mean values become new cluste