An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm
ORIGINAL CONTRIBUTION
Tanvir Habib Sardar¹ • Zahid Ansari²
Received: 8 November 2018 / Accepted: 28 August 2020
© The Institution of Engineers (India) 2020
Abstract Clustering is one of the important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are modified using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm designed for the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when the dataset and the Hadoop cluster are large.
Keywords MapReduce · Hadoop · Parallel K-means · Document clustering · Distributed computing
¹ School of Computer Science and Engineering, Jain University, Bengaluru, India
² P.A. College of Engineering, Mangaluru, India

Introduction Data mining is the process of obtaining useful knowledge from raw datasets [1]. Clustering is a well-known data mining technique that groups the objects of a dataset according to the similarity among them [2]. Clustering algorithms are used on text datasets to group similar documents implicitly, based on the occurrence of the most similar words among the documents [3]. The document datasets are processed to obtain a set of unique words. The more words two text files have in common, the more similar they are, and thus the stronger the case for placing them in the same group [4]. Document clustering has many applications, such as organizing documents in a hierarchy, finding a particular text file in a document dataset spread over multiple folders and files, and filtering text file information, to name a few [5, 6].

K-Means is the most widely used clustering algorithm due to its simplicity and usefulness [7]. K-Means is a partition-based clustering algorithm that categorizes the objects of the input dataset into a pre-specified number of groups, K. In K-Means, K objects are first chosen at random from the dataset as the initial centroids (cluster centres), and the distance from each object in the dataset to each of these cluster centres is measured. This distance serves as the measure of similarity: an object is assigned to the group of the cluster centre to which it is most similar (least distant). After this, the mean of each newly formed group is computed, and these K new mean values become new cluste
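The K-Means steps described above (random choice of K centres, assignment of each object to its nearest centre, recomputation of each centre as the group mean) can be illustrated with a minimal single-machine sketch. It is not the MapReduce implementation analysed in this paper. The function names, the term-frequency document representation and the use of squared Euclidean distance are assumptions made only for illustration.

```python
import random
from collections import Counter


def term_vectors(docs):
    """Represent each document as a term-frequency vector over the unique words."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means: returns (labels, centres)."""
    rng = random.Random(seed)
    # Step 1: pick K objects at random as the initial cluster centres.
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest (most similar) centre.
        labels = [min(range(k), key=lambda c: sq_dist(p, centres[c]))
                  for p in points]
        # Step 3: the mean of each group becomes the new centre.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centre if a group happens to be empty.
            new_centres.append(mean(members) if members else centres[c])
        if new_centres == centres:  # centres stopped moving: converged
            break
        centres = new_centres
    return labels, centres


# Hypothetical toy corpus: two documents about Hadoop, two about football.
docs = [
    "hadoop mapreduce cluster",
    "mapreduce hadoop job",
    "football match goal",
    "goal football score",
]
vocab, vectors = term_vectors(docs)
labels, centres = kmeans(vectors, k=2)
print(labels)
```

In this toy corpus, documents that share more words lie closer together in the term-frequency space, so the two Hadoop documents end up in one group and the two football documents in the other. Which group receives label 0 depends on the random initial centres.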