An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm


ORIGINAL CONTRIBUTION

Tanvir Habib Sardar¹ • Zahid Ansari²

Received: 8 November 2018 / Accepted: 28 August 2020
© The Institution of Engineers (India) 2020

Abstract Clustering is considered one of the important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are modified using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm designed for the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when both the dataset and the Hadoop cluster are large.

Keywords MapReduce · Hadoop · Parallel K-means · Document clustering · Distributed computing

¹ School of Computer Science and Engineering, Jain University, Bengaluru, India
² P.A. College of Engineering, Mangaluru, India

Introduction

Data mining is the process of obtaining useful knowledge from raw datasets [1]. Clustering is a well-known data mining technique that groups similar data objects in a dataset based on the similarity among them [2]. Clustering algorithms are applied to text datasets to implicitly group similar documents based on the occurrence of the most similar words among the documents [3]. The document datasets are processed to obtain a set of unique words. The more words two text files have in common, the more similar they are, and thus the stronger the case for placing them in the same group [4]. Document clustering has many applications, such as organizing documents in a hierarchy, finding a particular text file in a document dataset spread over multiple folders and text files, and filtering text file information, to name a few [5, 6].

K-Means is the most widely used clustering algorithm due to its simplicity and usefulness [7]. K-Means is a partition-based clustering algorithm that categorizes the objects of the input dataset into a pre-specified number of groups, K. In K-Means, K objects are first chosen at random from the dataset (known as centroids or cluster centres); then the distance from each object in the dataset to each of these cluster centres is measured. The resulting distances serve as the measure of similarity between objects in the dataset. Each object is assigned to the group of the cluster centre to which it is most similar (least distant). After this, the mean of each newly formed group is computed, and these K new mean values become the new cluster centres [...]
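The K-Means steps described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the MapReduce implementation the paper evaluates. The term-count document representation, the squared Euclidean distance, the stopping rule, and every function name (`term_vector`, `squared_distance`, `nearest`, `k_means`) are assumptions made for the example, as are the four toy documents.

```python
import random
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def squared_distance(a, b):
    """Squared Euclidean distance (an assumed choice of distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centres):
    """Index of the cluster centre closest to the given vector."""
    return min(range(len(centres)),
               key=lambda i: squared_distance(vector, centres[i]))


def k_means(vectors, k, max_iterations=100, seed=0):
    """Plain K-Means: random initial centres, assign, recompute means."""
    rng = random.Random(seed)
    # Step 1: choose K objects from the dataset as initial cluster centres.
    centres = [list(v) for v in rng.sample(vectors, k)]
    assignment = []
    for _ in range(max_iterations):
        # Step 2: assign each object to its least-distant cluster centre.
        assignment = [nearest(v, centres) for v in vectors]
        # Step 3: the mean of each group becomes the new cluster centre.
        new_centres = []
        for i in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == i]
            if members:
                new_centres.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[i])  # keep centre of an empty group
        if new_centres == centres:  # assumed stopping rule: centres unchanged
            break
        centres = new_centres
    return assignment, centres


# Toy document dataset (hypothetical): two topics, two documents each.
documents = [
    "hadoop cluster node hadoop",
    "hadoop node cluster cluster",
    "recipe sugar flour sugar",
    "flour recipe sugar flour",
]
# The set of unique words obtained from the document dataset.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
vectors = [term_vector(doc, vocabulary) for doc in documents]
labels, centres = k_means(vectors, k=2)
print(labels)
```

Documents that share many words end up with nearby term-count vectors, so the two Hadoop-related documents fall into one group and the two recipe documents into the other. Which group receives label 0 depends on the random initial centres.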
