An entropy-based initialization method of K-means clustering on the optimal number of clusters


ORIGINAL ARTICLE

Kuntal Chowdhury1 • Debasis Chaudhuri2 • Arup Kumar Pal1

Received: 12 February 2020 / Accepted: 26 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Clustering is an unsupervised learning approach that groups similar features according to a specific mathematical criterion, known as the objective function. Every clustering method depends on some objective function. K-means is one of the most widely used partitional clustering algorithms, and its performance depends on the initial seed points and the value of K. In this paper, we address both of these parameters together. We define an entropy-based objective function for the initialization process, which performs better than other existing initialization methods for K-means clustering. We also design an algorithm that calculates the correct number of clusters in a dataset using cluster validity indexes. The proposed entropy-based initialization algorithm is applied to different 2D and 3D datasets and compared with other existing initialization methods.

Keywords Clustering • Cluster validity indexes • Unsupervised • K-means

Kuntal Chowdhury [email protected]
Debasis Chaudhuri [email protected]
Arup Kumar Pal [email protected]

1 Department of CSE, Indian Institute of Technology (Indian School of Mines) [IIT(ISM)], Dhanbad, Jharkhand, India
2 Deputy General Manager, DRDO Integration Centre, Panagarh, West Bengal, India

1 Introduction

Clustering is a form of unsupervised learning in which the given data are grouped into classes according to a criterion function [18]. It is a favored technique for assigning data to similar classes based on specific features. From a machine learning perspective, the clusters correspond to hidden patterns, and the search for them is unsupervised learning. Algorithms and methods for clustering analysis provide core techniques for numerous applications, such as information retrieval, text mining [5], and weblog analysis [39]. The choice of the number of clusters and the initial positions of the seed points are the essential factors for partitional clustering algorithms to produce good-quality clusters. The K-means algorithm can be applied to large datasets provided the value of K is known in advance [17]. The literature describes different methods for automatically detecting the optimal value of K [10, 32, 40]. Another important application of optimality in clustering is in wireless sensor networks, where it is used to make data transmission more energy efficient and to prolong the network lifetime [19].
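The sensitivity of K-means to its initial seed points can be illustrated with a minimal sketch. This is plain Lloyd-style K-means, not the entropy-based initialization proposed in this paper; the toy dataset, the seed values, and the function names `sq_dist`, `kmeans`, and `sse` are assumptions made purely for illustration. The same data and the same K = 3 are run from two different seed sets, and the final sum of squared errors (SSE) is compared.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, seeds, max_iter=100):
    """Lloyd-style K-means started from explicit seeds; K = len(seeds)."""
    centers = [tuple(s) for s in seeds]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest current center.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:  # converged: assignments are now stable
            break
        centers = new_centers
    return centers, labels


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data: three well-separated 2D groups of four points each.
points = [(x + dx, dy) for x in (0, 10, 20) for dx in (0, 1) for dy in (0, 1)]

# One seed inside each group.
good_centers, good_labels = kmeans(points, [(0, 0), (10, 0), (20, 0)])
good_sse = sse(points, good_centers, good_labels)

# Two seeds inside the first group, one seed between the other two groups.
bad_centers, bad_labels = kmeans(points, [(0, 0.5), (1, 0.5), (15, 0.5)])
bad_sse = sse(points, bad_centers, bad_labels)

print("SSE with one seed per group:", good_sse)  # 6.0
print("SSE with poorly placed seeds:", bad_sse)  # 205.0
```

With poorly placed seeds the iteration converges to a local optimum that splits one group in two and merges the other two, even though K is correct. This is the behaviour that seed-selection methods aim to avoid.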

1.1 Similar literature on initialization algorithms

This section describes different works on initial seed selection for the K-means algorithm. To achieve the global optimum results