Improved Clustering for Categorical Data with Genetic Algorithm


Abstract Clustering is one of the most significant unsupervised learning techniques, where the aim is to partition a data set into homogeneous groups called clusters. Many real-world data sets contain categorical values, yet many clustering algorithms work only on numeric values, which limits their use in data mining. The k-modes algorithm is very effective for partitioning categorical data sets, but it can stop at a locally optimal solution because the result depends on the initial cluster centres. The proposed algorithm uses a genetic algorithm (GA) to optimize the k-modes clustering algorithm. The idea is that candidate solutions which pick noise points as cluster centres incur a high clustering cost, so they are not carried over to the next generation, and the search therefore does not get stuck in suboptimal solutions. The superiority of the proposed algorithm is demonstrated on several real-life data sets in terms of accuracy; the results show that it is efficient and yields encouraging results, especially for large data sets.

Keywords Clustering · Categorical data · Genetic algorithm · k-modes algorithm
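The central idea in the abstract is to score each candidate set of cluster centres (modes) by its k-modes clustering cost, so that a GA can favour low-cost chromosomes and discard those built around noise. The following is a minimal Python sketch of such a fitness computation using simple-matching dissimilarity; the function names and toy data are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence

def matching_dissimilarity(x: Sequence[str], mode: Sequence[str]) -> int:
    """Simple-matching dissimilarity: number of attributes on which x and mode differ."""
    return sum(1 for a, b in zip(x, mode) if a != b)

def kmodes_cost(data: List[Sequence[str]], modes: List[Sequence[str]]) -> int:
    """Total cost of assigning every object to its nearest mode.

    A GA can use the negative (or reciprocal) of this cost as the fitness of a
    chromosome that encodes candidate modes: noisy or badly placed modes give a
    high cost and are therefore unlikely to survive selection.
    """
    return sum(min(matching_dissimilarity(x, m) for m in modes) for x in data)

# Toy categorical data set (hypothetical example).
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
]

good_modes = [("red", "small", "round"), ("blue", "large", "round")]
bad_modes = [("red", "small", "round"), ("green", "tiny", "flat")]  # second mode acts like noise

print(kmodes_cost(data, good_modes))  # low cost -> high GA fitness
print(kmodes_cost(data, bad_modes))   # high cost -> low GA fitness
```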

1 Introduction

The ever-growing data in almost all fields can contribute significantly to future decision-making by extracting hidden but potentially useful information embedded in the data. Looking more closely at the clustering problem, many clustering methods require the designer to provide the number and names of clusters as input. Unfortunately, the designer usually has no idea about the inherent structure of huge data sets. In addition, the clustering result is sensitive to the selection of the initial cluster centres; this sensitivity may make the algorithm converge to a local optimum. So the most challenging and difficult task is the determination of the number and names of clusters in a data set, which is a basic input parameter for most clustering algorithms.
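To make the sensitivity to initial centres concrete, the sketch below runs a bare-bones k-modes loop from several random seeds on the same toy categorical data; it is a simplified illustration under assumed data and function names, not the algorithm evaluated in this paper.

```python
import random
from collections import Counter

def dissim(x, mode):
    # Simple-matching dissimilarity between two categorical objects.
    return sum(a != b for a, b in zip(x, mode))

def kmodes(data, k, seed, max_iter=20):
    """Bare-bones k-modes: random initial modes, assign objects, update modes, repeat.

    Returns the final clustering cost; different seeds may converge to different
    local optima because the result depends on the initial modes.
    """
    rng = random.Random(seed)
    modes = rng.sample(data, k)
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda i: dissim(x, modes[i]))
            clusters[j].append(x)
        # Update step: the new mode takes the most frequent value of each attribute.
        new_modes = []
        for j, cluster in enumerate(clusters):
            if not cluster:                 # keep the old mode for an empty cluster
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(
                Counter(col).most_common(1)[0][0] for col in zip(*cluster)
            ))
        if new_modes == modes:              # converged
            break
        modes = new_modes
    return sum(min(dissim(x, m) for m in modes) for x in data)

# Hypothetical categorical data with two natural groups plus an outlier.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "w", "t"),
]

for seed in (0, 1, 2):
    print(seed, kmodes(data, k=2, seed=seed))  # final costs can differ across seeds
```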

Clustering [1–3] is an important unsupervised classification technique which groups the data objects in a database in such a way that objects with similar patterns reside in the same cluster and objects in different clusters are dissimilar in the same sense [4, 5]. Clustering has been applied effectively to a variety of engineering and scientific applications such as bio-informatics, astronomy, medical imaging, remote sensing, physics, etc. The data matrix and the dissimilarity matrix are the two basic data structures used for clustering; if the data is not in one of these formats, it needs to be preprocessed into a suitable one [6]. Clustering algorithms are generally classified into two categories: hierarchical and partitioning. A hierarchical clustering algorithm builds a hierarchy of partitions, with one partition at each level. This paper