Rough Mode: A Generalized Centroid Proposal for Clustering Categorical Data Using the Rough Set Theory




1 Polytechnic School of Tunisia, B.P. 743 Rue El khawarizmi, 2078 Al Marsá, Tunis, Tunisia
2 Virtual Reality and Information Technologies, Military Academy of Fondouk Jedid, Tunis, Tunisia
3 Digital Research Center of Sfax, B.P. 275, 3021 Sakiet Ezzit, Sfax, Tunisia

Abstract. Clustering is a widely used Data Mining method that aims to partition a given dataset into homogeneous groups according to some predefined similarity criterion. The k-modes is a well-known categorical clustering method that uses the notion of a mode to represent the centroid of a partition during the clustering process. The mode is a vector containing the most frequent modality of each attribute. However, in the original version the mode is selected randomly in each iteration, although many other candidate modes can be proposed. In this paper, a new approach is developed that generates the potential candidate modes for each cluster in each iteration using their relative density. The obtained modes are then arranged into the upper and lower approximations of the Rough Set Theory in order to identify the most pertinent ones. The proposed method was tested on two real-world datasets and compared to the standard k-modes; the experiments showed that it provided higher accuracy.

Keywords: Clustering categorical data · k-modes · Data Mining · Rough Set Theory
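As an illustration of why a cluster may admit several modes, the following minimal Python sketch enumerates every vector that qualifies as a mode when modalities are tied in frequency. The function name `candidate_modes` and the toy cluster are assumptions made for this example only. The sketch covers just the tie enumeration; it does not implement the density-based generation or the Rough Set ranking proposed in the paper.

```python
from collections import Counter
from itertools import product


def candidate_modes(cluster):
    """Return every vector that is a valid mode of `cluster`.

    `cluster` is a list of equal-length tuples of categorical values.
    Hypothetical helper for illustration; not the paper's algorithm.
    """
    tied_per_attribute = []
    # zip(*cluster) transposes the rows into one column per attribute.
    for column in zip(*cluster):
        counts = Counter(column)
        top = max(counts.values())
        # Keep all modalities that reach the maximum frequency.
        tied_per_attribute.append(
            sorted(value for value, count in counts.items() if count == top)
        )
    # Each combination of tied modalities is an equally valid mode.
    return list(product(*tied_per_attribute))


# Toy cluster (assumed data): colour and size are tied, shape is not.
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "square"),
    ("blue", "large", "round"),
]
modes = candidate_modes(cluster)
for mode in modes:
    print(mode)
```

With two tied attributes of two modalities each, this toy cluster has four equally frequent modes, and the standard k-modes would keep only one of them at random.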

S. Ben Salem et al.
© Springer Nature Switzerland AG 2019. Y. Farhaoui and L. Moussaid (Eds.): ICBDSDE 2018, SBD 53, pp. 225–236, 2019. https://doi.org/10.1007/978-3-030-12048-1_24

1 Introduction

Clustering, also known as unsupervised learning, is a pattern recognition technique used in various fields, including computer science and computer vision. It is a complex task: the final shape of the clusters cannot be determined in advance, and no parameter is required except the number of clusters K. The goal of clustering is to separate a finite unlabeled dataset into a finite and discrete set of clusters containing homogeneous observations. In the literature, many clustering methods have been proposed, either for numeric data, such as the k-means and its variants [1–5], or for categorical data, such as the k-modes and its variants [4, 6–14]. In all these approaches, the clustering process uses the centroid of each cluster to move from iteration i to iteration (i + 1) in order to improve the clustering accuracy. The centroid of a categorical dataset is called the mode and is built from the most frequent modality of each attribute, without taking into account that two or more modalities may have the same frequency in an attribute. In that case, one of the tied modalities is selected at random. This random selection needs to be replaced by a more appropriate method for selecting the most suitable centroids. In this paper we propose to extend the notion of the mode using the Rough Set Theory (RST) in order to define multiple modes when creating the set of candidate modes. This issue takes into accoun