Rough Mode: A Generalized Centroid Proposal for Clustering Categorical Data Using the Rough Set Theory



1 Polytechnic School of Tunisia, B.P. 743 Rue El Khawarizmi, 2078 Al Marsá, Tunis, Tunisia
2 Virtual Reality and Information Technologies, Military Academy of Fondouk Jedid, Tunis, Tunisia
3 Digital Research Center of Sfax, B.P. 275, 3021 Sakiet Ezzit, Sfax, Tunisia

Abstract. Clustering is a widely used Data Mining method that aims to partition a given dataset into homogeneous groups according to some predefined similarity criterion. The k-modes is a well-known categorical clustering method that uses the notion of a mode to represent the centroid of a partition during the clustering process. The mode is a vector containing the most frequent modality of each attribute. However, in its original version, the mode is selected randomly in each iteration, although many other candidate modes can be proposed. In this paper, a new approach is developed that generates potential candidate modes for each cluster in each iteration using their relative density. The obtained modes are then arranged into the upper and lower approximations of the Rough Set Theory in order to identify the most pertinent ones. The effectiveness of the proposed method was tested on two real-world datasets and compared to the standard k-modes; experiments showed that it provided higher accuracy.

Keywords: Clustering categorical data · k-modes · Data Mining · Rough Set Theory

© Springer Nature Switzerland AG 2019. Y. Farhaoui and L. Moussaid (Eds.): ICBDSDE 2018, SBD 53, pp. 225–236, 2019. https://doi.org/10.1007/978-3-030-12048-1_24

S. Ben Salem et al.

1 Introduction

Clustering, also known as unsupervised learning, is a pattern recognition technique used in various fields including computer science and vision. It is a complex task: the final shape of the clusters cannot be determined in advance, and no specific parameters are required except the number of clusters K. The goal of clustering is to separate a finite unlabeled dataset into a finite and discrete set of clusters containing homogeneous observations. In the literature, many clustering methods have been proposed, either for numeric data, such as the k-means and its variants [1–5], or for categorical data, such as the k-modes and its variants [4, 6–14]. In all these approaches, the clustering process uses the centroid of each cluster to move from iteration i to iteration (i + 1) in order to improve the clustering accuracy. The centroid of a categorical dataset is called the mode and is built from the most frequent modality of each attribute, without taking into consideration the fact that two or more modalities may have the same frequency in an attribute. In that case, one of the tied modalities is selected at random. This random selection needs to be improved in order to provide a more appropriate way of selecting the most suitable centroids. In this paper we propose to extend the notion of the mode using the Rough Set Theory (RST) in order to define multiple modes when creating the set of candidate modes. This issue takes into accoun
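The tie problem described above can be illustrated with a short sketch. It is not the authors' method, which ranks candidates by relative density and arranges them into Rough Set approximations; it only enumerates every mode vector that a cluster admits when several modalities share the highest frequency in an attribute. The function name `candidate_modes` and the toy cluster are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def candidate_modes(cluster):
    """Return every mode vector obtainable by picking, for each attribute,
    one of its most frequent modalities (ties yield several candidates).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not cluster:
        return []
    per_attribute = []
    # zip(*cluster) iterates over attributes (columns) instead of objects (rows).
    for column in zip(*cluster):
        counts = Counter(column)
        top = max(counts.values())
        # Keep all modalities tied for the highest frequency, in sorted order.
        per_attribute.append(sorted(v for v, c in counts.items() if c == top))
    # The Cartesian product combines one tied modality per attribute.
    return [tuple(choice) for choice in product(*per_attribute)]


# Toy cluster with two categorical attributes (colour, size).
cluster = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("blue", "large"),
    ("red", "medium"),
]
modes = candidate_modes(cluster)
print(modes)
```

In this toy cluster, "red" is the single most frequent colour, while "small" and "large" are tied for size, so two candidate modes exist. A standard k-modes implementation would keep only one of them, which is the arbitrary choice the paper sets out to replace.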
