Rough subspace-based clustering ensemble for categorical data


METHODOLOGIES AND APPLICATION

Can Gao • Witold Pedrycz • Duoqian Miao

Published online: 10 January 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract Clustering categorical data, an important problem in data mining, has recently attracted much attention. In this paper, the problem of unsupervised dimensionality reduction for categorical data is first studied. Based on the theory of rough sets, the attributes of categorical data are decomposed into a number of rough subspaces. A novel clustering ensemble algorithm based on rough subspaces is then proposed to deal with categorical data. The algorithm uses a selection of high-quality rough subspaces to cluster the data and yields a robust and stable solution by exploiting the resulting partitions. We also introduce a cluster index to evaluate the solutions produced by clustering algorithms for categorical data. Experimental results on selected UCI data sets show that the proposed method produces better results than other methods when evaluated in terms of cluster validity indexes.

Keywords Categorical data • Rough sets • Fuzzy k-modes • Clustering ensemble • Cluster cardinality index
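The pipeline outlined in the abstract (cluster the data in several attribute subspaces, then combine the resulting partitions) can be illustrated with a minimal sketch. It is not the authors' algorithm: randomly chosen attribute subsets stand in for rough subspaces, hard k-modes replaces fuzzy k-modes, and a co-association matrix with a fixed threshold is assumed as the consensus step. All function names and parameters (`k_modes`, `co_association`, `consensus`, `subspace_ensemble`, `n_members`, `subspace_size`, `threshold`) are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(rows, k, rng, iters=20):
    """Hard k-modes (assumption: the paper uses fuzzy k-modes instead)."""
    unique = sorted(set(rows))
    modes = rng.sample(unique, min(k, len(unique)))  # distinct initial modes
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every object to its nearest mode under Hamming distance.
        labels = [min(range(len(modes)), key=lambda c: hamming(r, modes[c]))
                  for r in rows]
        new_modes = []
        for c in range(len(modes)):
            members = [r for r, l in zip(rows, labels) if l == c]
            if not members:  # keep the old mode for an empty cluster
                new_modes.append(modes[c])
                continue
            # New mode: most frequent value of each attribute in the cluster.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    return labels


def co_association(partitions, n):
    """Fraction of partitions in which objects i and j share a cluster."""
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus(matrix, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def subspace_ensemble(rows, k, n_members=10, subspace_size=2, seed=0):
    """Cluster random attribute subsets (a stand-in for rough subspaces)."""
    rng = random.Random(seed)
    n_attrs = len(rows[0])
    partitions = []
    for _ in range(n_members):
        attrs = rng.sample(range(n_attrs), subspace_size)
        projected = [tuple(r[a] for a in attrs) for r in rows]
        partitions.append(k_modes(projected, k, rng))
    return consensus(co_association(partitions, len(rows)))


if __name__ == "__main__":
    data = [("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
            ("b", "y", "q"), ("b", "y", "p"), ("b", "x", "q")]
    print(subspace_ensemble(data, k=2))
```

The consensus step here may return more or fewer than `k` clusters, since it only groups objects that were co-clustered in more than half of the ensemble members; a different consensus function would change that behaviour.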

Communicated by A. Di Nola.

C. Gao • D. Miao
Department of Computer Science and Technology, Tongji University, Shanghai 201804, People’s Republic of China

C. Gao (corresponding author) • W. Pedrycz
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2G7, Canada
e-mail: [email protected]

W. Pedrycz
System Research Institute, Polish Academy of Sciences, Warsaw, Poland

1 Introduction

Clustering is a form of unsupervised learning, applied when the data at hand are unlabeled. The essence of clustering is to partition a given set of unlabeled data into several clusters such that the objects within the same cluster are similar to each other but quite dissimilar from those in other clusters. A large variety of clustering algorithms, such as C-means (Ball and Hall 1967; Anderberg 1973; Jain 2010) and fuzzy C-means (FCM) (Bezdek 1981; Pedrycz 1996; Bargiela and Pedrycz 2005; Pedrycz et al. 2010), have been proposed and widely used in real-world domains including data mining, information retrieval, machine learning and many others (Jain and Dubes 1988; Pedrycz 2005). However, clustering is a demanding combinatorial optimization task, and no single clustering algorithm is capable of delivering sound solutions for all data sets. Clustering ensemble (Ghaemi et al. 2009; Li et al. 2010; Vega-Pons and Ruiz-Shulcloper 2011), inspired by the idea of classifier ensemble encountered in supervised learning, has emerged as a technique for overcoming the limitations of individual clustering algorithms with respect to robustness (Topchy et al. 2005), stability (Kuncheva and Vetrov 2006), parallelization (Tumer and Agogino 2008), and scalability (Hore et al. 2009), and has consequently found applications in bioinformatics (Monti et al. 2003; Yu et al. 2007, 2011), image segmentation (J
