Rough subspace-based clustering ensemble for categorical data
METHODOLOGIES AND APPLICATION
Can Gao • Witold Pedrycz • Duoqian Miao
Published online: 10 January 2013
© Springer-Verlag Berlin Heidelberg 2013
Abstract Clustering categorical data is an important problem in data mining that has recently attracted much attention. In this paper, the problem of unsupervised dimensionality reduction for categorical data is studied first. Based on the theory of rough sets, the attributes of categorical data are decomposed into a number of rough subspaces. A novel clustering ensemble algorithm based on rough subspaces is then proposed to deal with categorical data. The algorithm clusters the data in a selection of high-quality rough subspaces and yields a robust and stable solution by exploiting the resulting partitions. We also introduce a cluster index to evaluate the solutions produced by clustering algorithms for categorical data. Experimental results on selected UCI data sets show that the proposed method produces better results than other methods when evaluated in terms of cluster validity indexes.
Keywords Categorical data • Rough sets • Fuzzy k-modes • Clustering ensemble • Cluster cardinality index
Communicated by A. Di Nola.
C. Gao • D. Miao
Department of Computer Science and Technology, Tongji University, Shanghai 201804, People’s Republic of China
C. Gao (✉) • W. Pedrycz
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2G7, Canada
e-mail: [email protected]
W. Pedrycz
System Research Institute, Polish Academy of Sciences, Warsaw, Poland
1 Introduction
Clustering is a form of unsupervised learning used when the data at hand are unlabeled. The essence of clustering is to partition a given set of unlabeled data into several clusters such that the objects within the same cluster are similar to each other but quite dissimilar from those in other clusters. A large variety of clustering algorithms, such as C-means (Ball and Hall 1967; Anderberg 1973; Jain 2010) and fuzzy C-means (FCM) (Bezdek 1981; Pedrycz 1996; Bargiela and Pedrycz 2005; Pedrycz et al. 2010), have been proposed and widely used in real-world domains including data mining, information retrieval, machine learning and many others (Jain and Dubes 1988; Pedrycz 2005). Clustering is, however, a demanding combinatorial optimization task, and no single clustering algorithm is capable of delivering sound solutions for all data sets. The clustering ensemble (Ghaemi et al. 2009; Li et al. 2010; Vega-Pons and Ruiz-Shulcloper 2011), inspired by the classifier ensemble encountered in supervised learning, has emerged as a technique for overcoming the limitations of individual clustering algorithms in terms of robustness (Topchy et al. 2005), stability (Kuncheva and Vetrov 2006), parallelization (Tumer and Agogino 2008), and scalability (Hore et al. 2009), and has consequently found applications in bioinformatics (Monti et al. 2003; Yu et al. 2007, 2011), image segmentation (J