Cluster Structure Inference Based on Clustering Stability with Applications to Microarray Data Analysis

PDF / 918,137 Bytes
17 Pages / 600 x 792 pts Page_size
48 Downloads / 224 Views

Cluster Structure Inference Based on Clustering Stability with Applications to Microarray Data Analysis ˘ Ciprian Doru Giurcaneanu Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland Email: [email protected]

˘ ¸ Ioan Tabus Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, FIN-33101 Tampere, Finland Email: [email protected] Received 28 February 2003; Revised 7 July 2003 This paper focuses on the stability-based approach for estimating the number of clusters K in microarray data. The cluster stability approach amounts to performing clustering successively over random subsets of the available data and evaluating an index which expresses the similarity of the successive partitions obtained. We present a method for automatically estimating K by starting from the distribution of the similarity index. We investigate how the selection of the hierarchical clustering (HC) method, respectively, the similarity index, influences the estimation accuracy. The paper introduces a new similarity index based on a partition distance. The performance of the new index and that of other well-known indices are experimentally evaluated by comparing the “true” data partition with the partition obtained at each level of an HC tree. A case study is conducted with a publicly available Leukemia dataset. Keywords and phrases: clustering stability, number of clusters, hierarchical clustering methods, similarity indices, partitiondistance, microarray data.

1.

INTRODUCTION

The clustering algorithms are frequently used for analyzing the microarray data. While various clustering methods help the practitioner in bioinformatics to ascertain diﬀerent characteristics in structural organization of microarray datasets, the task of selecting the most appropriate algorithm for solving a particular problem is nontrivial. While various clustering methods are applied in hundreds of microarray research papers, a question arises frequently, namely, how to compare two diﬀerent partitions of the same dataset obtained by two diﬀerent algorithms. The comparison becomes more diﬃcult when the two partitions do not contain the same number of clusters. The accurate estimation for the number of clusters K is essential because most of the existing clustering procedures request K as input. The robustness of the clustering algorithms is usually studied by investigating their stability with respect to perturbations changing the original dataset, for example, by drawing random subsets or by artificially adding noise [1]. The stability methods can be also used in exploratory data anal-

ysis when little prior information is available regarding the dataset, which is generally the case with microarray data. The main principle is to randomly split the dataset and cluster each subset independently, and then to check the stability (or degree of agreement) of the two obtained partitions. The clustering is stable if the cluster memberships inferred in the two subsets are similar to the memberships in

Data Loading...

Cluster Structure Inference Based on Clustering Stability with Applications to Microarray Data Analysis

Recommend Documents

Microarray Data Analysis Methods and Applications

Microarray Data Analysis

Cancer Identification Based on DNA Microarray Data

Applications of Emerging Patterns for Microarray Gene Expression Data Analysis

Stability-based Validation of Clustering

The Local Maximum Clustering Method and Its Application in Microarray Gene Expression Data Analysis

Bayesian Inference with Geodetic Applications

Cluster Stability

Grouping-Aware Data Placement in HDFS for Data-Intensive Applications Based on Graph Clustering

Non-hierarchical Clustering for Large Data Without Recalculating Cluster Center

Statistical inference for Markov chains with applications to credit risk

Fundamentals of Uncertainty Calculi with Applications to Fuzzy Inference