A Novel Metric for Redundant Gene Elimination Based on Discriminative Contribution




1 Institute of System Biology, Shanghai University, Shanghai 200444, China
2 School of Computer Engineering and Science, Shanghai University, Shanghai 200072, China [email protected]
3 Harvard Medical School, Harvard University, Cambridge, Massachusetts 02140-0888 USA
4 National Human Genome Research Institute, National Institutes of Health (NIH), U.S. Department of Health and Human Services, Bethesda, MD 20852 USA

Abstract. Microarray data analysis is a hard, high-dimensional problem in which many weakly relevant but redundant features hurt the generalization performance of classifiers. Previous work handles this problem with linear or nonlinear filters, but these filters do not use the label information to account for the discriminative contribution of each feature. Here we propose a novel metric based on discriminative contribution for redundant feature elimination. Under the new metric, complementary features are likely to be retained, which benefits the final classification. Experimental results on several microarray data sets show that the proposed metric outperforms previous state-of-the-art linear and nonlinear metrics for redundant feature elimination in microarray data analysis.

1 Introduction

Rapid advances in gene expression microarray technology enable the simultaneous measurement of expression levels for thousands or tens of thousands of genes in a single experiment [1]. Analysis of microarray data presents unprecedented opportunities and challenges for data mining in areas such as gene clustering, class discovery, and sample classification [2,3,4]. In sample classification, a microarray data set is provided as a training set of labeled samples, and the task is to build a classifier that accurately predicts the classes of novel unlabeled samples. A typical data set has thousands of genes but only a small number of samples (often fewer than a hundred). The number of samples is likely to remain small, at least for the near future, because microarray samples are expensive to collect [5]. This combination of high dimensionality and small sample size causes the well-known "curse of dimensionality". Selecting a small number of discriminative genes from the thousands available is therefore essential for successful sample classification.

Corresponding author.

I. Măndoiu, R. Sunderraman, and A. Zelikovsky (Eds.): ISBRA 2008, LNBI 4983, pp. 256–267, 2008. © Springer-Verlag Berlin Heidelberg 2008


Feature selection, the process of choosing a subset of the original features, is frequently used as a preprocessing technique in data mining. It has proven effective in reducing dimensionality, improving mining efficiency, increasing mining accuracy, and enhancing result comprehensibility [6]. In bioinformatics, the most commonly used feature selection (gene selection) procedures are based on a score which is calculated for all g