Clustering Time Series Gene Expression Data Based on Sum-of-Exponentials Fitting

Ciprian Doru Giurcăneanu Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland Email: [email protected]

Ioan Tăbuş Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland Email: [email protected]

Jaakko Astola Institute of Signal Processing, Tampere University of Technology, P.O. Box 553, 33101 Tampere, Finland Email: [email protected]

Received 8 June 2004; Revised 26 October 2004; Recommended for Publication by Xiaodong Wang

This paper presents a method based on fitting a sum-of-exponentials model to the nonuniformly sampled data, for clustering the time series of gene expression data. The structure of the model is estimated by using the minimum description length (MDL) principle for nonlinear regression, in a new form, incorporating a normalized maximum-likelihood (NML) model for a subset of the parameters. The performance of the structure estimation method is studied using simulated data, and the superiority of the new selection criterion over earlier criteria is demonstrated. The accuracy of the nonlinear estimates of the model parameters is analyzed with respect to the Cramér-Rao lower bounds. Clustering examples of gene expression data sets from a developmental biology application are presented, revealing gene grouping into clusters according to functional classes.

Keywords and phrases: nonuniformly sampled data, sum-of-exponentials model, normalized maximum likelihood, time series clustering, gene expression data, developmental biology.

1. INTRODUCTION

Gene expression time profiles are a rich source of information about the dynamics of the underlying genomic network. The measurements are often taken at nonuniform time points, suggested by the biologist's intuition about the time scale of the important changes in the analyzed biological process, for example, a developmental process or the administration of a drug. Clustering the time profiles of the thousands of genes recorded by microarrays is a very important exploratory problem, for which several methods have been proposed in the past [1, 2, 3]. Most of the existing methods, whether heuristically motivated or model-based [4], do not make use of the time values at which the measurements have been taken, losing potentially useful information regarding the analyzed waveforms. Some approaches that take into account the temporal structure in gene expression data are based on hidden Markov models [5], spline approximation [6], or analysis of temporal variation [7]. In [8], an autoregressive model is used for the gene expression time series, and the clustering is performed with a Bayesian criterion which measures the similarity between two time series. A comprehensive study of various clustering methods applied to time series gene expression data can be found in [9]. A general methodology for modelling the time series collected at no