Combined clustering models for the analysis of gene expression

ELEMENTARY PARTICLES AND FIELDS Theory

Combined Clustering Models for the Analysis of Gene Expression*

M. Angelova** and J. Ellman***

Northumbria University, Newcastle upon Tyne, UK

Received April 22, 2009

Abstract—Clustering has become one of the fundamental tools for analyzing gene expression and producing gene classifications. Clustering models find patterns of similarity that help in understanding gene function, gene regulation, cellular processes and sub-types of cells. The clustering results, however, have to be combined with sequence data or knowledge about gene functionality in order to draw biologically meaningful conclusions. In this work, we explore a new model that integrates gene expression with sequence or text information.

DOI: 10.1134/S1063778810020067

* The text was submitted by the authors in English.
** E-mail: [email protected]
*** E-mail: [email protected]

1. INTRODUCTION

Life sciences are currently undergoing an information revolution as a result of the development of techniques and tools that allow biological information to be collected at a high level of detail and in large quantities. Microarray technology provides some of the most promising tools available to researchers today, as it allows the expression levels of thousands of genes to be measured simultaneously under controlled experimental conditions. The ability of this technology to take a snapshot of a whole gene expression pattern opens enormous possibilities. For example, DNA microarrays have been successfully used to study genome-wide patterns of gene expression [1–4] and are capable of providing fundamental insights into biological processes such as gene function and gene regulation [1, 2], cell cycle [1, 4], and cancer [2, 3].

The motivation for large-scale gene expression analysis lies in the central dogma of molecular biology [5, 6], which justifies the premise that information about the functional state of an organism is to a great extent determined by information on gene expression.

One of the most powerful automatic techniques for the analysis of high-throughput gene expression data is clustering [4]. It is the exploratory, unsupervised process of partitioning data into groups (clusters) by finding similarity patterns within gene expression data. An underlying assumption in clustering is that genes in a cluster are functionally related. This implies that many of the genes could also be coregulated and thus share transcription factor binding motifs in their upstream sequences [7].

Clustering results need to be evaluated against biologically significant information, such as previously known biological facts, theories and results. Biological and medical literature databases store such published information and can be used to cross-reference experimental and analytical results, and even drive the interpretation and organization of the expression data [8, 9].

In this paper we discuss a combined model that integrates gene expression results with sequence data and published knowledge about gene functionalities in orde