Data Analysis and Bioinformatics




Abstract. Data analysis methods and techniques are revisited in the case of biological data sets. Particular emphasis is given to clustering and mining issues. Clustering is still a subject of active research in several fields such as statistics, pattern recognition, and machine learning. Data mining adds to clustering the complications of very large data sets with many attributes of different types, which is a typical situation in biology. Some case studies are also described.

Keywords: Clustering, data mining, bio-informatics, Kernel methods, Hidden Markov Models, Multi-Layers Model.

1 Introduction

Bio-informatics is a new discipline devoted to the solution of biological problems, usually at the molecular level, using techniques from applied mathematics, statistics, computer science, and artificial intelligence. Major research efforts concern sequence alignment [1], gene finding [2], genome assembly, protein structure alignment [3] and prediction [4], prediction of gene expression, protein-protein interactions, and the modeling of evolution [5].

Mining in structured data is particularly relevant for bio-informatics applications, since the majority of biological data is not kept in databases consisting of a single, flat table [6]. In fact, bio-informatics databases (BDB) consist of structured and linked objects, connected by relations representing a rich internal structure. Examples of BDB are databases of proteins [7], of small molecules [8], and of metabolic and regulatory networks [9]. Moreover, biological data representations are structured and heterogeneous; they consist of large sequences (e.g. 10^6 gene sequences), large 2D structures (e.g. 10^5 to 10^6 spots on DNA chips), 3D structures (e.g. the DNA phosphate model, Figure 1a), graphs, networks, expression profiles, and phylogenetic trees (Figure 1b).

Several issues arise in mining biological data; among the approaches that address them are kernel methods for the classification of microarray time series data [10]. This classification of gene expression time series has many potential applications in medicine and pharmacogenomics, such as disease diagnosis, drug response prediction, or disease outcome prognosis, contributing to individualized medical treatment. Graph kernel representations of proteins have been designed to retrieve structural and bio-chemical information and to predict protein function. Feature graphs are used to represent potential docking sites and to retrieve activity maps from 3D protein databases.

A. Ghosh, R.K. De, and S.K. Pal (Eds.): PReMI 2007, LNCS 4815, pp. 373–388, 2007.
© Springer-Verlag Berlin Heidelberg 2007
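
As a rough illustration of the kernel-based classification of expression time series mentioned in the Introduction, the sketch below compares short expression profiles with a Gaussian (RBF) kernel and assigns a query profile to the class with the highest mean kernel similarity. This is not the method of [10]; the kernel choice, the decision rule, the parameter `gamma`, the function names, and the toy data are all assumptions made only for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two equal-length expression profiles.

    gamma is a hypothetical width parameter chosen for this toy example.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of time points")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gram_matrix(profiles, gamma=0.5):
    """Pairwise kernel values (Gram matrix) for a list of profiles."""
    return [[rbf_kernel(p, q, gamma) for q in profiles] for p in profiles]


def classify_by_mean_similarity(train, labels, query, gamma=0.5):
    """Return the label whose training profiles are, on average,
    most similar to the query under the RBF kernel."""
    per_class = {}
    for profile, label in zip(train, labels):
        per_class.setdefault(label, []).append(rbf_kernel(profile, query, gamma))
    return max(per_class, key=lambda lab: sum(per_class[lab]) / len(per_class[lab]))


# Toy data (invented): four time points per gene, two response classes.
train = [
    [0.1, 0.5, 0.9, 1.2],  # rising expression
    [0.0, 0.4, 1.0, 1.1],  # rising expression
    [1.0, 0.7, 0.3, 0.1],  # falling expression
    [1.1, 0.6, 0.2, 0.0],  # falling expression
]
labels = ["up", "up", "down", "down"]
query = [0.2, 0.5, 0.8, 1.0]

gram = gram_matrix(train)
predicted = classify_by_mean_similarity(train, labels, query)
print(predicted)
```

In practice the Gram matrix would typically be passed to a kernel classifier such as a support vector machine; the mean-similarity rule is used here only to keep the sketch free of external dependencies.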


V. Di Gesù

Fig. 1. (a) 3D structure of the DNA phosphate model; (b) an example of a phylogenetic tree

The concept of similarity plays a relevant role in both 2D and 3D shape-matching search in bio-molecular databases. For example, similar 3D shapes can be retrieved by using a similarity model based on 3D shape histograms, 3D surface segments, and parametric surface functions including paraboloid and trigonometric polynomials that approx