Covariance-Model-Based RNA Gene Finding: Using Dynamic Programming versus Evolutionary Computing

This chapter compares the traditional dynamic programming RNA gene finding methodology with an alternative evolutionary computation approach. Both methods take a set of estimated covariance model parameters for a non-coding RNA family as given. The differe

S.F. Smith: Covariance-Model-Based RNA Gene Finding: Using Dynamic Programming versus Evolutionary Computing, Studies in Computational Intelligence (SCI) 94, 183–208 (2008) © Springer-Verlag Berlin Heidelberg 2008 www.springerlink.com

1 Introduction

The initial focus of interpreting the output of sequencing projects such as the Human Genome Project [1] has been on annotating those portions of the genome sequences that code for proteins. More recently, it has been recognized that many significant regulatory and catalytic functions can be attributed to RNA transcripts that are never translated into protein products [2]. These functional RNA (fRNA) or non-coding RNA (ncRNA) molecules have genes which require an entirely different approach to gene search than protein-coding genes.

Protein-coding genes are usually detected by gene finding algorithms that generically search for putative gene locations and then later classify these genes into families. As an example, putative protein-coding genes could be identified using the GENESCAN program [3]. Classification of these putative protein-coding genes could then be done using profile hidden Markov models (HMMs) [4] to yield families of proteins (or protein domains) such as those in Pfam [5]. It is not necessary to scan entire genomes with an HMM, since the gene finding algorithm has already narrowed the search to a small subset of the genome identified as possible protein-coding gene locations.

Unlike protein-coding genes, RNA genes are not associated with promoter regions and open reading frames. As a result, direct search for RNA genes using only generic characteristics has not been successful [6]. Instead, a combined RNA gene finding and gene family classification is undertaken using models of a gene family for database search over entire genomes. This has the disadvantage that RNA genes belonging to entirely novel families will not be found, but it is the only currently available method that works. It also means that the amount of genetic information that needs to be processed by the combined gene finder and classifier is much larger than for protein classifiers.

Functional RNA is made of single-stranded RNA with intramolecular base pairing. Whereas protein-coding RNA transcripts (mRNA) are primarily information carriers, functional RNA often depends on its three-dimensional shape for the performance of its task. This results in conservation of three-dimensional structure, but not necessarily of primary sequence. The three-dimensional shape of an RNA molecule is almost entirely determined by the intramolecular base pairing pattern of the molecule's nucleotides. There are many examples of RNA families with very little primary sequence homology, but very well conserved secondary structure (see pp. 264–265 in [7]). It is very difficult to find RNA genes without taking conservation of secondary structure into account. Most homology search algorithms such as BLAST [8], FASTA [9], Smith-Waterman [10], and profile HMMs only model primary sequence and are therefore not well suited for RNA gene