Segmentation of DNA into Coding and Noncoding Regions Based on Recursive Entropic Segmentation and Stop-Codon Statistics

Daniel Nicorici
Tampere International Center for Signal Processing, Tampere University of Technology, P.O. Box 553, Tampere FIN-33101, Finland
Email: [email protected]

Jaakko Astola
Tampere International Center for Signal Processing, Tampere University of Technology, P.O. Box 553, Tampere FIN-33101, Finland
Email: [email protected]

Received 28 February 2003; Revised 15 September 2003

Heterogeneous DNA sequences can be partitioned into homogeneous domains that are characterized by the composition of the four nucleotides A, C, G, and T and of the stop codons. We recursively apply a new entropic segmentation method to DNA sequences, using the Jensen-Shannon and Jensen-Rényi divergences to find the borders between coding and noncoding DNA regions. We have chosen 12- and 18-symbol alphabets that capture (i) the differential nucleotide composition in codons and (ii) the differential stop-codon composition along all three phases in both strands of the DNA. The new segmentation method is based on the Jensen-Rényi divergence measure, nucleotide statistics, and stop-codon statistics in both DNA strands. The recursive segmentation process requires no prior training on known datasets. For three entire bacterial genomes, we find that the use of nucleotide composition, stop-codon composition, and the Jensen-Rényi divergence improves the accuracy of finding the borders between coding and noncoding regions in DNA sequences.

Keywords and phrases: recursive segmentation, DNA sequence, information divergence measures, statistics of stop codons, Bayesian information criterion.
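As an illustration of the entropic segmentation idea summarized above, the following minimal Python sketch scores every candidate cut of a sequence by the length-weighted Jensen-Shannon divergence between its left and right parts and returns the best one. It is a simplification under stated assumptions, not the method of this paper: it uses a plain 4-symbol nucleotide alphabet rather than the 12- or 18-symbol alphabets, the Jensen-Shannon rather than the Jensen-Rényi divergence, and no stopping criterion. The function names `shannon_entropy`, `js_divergence`, and `best_cut` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def js_divergence(seq, cut):
    """Length-weighted Jensen-Shannon divergence between seq[:cut] and seq[cut:].

    Assumption: plain symbol frequencies over whatever alphabet seq uses
    (here A, C, G, T), not the codon-phase alphabets described in the abstract.
    """
    n = len(seq)
    left, right = seq[:cut], seq[cut:]
    return (shannon_entropy(seq)
            - (len(left) / n) * shannon_entropy(left)
            - (len(right) / n) * shannon_entropy(right))


def best_cut(seq):
    """Return (cut, divergence) for the interior cut with maximal divergence.

    Brute force, quadratic in len(seq); adequate only for short examples.
    """
    return max(((i, js_divergence(seq, i)) for i in range(1, len(seq))),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    toy = "A" * 20 + "C" * 20  # two compositionally homogeneous domains
    print(best_cut(toy))
```

A recursive segmentation would apply `best_cut` again to each of the two resulting parts for as long as the split is judged significant; the significance test is deliberately left out of this sketch.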

1. INTRODUCTION

The computational identification of genes and coding regions in DNA sequences is a major, long-standing goal of molecular biology, especially for the human genome project [1, 2]. One of the main goals of the human genome project is to provide a complete list of annotated genes for use in biomedical research. Methods that reliably identify genes in anonymous DNA sequences can speed up this process. A number of such methods exist, but their predictive performance for finding genes is still not satisfactory [3]. There are two basic problems in gene finding: detection of protein-binding sites of the genes and detection of regions that code for proteins. These problems are still not satisfactorily solved, and the reliable detection of genes and coding regions in DNA sequences is critical for the success of computational gene discovery from annotated genome sequences [4]. In this study, we address the problem of finding the regions in DNA sequences that code for proteins.

Almost everything in the organism of living beings is made of proteins. According to the central dogma that forms the backbone of molecular biology, the DNA codes for the production of messenger RNA (mRNA) during the transcription process. The ribosomes “read” this information and use it for protein synthesis during