Segmentation of DNA into Coding and Noncoding Regions Based on Recursive Entropic Segmentation and Stop-Codon Statistics


Daniel Nicorici
Tampere International Center for Signal Processing, Tampere University of Technology, P.O. Box 553, Tampere FIN-33101, Finland
Email: [email protected]

Jaakko Astola
Tampere International Center for Signal Processing, Tampere University of Technology, P.O. Box 553, Tampere FIN-33101, Finland
Email: [email protected]

Received 28 February 2003; Revised 15 September 2003

Heterogeneous DNA sequences can be partitioned into homogeneous domains that are comprised of the four nucleotides A, C, G, and T and the stop codons. We recursively apply a new entropic segmentation method to DNA sequences, using the Jensen-Shannon and Jensen-Rényi divergences to find the borders between coding and noncoding DNA regions. We have chosen 12- and 18-symbol alphabets that capture (i) the differential nucleotide composition in codons and (ii) the differential stop-codon composition along all three phases in both strands of the DNA. The new segmentation method is based on the Jensen-Rényi divergence measure, nucleotide statistics, and stop-codon statistics in both DNA strands. The recursive segmentation process requires no prior training on known datasets. For three entire bacterial genomes, we find that the use of nucleotide composition, stop-codon composition, and the Jensen-Rényi divergence improves the accuracy of finding the borders between coding and noncoding regions in DNA sequences.

Keywords and phrases: recursive segmentation, DNA sequence, information divergence measures, statistics of stop codons, Bayesian information criterion.
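As a rough illustration of the entropic segmentation idea summarized in the abstract, the sketch below scores every candidate cut of a sequence by the length-weighted Jensen-Shannon divergence between its left and right parts and returns the best-scoring cut. It is a simplification under stated assumptions: it uses the plain 4-symbol nucleotide alphabet rather than the 12- and 18-symbol alphabets, and it omits the Jensen-Rényi variant, the recursion, and any stopping criterion. The function names (`shannon_entropy`, `js_divergence`, `best_split`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def js_divergence(left, right):
    """Length-weighted Jensen-Shannon divergence between two subsequences."""
    n_l, n_r = len(left), len(right)
    n = n_l + n_r
    c_l, c_r = Counter(left), Counter(right)
    h_all = shannon_entropy(c_l + c_r, n)   # entropy of the pooled sequence
    h_l = shannon_entropy(c_l, n_l)         # entropy of the left part
    h_r = shannon_entropy(c_r, n_r)         # entropy of the right part
    return h_all - (n_l / n) * h_l - (n_r / n) * h_r


def best_split(seq):
    """Return (position, divergence) of the cut with maximal divergence.

    Recomputes the counts for every cut, so the cost is quadratic in the
    sequence length; acceptable for a sketch, not for whole genomes.
    """
    best_pos, best_d = None, -1.0
    for i in range(1, len(seq)):
        d = js_divergence(seq[:i], seq[i:])
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d
```

A recursive scheme would apply `best_split` again to each of the two resulting parts for as long as the divergence at the best cut is judged significant; how that significance is decided is not covered by this sketch.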

1. INTRODUCTION

The computational identification of genes and coding regions in DNA sequences is a major goal and a long-standing topic in molecular biology, especially for the human genome project [1, 2]. One of the main goals of the human genome project is to provide a complete list of annotated genes for use in biomedical research. Methods for the reliable identification of genes in anonymous DNA sequences can speed up this process. A number of such methods exist, but their predictive performance for finding genes is still not satisfactory [3]. There are two basic problems in gene finding: detection of the protein-binding sites of genes and detection of the regions that code for proteins. Neither problem has been satisfactorily solved, and the reliable detection of genes and coding regions in DNA sequences is critical for the success of computational gene discovery from annotated genome sequences [4]. In this study, we address the problem of finding the regions in DNA sequences that code for proteins.

Almost everything in the organism of living beings is made of proteins. According to the central dogma that forms the backbone of molecular biology, the DNA codes for the production of messenger RNA (mRNA) during the transcription process. The ribosomes “read” this information and use it for protein synthesis during
