A Digital Signal Processing Method for Gene Prediction with Improved Noise Suppression

Trevor W. Fox
Research and Development Department, Intelligent Engines Corporation, 903 42 St. SW, Calgary, Alberta, Canada T3C-1Y9
Email: [email protected]

Alex Carreira
Department of Electrical and Computer Engineering, University of Calgary, 2500 University Drive N.W., Calgary, Alberta, Canada T2N 1N4
Email: [email protected]

Received 1 March 2003; Revised 15 September 2003

It has been observed that the protein-coding regions of DNA sequences exhibit period-three behaviour, which can be exploited to predict the location of coding regions within genes. Previously, discrete Fourier transform (DFT) and digital filter-based methods have been used for the identification of coding regions. However, these methods do not significantly suppress the noncoding regions in the DNA spectrum at 2π/3. Consequently, a noncoding region may inadvertently be identified as a coding region. This paper introduces a new technique (a single digital filter operation followed by a quadratic window operation) that suppresses nearly all of the noncoding regions. The proposed method therefore improves the likelihood of correctly identifying coding regions in such genes.

Keywords and phrases: gene prediction, digital filter, DNA.
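As an illustration of the period-three property mentioned above, the following is a minimal sketch of the conventional DFT-based measure, not the filter-plus-quadratic-window technique proposed in this paper. It assumes the common binary-indicator representation (one 0/1 sequence per nucleotide) and evaluates the spectrum at the single frequency 2π/3. The function names `period3_power` and `sliding_period3`, as well as the default window length and step, are hypothetical choices made for this example.

```python
import cmath


def period3_power(seq):
    """Spectral power of a DNA string at angular frequency 2*pi/3.

    Assumption: each base is mapped to a binary indicator sequence
    (1 where the base occurs, 0 elsewhere), and the squared DFT
    magnitudes of the four indicator sequences are summed.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        acc = 0j
        for n, letter in enumerate(seq):
            if letter == base:
                # Contribution of position n to the DFT at 2*pi/3.
                acc += cmath.exp(-2j * cmath.pi * n / 3)
        total += abs(acc) ** 2
    return total


def sliding_period3(seq, window=351, step=3):
    """Evaluate period3_power over a sliding window.

    The window length and step are illustrative defaults only.
    Returns one power value per window position.
    """
    last_start = len(seq) - window
    return [period3_power(seq[i:i + window])
            for i in range(0, last_start + 1, step)]


if __name__ == "__main__":
    # A sequence with exact period three versus one with period two.
    print(period3_power("ATG" * 50))
    print(period3_power("AT" * 75))
```

A strictly period-three string such as a repeated codon concentrates its energy at 2π/3, whereas a period-two string of the same length contributes essentially nothing there. Real noncoding regions are not this clean, which is the residual "noise" the abstract refers to.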

1. INTRODUCTION

Finding coding regions (exons) in a DNA strand involves searching amongst the many nucleotides that make up the strand. Typically, a DNA molecule contains millions to hundreds of millions of elements [1]. The problem of finding exons in a DNA sequence is well suited to computers because DNA sequences can be represented by data that is easily processed by a computer. DNA strands can be represented by sequences of letters from a four-character alphabet. By convention, the letters A, T, C, and G represent the four distinct nucleotides, one letter per element [1].

A nucleotide has two distinct ends: a 3′ end and a 5′ end. A covalent chemical bond links the 5′ end of one nucleotide to the 3′ end of another nucleotide. A DNA strand is composed of many nucleotides linked in this fashion [1]. The DNA sequence representing a DNA strand consists of the letters A, T, C, and G listed from left to right, corresponding to the nucleotides that make up the strand arranged from their 5′ to 3′ ends [1].

A DNA strand can be divided into genes and intergenic spaces. Genes are responsible for protein synthesis. A gene can be further subdivided into exons and introns for cells with a nucleus (eukaryotes) [2]. Cells without a nucleus are called prokaryotes and do not contain introns [2]. The exons, the coding regions within genes, are denoted by start and stop codons. A codon is a subsequence of three letters within the DNA sequence. Because codons are composed of three letters from the four-letter alphabet that makes up a DNA sequence, there are 4³ = 64 possible codons [1]. Of the 64 possible codons, there is one start codon and three stop codons, and the remainder of the codons correspond to one of the tw