Autoregressive Modeling and Feature Analysis of DNA Sequences
Niranjan Chakravarthy
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA

A. Spanias
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA

L. D. Iasemidis
Harrington Department of Bioengineering, Arizona State University, Tempe, AZ 85287-9709, USA

K. Tsakalis
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, USA

Received 28 February 2003; Revised 15 September 2003

A parametric signal processing approach for DNA sequence analysis based on autoregressive (AR) modeling is presented. AR model residual errors and AR model parameters are used as features. The AR residual error analysis indicates a high specificity of coding DNA sequences, while AR feature-based analysis helps distinguish between coding and noncoding DNA sequences. An AR model-based string searching algorithm is also proposed. The effect of several types of numerical mapping rules in the proposed method is demonstrated.

Keywords and phrases: DNA, autoregressive modeling, feature analysis.
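As a rough illustration of the two feature types named in the abstract (AR model parameters and AR residual error), the following minimal sketch maps a DNA string to numbers and fits an AR model by ordinary least squares. The mapping values, the function names (`map_sequence`, `fit_ar`, `ar_features`), the model order, and the least-squares estimator are all assumptions made for illustration; they are not taken from the paper, which studies several mapping rules of its own.

```python
import numpy as np

# Hypothetical numerical mapping (an assumption, not necessarily one of the paper's rules).
MAPPING = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def map_sequence(dna, mapping=MAPPING):
    """Convert a DNA string to a zero-mean numeric signal."""
    x = np.array([mapping[base] for base in dna.upper()], dtype=float)
    return x - x.mean()


def fit_ar(x, order):
    """Least-squares fit of x[n] = sum_k a[k] * x[n-k-1] + e[n].

    Returns the coefficient vector a (length `order`) and the residual e.
    """
    n = len(x)
    # Column k holds the signal delayed by k+1 samples.
    X = np.column_stack([x[order - k - 1 : n - k - 1] for k in range(order)])
    y = x[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ a
    return a, residual


def ar_features(dna, order=4):
    """Return (AR coefficients, mean squared residual) for a DNA string."""
    x = map_sequence(dna)
    a, e = fit_ar(x, order)
    return a, float(np.mean(e ** 2))
```

Calling `ar_features(segment, order=4)` on a DNA segment would return a coefficient vector and a mean squared residual that could be compared across segments. Whether such values separate coding from noncoding regions depends on the mapping rule and model order, which this sketch does not attempt to reproduce.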
1. INTRODUCTION
A complete understanding of cell functionality depends primarily on the various cell activities carried out by proteins. Information for the formation and activity of these proteins is coded in deoxyribonucleic acid (DNA) sequences. For detection purposes, the vast amount of genomic data makes it necessary to define models for DNA segments such as the protein-coding regions. Such models can also facilitate our understanding of the stored information and could provide a basis for the functional analysis of DNA. Since DNA is a discrete sequence, it can be interpreted as a discrete categorical or symbolic sequence, and hence digital signal processing (DSP) techniques could be used for DNA sequence analysis. The DNA sequence analysis problem can be considered analogous to some forms of speech recognition problems. That is, coding and noncoding regions in DNA need to be identified from long nucleotide sequences, a process that bears some similarities to the problem of identifying phonemes from long sequences of speech signal samples.

Currently proposed DSP techniques include the study of the spectral characteristics [1, 2, 3, 4] and the correlation structure [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] of DNA sequences. In most cases, spectra have been measured with nonparametric Fourier transform techniques [1]. In some of the most common cases, the presence of a spectral peak [1] was used to characterize protein-coding regions in the DNA. Correlations, on the other hand, have often been characterized on the basis of the extent of power-law (long-range) behavior and the persistence of the power-law correlation sequence [6, 8]. Attempts have also been made to parameterize these correlations in terms of the scale of the power law [