Estimating evolutionary distances between genomic sequences from spaced-word matches


RESEARCH

Open Access

Burkhard Morgenstern1,2*, Bingyao Zhu3, Sebastian Horwege1 and Chris André Leimeister1

Abstract

Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.

Keywords: k-mers, Spaced words, Alignment-free, Phylogeny, Word frequency, Distance estimation, Variance, Genome comparison

*Correspondence: [email protected]
1 University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37073 Göttingen, Germany
2 Université d’Evry Val d’Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA 23 Boulevard de France, 91037 Evry, France
Full list of author information is available at the end of the article

Background

Alignment-free methods are increasingly used for DNA and protein sequence comparison since they are much faster than traditional alignment-based approaches [1]. Applications of alignment-free approaches include protein classification [2-5], read alignment [6-8], isoform quantification from RNAseq reads [9], sequence assembly [10], read-binning in metagenomics [11-16] and analysis of regulatory elements [17-20]. Most alignment-free algorithms are based on the word or k-mer composition of the sequences under study [21]. To measure pairwise distances between genomic or protein sequences, it is common practice to apply standard metrics such as the Euclidean or the Jensen-Shannon (JS) distance [22] to the relative word frequency vectors of the sequences.

Recently, we proposed an alternative approach to alignment-free sequence comparison. Instead of considering contiguous subwords of the input sequences, our approach considers spaced words, i.e. words containing wildcard or don’t care characters at positions defined by a predefined pattern P. This is similar to the spaced-seeds approach that is used in database searching [23]. As in existing alignment-free methods, we compared the (relative) frequencies of these spaced words using standard distance measures [24]. In [25], we extended this approach by using whole sets P = {P1, ..., Pm} of patterns and calcula
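The idea of a spaced word can be illustrated with a short Python sketch. It is not the authors' software. It assumes that a pattern is written as a string of '1' (match position) and '0' (don't care position), and that a spaced-word match is a pair of positions, one in each sequence, whose windows agree at every match position. The function names spaced_words and count_spaced_word_matches are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def spaced_words(seq, pattern):
    """Yield the spaced word at every position of seq.

    Assumption: pattern is a string of '1' (match position) and
    '0' (don't care position). Only the characters at match
    positions are kept, so don't care positions are ignored.
    """
    match_idx = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in match_idx)


def count_spaced_word_matches(seq1, seq2, pattern):
    """Count pairs of positions (i, j) whose windows in seq1 and seq2
    agree at all match positions of the pattern."""
    counts1 = Counter(spaced_words(seq1, pattern))
    counts2 = Counter(spaced_words(seq2, pattern))
    # A word seen a times in seq1 and b times in seq2 gives a * b pairs.
    return sum(n * counts2[word] for word, n in counts1.items())


if __name__ == "__main__":
    s1 = "ACGTACGT"
    s2 = "ACCTACGA"
    print(count_spaced_word_matches(s1, s2, "101"))  # spaced pattern
    print(count_spaced_word_matches(s1, s2, "111"))  # contiguous 3-mers
```

A pattern consisting only of '1' characters reduces to ordinary contiguous k-mers, so the same function covers both cases. Turning the resulting count N into a distance estimate requires the model-based estimator developed in the paper, which this sketch does not attempt.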
