Estimating evolutionary distances between genomic sequences from spaced-word matches
RESEARCH
Open Access
Burkhard Morgenstern1,2*, Bingyao Zhu3, Sebastian Horwege1 and Chris André Leimeister1
Abstract
Alignment-free methods are increasingly used to calculate evolutionary distances between DNA and protein sequences as a basis of phylogeny reconstruction. Most of these methods, however, use heuristic distance functions that are not based on any explicit model of molecular evolution. Herein, we propose a simple estimator d_N of the evolutionary distance between two DNA sequences that is calculated from the number N of (spaced) word matches between them. We show that this distance function is more accurate than other distance measures that are used by alignment-free methods. In addition, we calculate the variance of the normalized number N of (spaced) word matches. We show that the variance of N is smaller for spaced words than for contiguous words, and that the variance is further reduced if our spaced-words approach is used with multiple patterns of ‘match positions’ and ‘don’t care positions’. Our software is available online and as downloadable source code at: http://spaced.gobics.de/.
Keywords: k-mers, Spaced words, Alignment-free, Phylogeny, Word frequency, Distance estimation, Variance, Genome comparison
*Correspondence: [email protected]
1 University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37073 Göttingen, Germany
2 Université d’Evry Val d’Essonne, Laboratoire Statistique et Génome, UMR CNRS 8071, USC INRA 23 Boulevard de France, 91037 Evry, France
Full list of author information is available at the end of the article

Background
Alignment-free methods are increasingly used for DNA and protein sequence comparison since they are much faster than traditional alignment-based approaches [1]. Applications of alignment-free approaches include protein classification [2-5], read alignment [6-8], isoform quantification from RNAseq reads [9], sequence assembly [10], read-binning in metagenomics [11-16] and analysis of regulatory elements [17-20]. Most alignment-free algorithms are based on the word or k-mer composition of the sequences under study [21]. To measure pairwise distances between genomic or protein sequences, it is common practice to apply standard metrics such as the Euclidean or the Jensen-Shannon (JS) distance [22] to the relative word frequency vectors of the sequences.

Recently, we proposed an alternative approach to alignment-free sequence comparison. Instead of considering contiguous subwords of the input sequences, our approach considers spaced words, i.e. words containing wildcard or don’t care characters at positions defined by a predefined pattern P. This is similar to the spaced-seeds approach that is used in database searching [23]. As in existing alignment-free methods, we compared the (relative) frequencies of these spaced words using standard distance measures [24]. In [25], we extended this approach by using whole sets P = {P1, ..., Pm} of patterns and calcula
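
The frequency-based comparison of spaced words described above can be illustrated with a minimal Python sketch. It is not the authors' software: the function names are hypothetical, the pattern encoding ('1' for a match position, '0' for a don't care position, shown as '*') is an assumption made for this example, and the Jensen-Shannon divergence with base-2 logarithms stands in for the JS distance mentioned in the text, whose exact form is not given in this section.

```python
from collections import Counter
from math import log2, sqrt


def spaced_words(seq, pattern):
    """Yield the spaced word at every position of seq.

    pattern is a string over {'1', '0'} (assumed encoding):
    '1' marks a match position, '0' a don't care position,
    which is written as '*' in the resulting word.
    """
    span = len(pattern)
    for i in range(len(seq) - span + 1):
        window = seq[i:i + span]
        yield "".join(c if p == "1" else "*" for c, p in zip(window, pattern))


def relative_frequencies(seq, pattern):
    """Map each spaced word in seq to its relative frequency."""
    counts = Counter(spaced_words(seq, pattern))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two sparse frequency vectors."""
    words = set(p) | set(q)
    return sqrt(sum((p.get(w, 0.0) - q.get(w, 0.0)) ** 2 for w in words))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) of two sparse distributions."""
    div = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        m = 0.5 * (pw + qw)  # mixture distribution at word w
        if pw > 0:
            div += 0.5 * pw * log2(pw / m)
        if qw > 0:
            div += 0.5 * qw * log2(qw / m)
    return div


if __name__ == "__main__":
    # Toy sequences and pattern, chosen only for illustration.
    s1, s2 = "ACGTACGTTGCA", "ACGAACGTTGGA"
    pattern = "1101"
    f1 = relative_frequencies(s1, pattern)
    f2 = relative_frequencies(s2, pattern)
    print("Euclidean:", euclidean_distance(f1, f2))
    print("JS divergence:", js_divergence(f1, f2))
```

A pattern consisting only of '1' characters reduces this to ordinary contiguous k-mers, so the same functions cover both cases. For a set of patterns, one simple option (again an assumption, not taken from the text) would be to compute the distance for each pattern separately and average the results.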