A Fast, Alignment-Free, Conservation-Based Method for Transcription Factor Binding Site Discovery
Abstract. As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for incorporating conservation information into TF motif discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It does not require sequence alignments, nor the phylogenetic relationships between the orthologous sequences, and yet it is more effective on real biological data than methods that do.
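The relaxed conservation definition above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed site width, the pseudocount, and the way ortholog hits are normalized into a prior are all assumptions made for illustration, and sequences are assumed to be uppercase strings over A, C, G, T. A site counts as conserved in an ortholog if the site or its reverse complement occurs anywhere in that ortholog, and the per-position hit counts are turned into a distribution over start positions of the kind a Gibbs sampler could use as a positional prior.

```python
from typing import List

# Translation table for complementing uppercase DNA (assumed alphabet: A, C, G, T).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return site.translate(_COMPLEMENT)[::-1]


def conservation_prior(
    reference: str,
    orthologs: List[str],
    width: int,
    pseudocount: float = 0.5,  # hypothetical smoothing value, not from the paper
) -> List[float]:
    """Alignment-free prior over start positions in `reference`.

    For each window of length `width`, count the orthologs that contain
    the window or its reverse complement anywhere, add a pseudocount,
    and normalize the scores so they sum to 1.
    """
    n_starts = len(reference) - width + 1
    if n_starts <= 0:
        raise ValueError("reference must be at least `width` long")

    scores = []
    for start in range(n_starts):
        site = reference[start:start + width]
        rc = reverse_complement(site)
        # Conserved in an ortholog if it occurs anywhere, in either orientation.
        hits = sum(1 for seq in orthologs if site in seq or rc in seq)
        scores.append(hits + pseudocount)

    total = sum(scores)
    return [score / total for score in scores]


if __name__ == "__main__":
    prior = conservation_prior("ACGTTT", ["GGACGG", "CCACGCC"], width=3)
    for start, p in enumerate(prior):
        print(start, round(p, 3))
```

In this toy input, the windows starting at positions 0 and 1 (ACG and CGT, which are reverse complements of each other) are found in both orthologs and so receive more prior mass than the remaining windows. No alignment or phylogenetic tree is needed, only substring lookups.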
1 Introduction
With recent advances in DNA sequencing technologies, the number of closely related genomes being sequenced [1, 2, 3] has increased tremendously. Consequently, this has led to an increased emphasis on comparative studies focused on detecting functional elements in intergenic DNA sequences. Functional elements, including TF binding sites, are known to evolve at a slower rate than non-functional elements, and therefore DNA sites that are well conserved in orthologous regulatory regions are considered good candidates for TF binding sites. A plethora of algorithms use evolutionary conservation information for de novo TF motif discovery, either by filtering the putative regions according to their conservation levels and then applying conventional motif finders, or by incorporating the conservation information into the motif finder itself. The former approach has a major limitation: motifs that are not well conserved are likely to be missed. Most conservation-based motif finders therefore take the latter approach. These methods can be further divided into two main categories: 1) ‘single gene, multiple species’, and 2) ‘multiple genes, multiple species’. Methods in the first category (e.g., FootPrinter [4], the phylogenetic Gibbs sampler of Newberg et al. [5]) take as input the regulatory region of a single gene, together with its orthologs from related organisms. Methods in the second category (e.g., the method of Kellis et al. [1], Converge [6, 7], PhyloCon [8], PhyME [9], PhyloGibbs [10], Orth
These authors contributed equally to this work.
M. Vingron and L. Wong (Eds.): RECOMB 2008, LNBI 4955, pp. 98–111, 2008. © Springer-Verlag Berlin Heidelberg 2008