Mobile Genetic Elements Protocols and Genomic Applications

Transposable elements are used as genetic tools for dissecting the function of specific genes and for elucidating the mechanisms that lead to genetic change and diversity, as well as the evolutionary impact of mobile DNA on the biology and evolution of organisms. In, M


1. Introduction

The many types of repeats that occur in genomic sequences have been extensively described in the literature, and new types are often discovered in newly sequenced species. Managing so many families relies on dedicated databases (1) or software pipelines such as REPET (see ref. 2 and the corresponding chapter in this volume). Besides the study of these natural repeat families, biologists are routinely concerned with sequence comparison, a task that relies on the search for words common to a given set of sequences and is

Yves Bigot (ed.), Mobile Genetic Elements: Protocols and Genomic Applications, Methods in Molecular Biology, vol. 859, DOI 10.1007/978-1-61779-603-6_4, © Springer Science+Business Media, LLC 2012

J. Nicolas

almost always based on the precomputation of a repeat index on the sequences. This issue became even more critical with the advent of next-generation sequencing technologies, which lead each laboratory to access or produce an increasing quantity of sequence data. Whole-genome sequencing is becoming routine practice for bacteria, and the analysis of the genetic diversity of eukaryotic populations by means of large-scale re-sequencing projects is becoming a general trend. In this context, the presence of repeats causes major assembly issues and requires further algorithmic developments.

On the theoretical side, it is important to try to describe all these repeats with a set of precise common concepts in order to better understand their structure and to rationalize the design of search algorithms. Simplified generic models are used to capture important formal properties of biological repeats. Most of them stem from problems that arose in other domains, such as data compression or Web indexing. Among the corresponding studies, those addressing the fundamental problem of looking for exact repeats prevail.

This chapter proposes a quick review of the concepts at the core of any repeat model in a sequence, focusing mostly on exact repeats. It seems clear to us that anyone interested in the large-scale study of genomic repeats should have a good understanding of these concepts, and we have tried to point throughout the chapter to efficient tools that could help turn theory into practice.

2. Working on Exact Repeats

Exact repeats are words with several identical occurrences that are possibly overlapping. The search for approximate repeats is always based on the search for exact repeats that reflect the presence of the repeated structure and serve as anchor points during the exploration. Exact repeats have been extensively studied, starting from simple k-grams or k-mers, which are just words of fixed size k. A fundamental issue is to limit the number of representatives that are necessary in order to describe all “interesting” repeats. The notion of maximality is quite natural in this respect but not so trivial to define properly. For instance, given the string GTTCGTTTCTTA, the single letter T is repeated seven times in the string, making it the exa