Optimal computation of all tandem repeats in a weighted sequence

  • PDF / 285,916 Bytes
  • 8 Pages / 595 x 794 pts Page_size
  • 0 Downloads / 186 Views

DOWNLOAD

REPORT


RESEARCH

Open Access

Optimal computation of all tandem repeats in a weighted sequence Carl Barton1* , Costas S Iliopoulos1,2,3 and Solon P Pissis1

Abstract Background: Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats and not necessarily only in exact tandem repeats. A weighted sequence is a string in which a set of letters may occur at each position with respective probabilities of occurrence. It naturally arises in many biological contexts and provides a method to realise the approximation among distinct adjacent occurrences of the same DNA segment. Results: Crochemore’s repetitions algorithm, also referred to as Crochemore’s partitioning algorithm, was introduced in 1981, and was the first optimal O (n log n)-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore’s partitioningalgorithm for weighted sequences, which requires  optimal O (n log n) time, thus improving on the best known O n2 -time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n. Keywords: Tandem repeats, Weighted sequences, IUPAC notation

Background A fundamental structural characteristic of a string of letters is its periodicity. Closely related to periodicity is the notion of repetition. Repetitions in strings are highly periodic factors, that is, two or more adjacent identical factors. For instance, the string TATA is a repetition in the string CTATAGT. Clearly a string may contain a quadratic number of repetitions. In 1981, it was shown by Crochemore that there could be O(n log n) maximal repetitions in a string of length n and an O(n log n)-time, thus optimal, algorithm was presented [1]. In 1999, Kolpakov and Kucherov presented an O(n)-time algorithm to compute the most compact representation of all repetitions known as runs [2]. Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. It usually results from replication slippage or from certain recombination events, such *Correspondence: [email protected] 1 King’s College London, London, UK Full list of author information is available at the end of the article

as unequal crossing-over or unequal sister chromatide exchange [3]. In this context, the result of a tandem duplication event is termed a tandem repeat. It appears in both eukaryotic [4] and prokaryotic [5] genomes. Through time, individual copies within a tandem repeat may change by additional, uncoordinated, mutations, and so only approximate tandem copies may be present. The major bottleneck in identifying biologically relevant tandem repeats in genomic sequenc