Computing Longest Common Substrings Via Suffix Arrays
Given a set of N strings Open image in new window of total length n over alphabet Σ one may ask to find, for each 2 ≤ K ≤ N, the longest substring β that appears in at least K strings in A. It is known that this problem can be solved in O(n) time with the
- PDF / 433,462 Bytes
- 12 Pages / 430 x 660 pts Page_size
- 1 Downloads / 196 Views
Abstract. Given a set of N strings A = {α1 , . . . , αN } of total length n over alphabet Σ one may ask to find, for each 2 ≤ K ≤ N , the longest substring β that appears in at least K strings in A. It is known that this problem can be solved in O(n) time with the help of suffix trees. However, the resulting algorithm is rather complicated (in particular, it involves answering certain least common ancestor queries in O(1) time). Also, its running time and memory consumption may depend on |Σ|. This paper presents an alternative, remarkably simple approach to the above problem, which relies on the notion of suffix arrays. Once the suffix array of some auxiliary O(n)-length string is computed, one needs a simple O(n)-time postprocessing to find the requested longest substring. Since a number of efficient and simple linear-time algorithms for constructing suffix arrays has been recently developed (with constant not depending on |Σ|), our approach seems to be quite practical.
1
Introduction
Consider the following problem: (LCS) Given a collection of N strings A = {α1 , . . . , αN } over alphabet Σ find, for each 2 ≤ K ≤ N , the longest string β that is a substring of at least K strings in A. It is known as a generalized version of the Longest Common Substring (LCS) problem and has a plenty of practical applications, see [Gus97] for a survey. Even in the simplest case of N = K = 2 a linear-time algorithm is not easy. A standard approach is to construct the so-called generalized suffix tree T (see [Gus97]) for α1 $1 and α2 $2 , which is a compacted symbol trie that captures all the substrings of α1 $1 , α2 $2 . Here $i are special symbols (called sentinels) that are distinct and do not appear in α1 and α2 . Then, nodes of T are examined in a bottom-up fashion and those having sentinels of both types in their subtrees are listed. Among these nodes of T let us choose a node v with the largest string depth (which is the length of the string obtained by reading letters along the path from root to v). The string that corresponds to v in T is the answer. See [Gus97] for more details. E.A. Hirsch et al. (Eds.): CSR 2008, LNCS 5010, pp. 64–75, 2008. c Springer-Verlag Berlin Heidelberg 2008
Computing Longest Common Substrings Via Suffix Arrays
65
In practice, the above approach is not very efficient since it involves computing T . Several linear-time algorithms for the latter task are known (possibly, the most famous one is due to Ukkonen [Ukk95]). However, suffix trees are still not very convenient. They do have linear space bound but the hidden constants can be pretty large. Most of modern algorithms for computing suffix trees have the time bound of O(n log |Σ|) (where n denotes the length of a string). Hence, their running time depends on |Σ|. Moreover, achieving this time bound requires using balanced search trees to store arcs. The latter data structures further increase constants in both time- and space-bounds making these algorithms rather impractical. Other options include employing hashtables or assuming that |Σ| is small and using direc
Data Loading...