Computing Longest Common Substrings Via Suffix Arrays


Abstract. Given a set of N strings A = {α1, ..., αN} of total length n over alphabet Σ, one may ask to find, for each 2 ≤ K ≤ N, the longest substring β that appears in at least K strings in A. It is known that this problem can be solved in O(n) time with the help of suffix trees. However, the resulting algorithm is rather complicated (in particular, it involves answering certain least common ancestor queries in O(1) time). Also, its running time and memory consumption may depend on |Σ|. This paper presents an alternative, remarkably simple approach to the above problem, which relies on the notion of suffix arrays. Once the suffix array of some auxiliary O(n)-length string is computed, one needs a simple O(n)-time postprocessing to find the requested longest substring. Since a number of efficient and simple linear-time algorithms for constructing suffix arrays have been recently developed (with constants not depending on |Σ|), our approach seems to be quite practical.

1 Introduction

Consider the following problem: (LCS) Given a collection of N strings A = {α1, ..., αN} over alphabet Σ, find, for each 2 ≤ K ≤ N, the longest string β that is a substring of at least K strings in A. It is known as a generalized version of the Longest Common Substring (LCS) problem and has plenty of practical applications; see [Gus97] for a survey.

Even in the simplest case of N = K = 2, a linear-time algorithm is not easy. A standard approach is to construct the so-called generalized suffix tree T (see [Gus97]) for α1$1 and α2$2, which is a compacted symbol trie that captures all the substrings of α1$1 and α2$2. Here $1 and $2 are special symbols (called sentinels) that are distinct and do not appear in α1 and α2. Then the nodes of T are examined in a bottom-up fashion, and those having sentinels of both types in their subtrees are listed. Among these nodes of T, choose a node v with the largest string depth (which is the length of the string obtained by reading letters along the path from the root to v). The string that corresponds to v in T is the answer. See [Gus97] for more details.

E.A. Hirsch et al. (Eds.): CSR 2008, LNCS 5010, pp. 64–75, 2008. © Springer-Verlag Berlin Heidelberg 2008
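To make the N = K = 2 case concrete, here is a minimal Python sketch that finds a longest common substring of two strings with a suffix array instead of a suffix tree. It is an illustration only, not the algorithm developed in the paper: the suffix array is built by naive sorting and common prefixes are compared character by character, so it is far from linear time. The function name `longest_common_substring` and the choice of `"\x00"` and `"\x01"` as sentinels are assumptions made for this example.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest string that is a substring of both a and b."""
    # Assumed sentinels: two distinct characters that occur in neither input.
    sep1, sep2 = "\x00", "\x01"
    if sep1 in a + b or sep2 in a + b:
        raise ValueError("inputs must not contain the sentinel characters")

    # Auxiliary string: a, sentinel 1, b, sentinel 2.
    s = a + sep1 + b + sep2
    n = len(s)

    # Naive suffix array: suffix start positions sorted by suffix text.
    # (Illustration only; a linear-time construction would replace this.)
    sa = sorted(range(n), key=lambda i: s[i:])

    def owner(i: int) -> int:
        # Suffixes starting in a (or at its sentinel) belong to string 0.
        return 0 if i <= len(a) else 1

    def lcp(i: int, j: int) -> int:
        # Length of the longest common prefix of the suffixes at i and j.
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    best_len, best_pos = 0, 0
    # A longest common substring shows up as the common prefix of two
    # neighbouring suffixes in the suffix array that come from different
    # input strings. The sentinels keep a common prefix from running past
    # the end of either input.
    for x, y in zip(sa, sa[1:]):
        if owner(x) != owner(y):
            length = lcp(x, y)
            if length > best_len:
                best_len, best_pos = length, x
    return s[best_pos:best_pos + best_len]


print(longest_common_substring("banana", "ananas"))  # prints "anana"
```

The scan over neighbouring suffixes plays the role of the bottom-up tree traversal described above: instead of looking for a deepest node with both sentinels below it, it looks for the longest prefix shared by two adjacent suffixes that originate from different strings.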


In practice, the above approach is not very efficient since it involves computing T. Several linear-time algorithms for the latter task are known (possibly the most famous one is due to Ukkonen [Ukk95]). However, suffix trees are still not very convenient. They do have a linear space bound, but the hidden constants can be pretty large. Most modern algorithms for computing suffix trees have a time bound of O(n log |Σ|) (where n denotes the length of a string). Hence, their running time depends on |Σ|. Moreover, achieving this time bound requires using balanced search trees to store arcs. The latter data structures further increase the constants in both time and space bounds, making these algorithms rather impractical. Other options include employing hash tables or assuming that |Σ| is small and using direc
