LZ77-Based Self-indexing with Faster Pattern Matching
To store and search genomic databases efficiently, researchers have recently started building self-indexes based on LZ77. As the name suggests, a self-index for a string supports both random access and pattern matching queries. In this paper we show how,
- PDF / 242,203 Bytes
- 12 Pages / 439.363 x 666.131 pts Page_size
- 66 Downloads / 252 Views
University of Helsinki, Finland Max Planck Institute, Germany University of Kansas, United States
2 3
Abstract. To store and search genomic databases efficiently, researchers have recently started building self-indexes based on LZ77. As the name suggests, a self-index for a string supports both random access and pattern matching queries. In this paper we show how, given a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(z log(n/z)) space such that later, first, given a position i and a length , we can extract S[i..i + − 1] in O( + log n) time; second, given a pattern P [1..m], we can list the occ occurrences of P in S in O(m log m + occ log log n) time.
1
Introduction
With the advance of DNA-sequencing technologies comes the problem of how to store many individuals’ genomes compactly but such that we can search them quickly. Any two human genomes are almost the same but self-indexes based on compressed suffix arrays, the Burrows-Wheeler Transform or LZ78 (see [24] for a survey) do not take full advantage of this similarity [20]. Researchers have recently started building self-indexes based on context-free grammars and LZ77 [30], which better compress such highly repetitive strings. A self-index for a string S[1..n] stores S in compressed form such that later, first, given a position i and a length , we can quickly extract S[i..i + − 1]; second, given a pattern P [1..m], we can quickly list the occ occurrences of P in S. In this paper we describe a self-index that takes O(z log(n/z)) space, where z is the number of phrases in the LZ77 parse of S, and supports extraction in O( + log n) time and pattern matching in O(m log m + occ log log n) time. Our model throughout is the word RAM with Θ(log n)-bit words and we measure space in terms of these words. We assume S is over a fixed alphabet. Several authors have designed self-indexes for repetitive data sets: e.g., Arroyuelo, Navarro and Sadakane [3]; Claude and Navarro [7]; Do et al. [8]; Huang et al. [15]; Kreft and Navarro [19]; M¨akinen et al. [20]; Maruyama et al. [21]; Russo and Oliveira![25]; Wandelt et al. [28]; Yang et al. [29]; and ourselves [13]. Most of these indexes have bounds depending on values other than m, n and z, however, making them difficult to compare directly to our results in this paper. A. Pardo and A. Viola (Eds.): LATIN 2014, LNCS 8392, pp. 731–742, 2014. c Springer-Verlag Berlin Heidelberg 2014
732
T. Gagie et al.
Here we are concerned with theoretical worst-case bounds and, as far as we are aware, when any known self-index’s worst-case bounds are expressed in terms only of n and z, they are somehow worse than the bounds we give here. In Section 2 we briefly describe LZ77 and other preliminaries: straight-line programs, bookmarking, and Karp-Rabin fingerprinting. In Section 3 we describe K¨ arkk¨ainen and Ukkonen’s [16] LZ77-based index. Their index makes use of Patricia trees [22] and planar orthogonal range reporting [2,6], and we assume knowledge of these concepts here. Although not itself
Data Loading...