LZ77-Based Self-indexing with Faster Pattern Matching

To store and search genomic databases efficiently, researchers have recently started building self-indexes based on LZ77. As the name suggests, a self-index for a string supports both random access and pattern matching queries. In this paper we show how,

PDF / 242,203 Bytes
12 Pages / 439.363 x 666.131 pts Page_size
66 Downloads / 262 Views

DOWNLOAD

REPORT

University of Helsinki, Finland Max Planck Institute, Germany University of Kansas, United States

2 3

Abstract. To store and search genomic databases eﬃciently, researchers have recently started building self-indexes based on LZ77. As the name suggests, a self-index for a string supports both random access and pattern matching queries. In this paper we show how, given a string S[1..n] whose LZ77 parse consists of z phrases, we can store a self-index for S in O(z log(n/z)) space such that later, ﬁrst, given a position i and a length , we can extract S[i..i + − 1] in O( + log n) time; second, given a pattern P [1..m], we can list the occ occurrences of P in S in O(m log m + occ log log n) time.

1

Introduction

With the advance of DNA-sequencing technologies comes the problem of how to store many individuals’ genomes compactly but such that we can search them quickly. Any two human genomes are almost the same but self-indexes based on compressed suﬃx arrays, the Burrows-Wheeler Transform or LZ78 (see [24] for a survey) do not take full advantage of this similarity [20]. Researchers have recently started building self-indexes based on context-free grammars and LZ77 [30], which better compress such highly repetitive strings. A self-index for a string S[1..n] stores S in compressed form such that later, ﬁrst, given a position i and a length , we can quickly extract S[i..i + − 1]; second, given a pattern P [1..m], we can quickly list the occ occurrences of P in S. In this paper we describe a self-index that takes O(z log(n/z)) space, where z is the number of phrases in the LZ77 parse of S, and supports extraction in O( + log n) time and pattern matching in O(m log m + occ log log n) time. Our model throughout is the word RAM with Θ(log n)-bit words and we measure space in terms of these words. We assume S is over a ﬁxed alphabet. Several authors have designed self-indexes for repetitive data sets: e.g., Arroyuelo, Navarro and Sadakane [3]; Claude and Navarro [7]; Do et al. [8]; Huang et al. [15]; Kreft and Navarro [19]; M¨akinen et al. [20]; Maruyama et al. [21]; Russo and Oliveira![25]; Wandelt et al. [28]; Yang et al. [29]; and ourselves [13]. Most of these indexes have bounds depending on values other than m, n and z, however, making them diﬃcult to compare directly to our results in this paper. A. Pardo and A. Viola (Eds.): LATIN 2014, LNCS 8392, pp. 731–742, 2014. c Springer-Verlag Berlin Heidelberg 2014

732

T. Gagie et al.

Here we are concerned with theoretical worst-case bounds and, as far as we are aware, when any known self-index’s worst-case bounds are expressed in terms only of n and z, they are somehow worse than the bounds we give here. In Section 2 we brieﬂy describe LZ77 and other preliminaries: straight-line programs, bookmarking, and Karp-Rabin ﬁngerprinting. In Section 3 we describe K¨ arkk¨ainen and Ukkonen’s [16] LZ77-based index. Their index makes use of Patricia trees [22] and planar orthogonal range reporting [2,6], and we assume knowledge of these concepts here. Although not itself

Data Loading...

LZ77-Based Self-indexing with Faster Pattern Matching

Recommend Documents

Pattern Matching

Two-Dimensional Pattern Matching with Scaling

Compressed Pattern Matching

Contextual Pattern Matching

Outsole Pattern Matching

Point Pattern Matching

Multidimensional Compressed Pattern Matching

Bounded Pattern Matching Using Views

Relational Synthesis for Pattern Matching

Two-Dimensional Scaled Pattern Matching

Pattern Matching in RNA Structures

Pattern Recognition and String Matching