Fast Evaluation of Union-Intersection Expressions
We show how to represent sets in a linear space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case efficient way. This problem has applications in e.g. information retrieval and database systems
- PDF / 457,200 Bytes
- 12 Pages / 430 x 660 pts Page_size
- 78 Downloads / 193 Views
Abstract. We show how to represent sets in a linear space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case efficient way. This problem has applications in e.g. information retrieval and database systems. We mainly consider the RAM model of computation, and sets of machine words, but also state our results in the I/O model. On a RAM with word size w, a special case of our result is that the intersection of m (preprocessed) sets, containing n elements in total, can be computed in expected time O(n(log w)2 /w + km), where k is the number of elements in the intersection. If the first of the two terms dominates, this is a factor w1−o(1) faster than the standard solution of merging sorted lists. We show a k )k), meaning cell probe lower bound of time Ω(n/(wm log m) + (1 − log w that our upper bound is nearly optimal for small m. Our algorithm uses a novel combination of approximate set representations and word-level parallelism.
1
Introduction
Algorithms and data structures for sets play an important role in computer science. For example, the relational data model, which has been the dominant database paradigm for decades, is based on set representation and manipulation. Set operations also arise naturally in connection with database queries that can be expressed as a boolean combination of simpler queries. For example, search engines report documents that are present in the intersection of several sets of documents, each corresponding to a word in the query. If we fix the set of documents to be searched, it is possible to spend time on preprocessing all sets, to decrease the time for answering queries. The search engine application has been the main motivation in several recent works on computing set intersections [14, 5, 13]. All these papers assume that elements are taken from an ordered set, and are accessed through comparisons. In particular, creating the canonical representation, a sorted list, is the best possible preprocessing in this context. The comparison-based model rules out some algorithms that are very efficient, both in theory and practice. For example, if the preprocessing produces a hashing-based dictionary for each set, the intersection of two sets S1 and S2 can be computed in expected time O(min(|S1 |, |S2 |)). This T. Tokuyama (Ed.): ISAAC 2007, LNCS 4835, pp. 739–750, 2007. c Springer-Verlag Berlin Heidelberg 2007
740
P. Bille, A. Pagh, and R. Pagh
1 | |S2 | is a factor Θ(log(1 + max( |S |S2 | , |S1 | ))) faster than the best possible worst-case performance of comparison-based algorithms. In this paper we investigate non-comparison-based techniques for evaluating expressions involving unions and intersections of sets on a RAM. (In the search engine application this corresponds to expressions using AND and OR operators.) Specifically, we consider the situation in which each set is required to be represented in a linear space data structure, and propose the multi-resolution set representation, which is suitable for efficient set operations
Data Loading...