Fast Evaluation of Union-Intersection Expressions

We show how to represent sets in a linear space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case efficient way. This problem has applications in e.g. information retrieval and database systems

PDF / 457,200 Bytes
12 Pages / 430 x 660 pts Page_size
78 Downloads / 218 Views

DOWNLOAD

REPORT

Abstract. We show how to represent sets in a linear space data structure such that expressions involving unions and intersections of sets can be computed in a worst-case eﬃcient way. This problem has applications in e.g. information retrieval and database systems. We mainly consider the RAM model of computation, and sets of machine words, but also state our results in the I/O model. On a RAM with word size w, a special case of our result is that the intersection of m (preprocessed) sets, containing n elements in total, can be computed in expected time O(n(log w)2 /w + km), where k is the number of elements in the intersection. If the ﬁrst of the two terms dominates, this is a factor w1−o(1) faster than the standard solution of merging sorted lists. We show a k )k), meaning cell probe lower bound of time Ω(n/(wm log m) + (1 − log w that our upper bound is nearly optimal for small m. Our algorithm uses a novel combination of approximate set representations and word-level parallelism.

1

Introduction

Algorithms and data structures for sets play an important role in computer science. For example, the relational data model, which has been the dominant database paradigm for decades, is based on set representation and manipulation. Set operations also arise naturally in connection with database queries that can be expressed as a boolean combination of simpler queries. For example, search engines report documents that are present in the intersection of several sets of documents, each corresponding to a word in the query. If we ﬁx the set of documents to be searched, it is possible to spend time on preprocessing all sets, to decrease the time for answering queries. The search engine application has been the main motivation in several recent works on computing set intersections [14, 5, 13]. All these papers assume that elements are taken from an ordered set, and are accessed through comparisons. In particular, creating the canonical representation, a sorted list, is the best possible preprocessing in this context. The comparison-based model rules out some algorithms that are very eﬃcient, both in theory and practice. For example, if the preprocessing produces a hashing-based dictionary for each set, the intersection of two sets S1 and S2 can be computed in expected time O(min(|S1 |, |S2 |)). This T. Tokuyama (Ed.): ISAAC 2007, LNCS 4835, pp. 739–750, 2007. c Springer-Verlag Berlin Heidelberg 2007

740

P. Bille, A. Pagh, and R. Pagh

1 | |S2 | is a factor Θ(log(1 + max( |S |S2 | , |S1 | ))) faster than the best possible worst-case performance of comparison-based algorithms. In this paper we investigate non-comparison-based techniques for evaluating expressions involving unions and intersections of sets on a RAM. (In the search engine application this corresponds to expressions using AND and OR operators.) Speciﬁcally, we consider the situation in which each set is required to be represented in a linear space data structure, and propose the multi-resolution set representation, which is suitable for eﬃcient set operations

Data Loading...

Fast Evaluation of Union-Intersection Expressions

Recommend Documents

Design and Evaluation of Affective Expressions of a Zoomorphic Robot

Evaluation of Knowledge-Based Recognition of Spatial Expressions for Polish

Regular Expressions

Integer Expressions

Regular Expressions

Fast and accurate evaluation of dual Bernstein polynomials

Working with Flow Expressions

Manipulating algebraic expressions

Shape Constraints and Expressions

Whole Numbers, Analysis of Arithmetic Expressions

N3X: Notation3 with SPARQL Expressions

Women and Revolution: Global Expressions