



iDistance Techniques

H. V. Jagadish 1, Beng Chin Ooi 2, Rui Zhang 3

1 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
2 Department of Computer Science, National University of Singapore, Singapore, Singapore
3 Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, VIC, Australia

Synonyms

Query, nearest neighbor; Scan, sequential

Definition

The iDistance is an indexing and query processing technique for k nearest neighbor (kNN) queries on point data in multi-dimensional metric spaces. The kNN query is one of the hardest problems on multi-dimensional data. It has been shown analytically and experimentally that, in high-dimensional spaces, any algorithm using a hierarchical index structure based on either space or data partitioning is less efficient than the naive method of sequentially checking every data record, called the sequential scan [4]. Some data distributions, including the uniform distribution, are particularly hard cases [1]. The iDistance is designed to process kNN queries in high-dimensional spaces efficiently, and it is especially good for skewed data distributions, which usually occur in real-life data sets. For uniform data, the iDistance beats the sequential scan in up to 30 dimensions, as reported in [3].

Building the iDistance index has two steps. First, a number of reference points in the data space are chosen. There are various ways of choosing reference points; using cluster centers as reference points is the most efficient. Second, the distance between each data point and its closest reference point is calculated. This distance plus a scaling value is the point's iDistance. By this means, points in a multi-dimensional space are mapped to one-dimensional values, and a B+-tree can then be adopted to index the points using the iDistance as the key. A kNN search is mapped to a number of one-dimensional range searches, which can be processed efficiently on the B+-tree. The iDistance technique can be viewed as a way of accelerating the sequential scan: instead of scanning records from the beginning to the end of the data file, the iDistance starts the scan from spots where the nearest neighbors can be obtained early with a very high probability.

Historical Background

The iDistance was first proposed by Cui Yu, Beng Chin Ooi, Kian-Lee Tan and H. V. Jagadish in 2001 [5]. Later, together with Rui Zhang, they improved the technique and performed a more comprehensive study of it in 2005 [3].

Scientific Fundamentals
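The core of the technique is the mapping from a multi-dimensional point to a one-dimensional key. A minimal sketch in Python, assuming Euclidean distance; the function name `idistance_key` and the scaling constant `c` are this sketch's own choices, not notation from the original papers:

```python
import math

def idistance_key(point, ref_points, c):
    """Map a multi-dimensional point to its one-dimensional iDistance key.

    c is a scaling constant chosen larger than the maximum possible
    distance from any point to its reference point, so that the key
    ranges of different partitions do not overlap.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # Find the closest reference point O_i for this point.
    i, o = min(enumerate(ref_points), key=lambda io: dist(point, io[1]))
    # Key = partition offset i*c plus the distance to O_i.
    return i * c + dist(point, o)
```

A B+-tree (or, in a sketch, any sorted structure) can then index the points by these keys.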

Figure 1 shows an example of how the iDistance works. The black dots are data points and the gray dots are reference points. The number of reference points is a tunable parameter, denoted by Nr. The recommended value for Nr is between 60 and 80. In this example, Nr = 3. At first, 3 cluster centers of the data points, O1, O2, O3, are identified using a clustering algorithm
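A kNN query maps to one-dimensional range searches around the query's distance to each reference point, widening the radius until the kth candidate is provably inside it. A hedged sketch of this search, where a sorted Python list stands in for the B+-tree and `knn_search` and its parameters are invented for illustration:

```python
import bisect
import math

def knn_search(query, k, entries, ref_points, c):
    """Sketch of iDistance kNN search.

    `entries` is a list of (key, point) pairs sorted by iDistance key,
    standing in for the B+-tree; `c` is the partition scaling constant,
    larger than any point-to-reference distance.
    """
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    keys = [key for key, _ in entries]
    r, step = 0.0, 1.0            # search radius, widened each round
    while True:
        r += step
        cands = set()
        for i, o in enumerate(ref_points):
            d = dist(query, o)
            # One-dimensional range search in partition i: keys in
            # [i*c + d - r, i*c + d + r], clamped to the partition.
            lo = bisect.bisect_left(keys, max(i * c, i * c + d - r))
            hi = bisect.bisect_right(keys, min((i + 1) * c, i * c + d + r))
            cands.update(entries[lo:hi])
        best = sorted((dist(query, p), p) for _, p in cands)[:k]
        # Stop when the kth distance is within r: by the triangle
        # inequality, no unexamined point can be closer.
        if len(best) == k and best[-1][0] <= r:
            return [p for _, p in best]
        if r > c:                 # all partitions fully scanned
            return [p for _, p in best]
```

The fixed widening step here is a simplification; a practical implementation would tune how the radius grows.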