Efficient locality-sensitive hashing over high-dimensional streaming data

  • PDF / 955,873 Bytes
  • 14 Pages / 595.276 x 790.866 pts Page_size
  • 40 Downloads / 237 Views

DOWNLOAD

REPORT


(0123456789().,-volV)(0123456789(). ,- volV)

S.I. : DEEP GEOSPATIAL DATA UNDERSTANDING

Efficient locality-sensitive hashing over high-dimensional streaming data Hao Wang1 • Chengcheng Yang2,3



Xiangliang Zhang2 • Xin Gao1

Received: 20 July 2020 / Accepted: 2 September 2020  Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental in many applications. Localitysensitive hashing (LSH) is a well-known methodology to solve the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for searching efficiency. Updating such indexes might be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that offers efficient support for both searches and updates. The contributions of our work are threefold. First, we use the writefriendly LSM-trees to store the LSH projections to facilitate efficient updates. Second, we develop a novel estimation scheme to estimate the number of required LSH functions, with which the disk storage and access costs are effectively reduced. Third, we exploit both the collision number and the projection distance to improve the efficiency of candidate selection, improving the search performance with theoretical guarantees on the result quality. Experiments on four realworld datasets show that our proposal outperforms the state-of-the-art schemes. Keywords Approximate nearest neighbor search  Locality-sensitive hashing  LSM-tree  Streaming data

1 Introduction Nearest neighbor (NN) search in high-dimensional Euclidean space plays an essential role in applications from many domains, including multimedia databases, recommender systems, statistical machine learning, bioinformatics, etc. In such applications, data objects are typically & Chengcheng Yang [email protected] Hao Wang [email protected] Xiangliang Zhang [email protected] Xin Gao [email protected] 1

Computational Bioscience Research Center, CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

2

Machine Intelligence and kNowledge Engineering Laboratory, CEMSE Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

3

Shenzhen University, Shenzhen, China

characterized as feature vectors/points in multi-dimensional spaces. There are numerous studies on NN search in multi-dimensional spaces, yet it has been shown that the performance sharply degenerates as the dimensionality increases [1]. Thus, approximate nearest neighbor (ANN) search has received much attention, as it can be much more efficient with theoretically guaranteed quality of search results. Locality-sensitive hashing (LSH) [2] is a widely used technique that has shown significant promise in high-dimensional ANN search. LSH solves the c-approximate nearest neighbor (c-ANN) problem, which, given a query object q, aims at finding an object o in the database such that the distan