BestNeighbor: efficient evaluation of kNN queries on large time series databases

  • PDF / 1,567,662 Bytes
  • 30 Pages / 439.37 x 666.142 pts Page_size
  • 56 Downloads / 176 Views

DOWNLOAD

REPORT


BestNeighbor: efficient evaluation of kNN queries on large time series databases Oleksandra Levchenko, et al. [full author details at the end of the article] Received: 27 February 2020 / Revised: 29 September 2020 / Accepted: 4 October 2020 © Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract This paper presents parallel solutions (developed based on two state-of-the-art algorithms iSAX and sketch) for evaluating k nearest neighbor queries on large databases of time series, compares them based on various measures of quality and time performance, and offers a tool that uses the characteristics of application data to determine which algorithm to choose for that application and how to set the parameters for that algorithm. Specifically, our experiments show that: (i) iSAX and its derivatives perform best in both time and quality when the time series can be characterized by a few low-frequency Fourier Coefficients, a regime where the iSAX pruning approach works well. (ii) iSAX performs significantly less well when highfrequency Fourier Coefficients have much of the energy of the time series. (iii) A random projection approach based on sketches by contrast is more or less independent of the frequency power spectrum. The experiments show the close relationship between pruning ratio and time for exact iSAX as well as between pruning ratio and the quality of approximate iSAX. Our toolkit analyzes typical time series of an application (i) to determine optimal segment sizes for iSAX and (ii) when to use Parallel Sketches instead of iSAX. Our algorithms have been implemented using Spark, evaluated over a cluster of nodes, and have been applied to both real and synthetic data. The results apply to any databases of numerical sequences, whether or not they relate to time. Keywords Time series · Parallel indexing · Distributed querying · Fourier coefficients · Frequency analysis · Filtering

1 Introduction Nowadays people are able to monitor various indicators for their personal activities (e.g., through smart-meters or smart-plugs for electricity or water consumption), or professional activities (e.g., through the sensors installed on plants by farmers). Sensor technology is also improving over time and the number of sensors is increasing, e.g., in finance and seismic studies. This results in the production of large and complex data, usually in the form of time series (or TS in short) [13,18,26,29,30,33,34,36,40,49] that challenge knowledge discovery.

B

Oleksandra Levchenko [email protected]

Extended author information available on the last page of the article

123

O. Levchenko et al.

With such complex and massive sets of time series, fast and accurate similarity search is critical to performing many data mining tasks like shapelets, motifs discovery, classification, or clustering [28,33,52]. In order to improve the performance of such similarity queries, indexing [8–10,31] has been successfully used in a variety of settings and applications [2,5,11,20,27,42,44,50]. In this work, we focus on tw