Stratified random sampling from streaming and stored data


Trong Duc Nguyen¹ · Ming‑Hung Shih¹ · Divesh Srivastava² · Srikanta Tirthapura¹ · Bojian Xu³

Accepted: 9 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Stratified random sampling (SRS) is a widely used sampling technique for approximate query processing. We consider SRS on continuously arriving data streams and statically stored data sets. We present a tight lower bound showing that any streaming algorithm for SRS over the entire stream must have, in the worst case, a variance that is a factor of Ω(r) away from the optimal, where r is the number of strata. We present S-VOILA, a practical streaming algorithm for SRS over the entire stream that is locally variance-optimal. We prove that any sliding window-based streaming SRS needs a workspace of Ω(rM log W) in the worst case to maintain a variance-optimal SRS of size M, where W is the number of elements in the sliding window. Because of the inherently high workspace needs of sliding window-based SRS, we present SW-VOILA, a practical multi-layer sampling algorithm that uses only O(M) workspace but can maintain an SRS of size close to M in practice over a sliding window. Experiments show that both S-VOILA and SW-VOILA result in a variance that is typically close to that of their optimal offline counterparts, which are given the entire input beforehand. We also present VOILA, a variance-optimal offline algorithm for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.

Keywords  Stratified random sampling · Stream sampling · Sliding window sampling · Neyman allocation

A preliminary version of this work appears in [1].

* Trong Duc Nguyen [email protected]

¹ Iowa State University, Ames, USA
² AT&T - Research, Austin, USA
³ Eastern Washington University, Cheney, USA





Distributed and Parallel Databases

1 Introduction

Random sampling is a widely used method for data analysis, and features prominently in the toolbox of virtually every approximate query processing system. The power of random sampling lies in its generality. For many important classes of queries, an approximate answer whose error is small in a statistical sense can be obtained efficiently by executing the query over an appropriately derived random sample. Sampling operators are part of all major database products, e.g., Oracle, Microsoft SQL Server, and IBM Db2.

The simplest method for random sampling is uniform random sampling, where each element from the entire data (the “population”) is chosen with the same probability. Uniform random sampling may, however, lead to a high variance in estimation. For instance, consider a population D = {1, 2, 4, 2, 1, 1050, 1000, 1200, 1300}, and suppose we wanted to estimate the population mean. A uniform random