Streaming statistical models via Merge & Reduce




Leo N. Geppert 1 · Katja Ickstadt 1 · Alexander Munteanu 1,2 · Christian Sohler 3

Received: 9 August 2019 / Accepted: 25 May 2020
© The Author(s) 2020

Contact: Leo N. Geppert [email protected] · Katja Ickstadt [email protected] · Alexander Munteanu [email protected] · Christian Sohler [email protected]

1 Department of Statistics, Technische Universität Dortmund, 44221 Dortmund, Germany
2 Department of Computer Science, Technische Universität Dortmund, 44221 Dortmund, Germany
3 Institute of Computer Science, University of Cologne, Weyertal 121, 50931 Köln, Germany

Abstract

Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures, which support only queries, into dynamic data structures, which also allow insertions of new elements, with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings in which Merge & Reduce has been employed: instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(log n) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and to design merge and reduce operations for the specific type of model on a case-by-case basis. We illustrate our Merge & Reduce schemes on simulated and real-world data, employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models.

Keywords: Mergeable statistical models · Large data · Streaming · Distributed · Regression analysis
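To illustrate the scheme described in the abstract, the following minimal Python sketch shows one generic way to organize the streaming computation: fit a model on each fixed-size batch, keep at most one summary per level, and propagate merged and reduced summaries upwards like a binary counter, so that only O(log n) summaries are held in memory at any time. This is an illustrative reading of the framework, not the authors' code; the callables fit, merge and reduce, the equal batch sizes and the per-level bookkeeping are assumptions made for the example.

    from typing import Callable, List, Optional, TypeVar

    Batch = TypeVar("Batch")      # a raw block of observations
    Summary = TypeVar("Summary")  # a fitted model / summary of one or more blocks


    def merge_and_reduce(
        batches,                                        # iterable of raw data batches
        fit: Callable[[Batch], Summary],                # fit a model on a single batch
        merge: Callable[[Summary, Summary], Summary],   # combine two same-level summaries
        reduce: Callable[[Summary], Summary],           # shrink a merged summary back to target size
    ) -> Optional[Summary]:
        """Generic Merge & Reduce over a stream of equally sized batches.

        At most one summary is kept per level, so for n observations at most
        O(log n) summaries are stored at any point in time.
        """
        levels: List[Optional[Summary]] = []  # levels[i] summarizes 2**i batches (or is empty)

        for batch in batches:
            carry = fit(batch)  # level-0 summary of the current batch
            i = 0
            # Propagate upwards like a binary counter: whenever a level is
            # already occupied, merge, reduce, and move the result one level up.
            while i < len(levels) and levels[i] is not None:
                carry = reduce(merge(levels[i], carry))
                levels[i] = None
                i += 1
            if i == len(levels):
                levels.append(None)
            levels[i] = carry

        # Combine the at most O(log n) remaining summaries into the final result.
        result: Optional[Summary] = None
        for summary in levels:
            if summary is not None:
                result = summary if result is None else reduce(merge(result, summary))
        return result


    # Toy usage: the "model" is just (sum, count), so the result recovers the overall mean.
    stream = ([float(x) for x in range(k * 100, (k + 1) * 100)] for k in range(50))
    out = merge_and_reduce(
        stream,
        fit=lambda b: (sum(b), len(b)),
        merge=lambda s, t: (s[0] + t[0], s[1] + t[1]),
        reduce=lambda s: s,  # nothing to shrink for this toy summary
    )
    print(out[0] / out[1])  # mean of all 5000 streamed values

For linear regression, for instance, a level-0 summary could consist of the fitted coefficients together with the number of underlying observations, with merge and reduce forming a suitably weighted combination; designing such operations for each class of models is exactly the case-by-case work mentioned above.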

1 Introduction

In recent times, data sets with a massive number of observations have become more and more common, making scalability one of the main challenges of modern data analysis. For many statistical methods, such amounts of data lead to an enormous consumption of resources. A prominent example is linear regression, an important statistical tool in both frequentist and Bayesian settings. On very large data sets with n observations and d variables (n ≫ d), carrying out regression analysis becomes increasingly demanding in terms of running time and memory consumption, making the analysis practically tedious or even impossible. This is especially true when the data do not fit into fast internal memory but have to be read over and over again from slow external memory devices or databases, which blows up the wall-clock time, i.e., the actual elapsed time, by orders of magnitude. In this paper