A partitioned quasi-likelihood for distributed statistical inference



Guangbao Guo1 · Yue Sun1 · Xuejun Jiang2

Received: 23 October 2018 / Accepted: 3 March 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract  In the big data setting, working data sets are often distributed across multiple machines, yet classical statistical methods are typically designed for estimation and inference on a single machine. We propose a novel parallel quasi-likelihood method for generalized linear models that keeps the variances of the different sub-estimators relatively similar. Sub-estimators are obtained from projection subsets of the data and then combined with suitably chosen weights. We show that the proposed method achieves better asymptotic efficiency than the simple average. Furthermore, simulation examples show that the proposed method can significantly improve statistical inference.

Keywords  Distributed statistical inference · Parallel computing · Quasi-likelihood · Projection matrix · Distributed data
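To make the divide-and-combine idea concrete, the following is a minimal Python sketch: it fits a logistic-regression GLM on disjoint data subsets and compares the simple average of the sub-estimators with a weighted combination. All function names are illustrative, and the inverse-trace weights are a generic precision-style stand-in, not the paper's projection-based construction.

    import numpy as np

    def irls_logistic(X, y, n_iter=25):
        # Fit a logistic GLM by iteratively reweighted least squares;
        # return the estimate and its estimated asymptotic covariance.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            mu = 1.0 / (1.0 + np.exp(-X @ beta))
            W = mu * (1.0 - mu)                    # GLM variance function
            H = X.T @ (W[:, None] * X)             # Fisher information
            beta = beta + np.linalg.solve(H, X.T @ (y - mu))
        return beta, np.linalg.inv(H)

    rng = np.random.default_rng(0)
    n, p, K = 20000, 5, 10                         # K = number of machines
    beta_true = rng.normal(size=p)
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

    # Fit independently on K disjoint subsets, then combine.
    fits = [irls_logistic(Xk, yk)
            for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))]
    betas = np.array([b for b, _ in fits])
    w = np.array([1.0 / np.trace(c) for _, c in fits])  # precision-style weights
    w = w / w.sum()

    beta_avg = betas.mean(axis=0)                  # simple average
    beta_wgt = w @ betas                           # weighted combination
    print(np.linalg.norm(beta_avg - beta_true),
          np.linalg.norm(beta_wgt - beta_true))

With homogeneous subsets the two combinations behave similarly; weighting pays off when subset sizes or designs differ, which is where the paper's weighted combination is argued to gain efficiency over the simple average.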

1 Introduction

Distributed data has received increasing attention in the modern big data era, owing to developments in data storage and computing environments. As a result, procedures for aggregating statistical inferences face unprecedented opportunities and challenges. Parallel and distributed statistical inference has become an important and popular research topic. For example, distributed estimators are obtained from parallel subsets, e.g., Battey et al. (2015), Huang and Huo (2015), Sengupta et al. (2016), Hasenclever et al. (2017), among others.

Electronic supplementary material  The online version of this article (https://doi.org/10.1007/s00180-020-00974-4) contains supplementary material, which is available to authorized users.

Guangbao Guo (corresponding author)
[email protected]

1  Department of Statistics, Shandong University of Technology, Zibo 255000, China

2  Department of Mathematics, Southern University of Science and Technology, Shenzhen 518000, China


Parallel MCMC methods have been proposed through functional decomposition; see Pratola et al. (2014), Song and Liang (2015), among others. Scalable methods can be achieved by using weighted averages, e.g., Kleiner et al. (2014), Guo et al. (2015), and Owen et al. (2015). Specifically, Battey et al. (2015) employed a divide-and-conquer procedure to derive distributed tests and estimators for distributed data, and mainly discussed the selection of the optimal number of disjoint subsets. By dividing the data set into several disjoint subsets, Huang and Huo (2015) devised a one-step averaging estimator that improves the performance of the distributed estimator. To estimate the precision of distributed inference methods, Sengupta et al. (2016) developed a subsampled double bootstrap covering distributed data subsets and discussed its optimal block length. Hasenclever et al. (2017) proposed a distributed Bayesian learning architecture to deal with disjoint data subsets, and developed stochastic natural-gradient expectation propagation.
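As a rough illustration of the one-step averaging idea attributed above to Huang and Huo (2015): starting from the simple average of the subset estimators, one aggregated Newton step is taken using scores and information pooled across all subsets. This is a sketch under simplifying assumptions (a logistic GLM, hypothetical helper names), not the authors' implementation.

    import numpy as np

    def score_and_info(X, y, beta):
        # Logistic-regression score vector and Fisher information at beta.
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        return X.T @ (y - mu), X.T @ ((mu * (1.0 - mu))[:, None] * X)

    def one_step_average(subsets, beta_bar):
        # Refine the averaged estimator with a single distributed Newton
        # step: each machine ships only its local score and information
        # evaluated at beta_bar, i.e., one O(p^2) reduction.
        p = beta_bar.size
        U, H = np.zeros(p), np.zeros((p, p))
        for Xk, yk in subsets:
            Uk, Hk = score_and_info(Xk, yk, beta_bar)
            U += Uk
            H += Hk
        return beta_bar + np.linalg.solve(H, U)

With the arrays from the earlier sketch, one_step_average(list(zip(np.array_split(X, K), np.array_split(y, K))), beta_avg) refines the averaged estimator at the cost of a single extra communication round.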