Design-Unbiased Statistical Learning in Survey Sampling
Luis Sanguiao Sande
Instituto Nacional de Estadística, Madrid, Spain
Li-Chun Zhang
University of Southampton, Southampton, UK
Statistisk Sentralbyraa, Oslo, Norway
Universitetet i Oslo, Oslo, Norway
Abstract
Design-consistent model-assisted estimation has become the standard practice in survey sampling. However, design consistency remains to be established for many machine-learning techniques that can potentially be very powerful assisting models. We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation with the help of linear or non-linear prediction models. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning. Provided rich auxiliary information is available, it can yield considerable efficiency gains over standard linear model-assisted methods, while ensuring valid estimation for the given target population that is robust against potential mis-specification of the assisting model, even when design consistency cannot be established for the plug-in model-assisted estimator obtained by following the standard recipe.
AMS (2000) subject classification. Primary 62D05; Secondary 62G05.
Keywords and phrases. Rao-Blackwellisation, Bagging, pq-unbiasedness, Stability conditions
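As a point of reference (notation ours, not quoted from the paper): for a finite population $U$ with known auxiliary values $x_i$, a sample $s \subset U$ drawn with inclusion probabilities $\pi_i$, and a prediction rule $\hat{\mu}$, the generalised difference estimator of the total $Y = \sum_{i \in U} y_i$ is
\[
\hat{Y} \;=\; \sum_{i\in U} \hat{\mu}(x_i) \;+\; \sum_{i\in s} \frac{y_i - \hat{\mu}(x_i)}{\pi_i} .
\]
If $\hat{\mu}$ is fixed in advance of sampling, then $E_p(\hat{Y}) = Y$ exactly under the sampling design $p$; fitting $\hat{\mu}$ on the whole sample, as in the plug-in recipe, breaks this exact unbiasedness. In outline, the pq-unbiasedness referred to above concerns unbiasedness over both the design $p$ and the additional subsampling randomisation $q$ introduced by the proposed method.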
1 Introduction

To make use of the available auxiliary information, approximately design-unbiased model-assisted estimation has become the standard practice in survey sampling, following influential works such as Särndal et al. (1992) and Deville and Särndal (1992). Breidt and Opsomer (2017) review the “general recipe”, and give examples of many Machine Learning (ML) techniques that
can be or have been embedded in the model-assisted framework, such as kernel methods, splines or neural networks. McConville and Toth (2019) observe that many of these sophisticated estimators “are rarely ever used by statistical agencies to produce official statistics”, because the underlying “models are often ill suited to the available auxiliary data.” McConville and Toth (2019) promote, intuitively, the regression tree estimator as a post-stratification estimator, where the post-strata are selected by the recursive partitioning algorithm that generates the regression tree based on the available data. The technical development of the regression tree estimator has taken many years. Gordon and Olshen (1978, 1980) establish the consistency of recursive partitioning algorithms, given independent and identically distributed (IID) training data. Toth and Eltinge (2011) extend their result, allowing for finite-population sampling in addition to the IID super-population model. McConville and Toth (2019) adapt the method of proof of Toth and Eltinge (2011) to establish the design consistency of the post-stratification estimator corresponding to the sample-trained regression tree. A closely related model is the random forest (RF), consisting of many trees built on randomly selected subsamples of the data.
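To make the “general recipe” concrete, the following minimal Python sketch fits a regression tree on a sample and plugs its predictions into the generalised difference estimator of a population total. The synthetic population, the simple random sampling design, and all names (pop_X, pi, and so on) are illustrative assumptions of ours, not taken from the paper or from McConville and Toth (2019).

```python
# Minimal sketch of the plug-in model-assisted ("general recipe") estimator
# with a sample-trained regression tree as the assisting model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic finite population U with one auxiliary variable x and study variable y.
N = 10_000
pop_X = rng.uniform(0, 10, size=(N, 1))
pop_y = 2.0 + 5.0 * np.sin(pop_X[:, 0]) + rng.normal(0.0, 1.0, N)

# Simple random sample without replacement; inclusion probabilities pi_i = n/N.
n = 500
sample_idx = rng.choice(N, size=n, replace=False)
pi = np.full(n, n / N)
xs, ys = pop_X[sample_idx], pop_y[sample_idx]

# Assisting model: a shallow regression tree trained on the sample,
# whose leaves act as data-driven post-strata.
tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=30, random_state=0)
tree.fit(xs, ys)

# Generalised difference estimator of the population total:
#   sum_{i in U} yhat_i  +  sum_{i in s} (y_i - yhat_i) / pi_i
yhat_pop = tree.predict(pop_X)
yhat_s = tree.predict(xs)
Y_hat = yhat_pop.sum() + np.sum((ys - yhat_s) / pi)

print("model-assisted estimate:", Y_hat)
print("Horvitz-Thompson estimate:", np.sum(ys / pi))
print("true total:", pop_y.sum())
```

Because the tree's predictions are constant within each leaf, under equal inclusion probabilities this plug-in estimator reduces to a post-stratified estimator whose post-strata are the tree's leaves, which is the estimator whose design consistency McConville and Toth (2019) establish.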