Design-Unbiased Statistical Learning in Survey Sampling
Luis Sanguiao Sande
Instituto Nacional de Estadística, Madrid, Spain
Li-Chun Zhang
University of Southampton, Southampton, UK
Statistisk Sentralbyraa, Oslo, Norway
Universitetet i Oslo, Oslo, Norway
Abstract
Design-consistent model-assisted estimation has become the standard practice in survey sampling. However, design consistency remains to be established for many machine-learning techniques that can potentially be very powerful assisting models. We propose a subsampling Rao-Blackwell method, and develop a statistical learning theory for exactly design-unbiased estimation with the help of linear or non-linear prediction models. Our approach makes use of classic ideas from Statistical Science as well as the rapidly growing field of Machine Learning. Provided rich auxiliary information is available, it can yield considerable efficiency gains over standard linear model-assisted methods, while ensuring valid estimation for the given target population that is robust against potential mis-specification of the assisting model, even when design consistency cannot be established for the plug-in model-assisted estimator obtained by following the standard recipe.
AMS (2000) subject classification. Primary 62D05; Secondary 62G05.
Keywords and phrases. Rao-Blackwellisation, Bagging, pq-unbiasedness, Stability conditions
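As a point of reference (notation ours, not quoted from the paper): for a finite population $U$ with known auxiliary values $x_i$, a sample $s \subset U$ drawn with inclusion probabilities $\pi_i$, and a prediction rule $\hat{\mu}$, the generalised difference estimator of the total $Y = \sum_{i \in U} y_i$ is
\[
\hat{Y} \;=\; \sum_{i\in U} \hat{\mu}(x_i) \;+\; \sum_{i\in s} \frac{y_i - \hat{\mu}(x_i)}{\pi_i} .
\]
If $\hat{\mu}$ is fixed in advance of sampling, then $E_p(\hat{Y}) = Y$ exactly under the sampling design $p$; fitting $\hat{\mu}$ on the whole sample, as in the plug-in recipe, breaks this exact unbiasedness. In outline, the pq-unbiasedness referred to above concerns unbiasedness over both the design $p$ and the additional subsampling randomisation $q$ introduced by the proposed method.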
1 Introduction

To make use of the available auxiliary information, approximately design-unbiased model-assisted estimation has become the standard practice in survey sampling, following influential works such as Särndal et al. (1992) and Deville and Särndal (1992). Breidt and Opsomer (2017) review the “general recipe”, and give examples of many Machine Learning (ML) techniques that
can be or have been embedded in the model-assisted framework, such as kernel methods, splines or neural networks. McConville and Toth (2019) observe that many of these sophisticated estimators “are rarely ever used by statistical agencies to produce official statistics”, because the underlying “models are often ill suited to the available auxiliary data.” McConville and Toth (2019) promote, intuitively, the regression tree estimator as a post-stratification estimator, where the post-strata are selected by the recursive partitioning algorithm that generates the regression tree based on the available data. The technical development of the regression tree estimator has taken many years. Gordon and Olshen (1978, 1980) establish the consistency of recursive partitioning algorithms, given independent and identically distributed (IID) training data. Toth and Eltinge (2011) extend their result, allowing for finite-population sampling in addition to the IID super-population model. McConville and Toth (2019) adapt the method of proof of Toth and Eltinge (2011) to establish the design consistency of the post-stratification estimator corresponding to the sample-trained regression tree. A closely related model is the random forest (RF), consisting of many trees built on randomly selected subsamples of the data.
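To make the “general recipe” concrete, the following minimal Python sketch fits a regression tree on a sample and plugs its predictions into the generalised difference estimator of a population total. The synthetic population, the simple random sampling design, and all names (pop_X, pi, and so on) are illustrative assumptions of ours, not taken from the paper or from McConville and Toth (2019).

```python
# Minimal sketch of the plug-in model-assisted ("general recipe") estimator
# with a sample-trained regression tree as the assisting model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic finite population U with one auxiliary variable x and study variable y.
N = 10_000
pop_X = rng.uniform(0, 10, size=(N, 1))
pop_y = 2.0 + 5.0 * np.sin(pop_X[:, 0]) + rng.normal(0.0, 1.0, N)

# Simple random sample without replacement; inclusion probabilities pi_i = n/N.
n = 500
sample_idx = rng.choice(N, size=n, replace=False)
pi = np.full(n, n / N)
xs, ys = pop_X[sample_idx], pop_y[sample_idx]

# Assisting model: a shallow regression tree trained on the sample,
# whose leaves act as data-driven post-strata.
tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=30, random_state=0)
tree.fit(xs, ys)

# Generalised difference estimator of the population total:
#   sum_{i in U} yhat_i  +  sum_{i in s} (y_i - yhat_i) / pi_i
yhat_pop = tree.predict(pop_X)
yhat_s = tree.predict(xs)
Y_hat = yhat_pop.sum() + np.sum((ys - yhat_s) / pi)

print("model-assisted estimate:", Y_hat)
print("Horvitz-Thompson estimate:", np.sum(ys / pi))
print("true total:", pop_y.sum())
```

Because the tree's predictions are constant within each leaf, under equal inclusion probabilities this plug-in estimator reduces to a post-stratified estimator whose post-strata are the tree's leaves, which is the estimator whose design consistency McConville and Toth (2019) establish.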