Statistical data integration in survey sampling: a review


Theory and Practice of Surveys

Shu Yang1 · Jae Kwang Kim2

Received: 6 January 2020 / Accepted: 13 September 2020
© Japanese Federation of Statistical Science Associations 2020

Abstract

Finite population inference is a central goal in survey sampling. Probability sampling is the main statistical approach to finite population inference. Challenges arise due to high cost and increasing non-response rates. Data integration provides a timely solution by leveraging multiple data sources to provide more robust and efficient inference than using any single data source alone. The technique for data integration varies depending on types of samples and available information to be combined. This article provides a systematic review of data integration techniques for combining probability samples, probability and non-probability samples, and probability and big data samples. We discuss a wide range of integration methods such as generalized least squares, calibration weighting, inverse probability weighting, mass imputation, and doubly robust methods. Finally, we highlight important questions for future research.

Keywords Generalizability · Meta-analysis · Missing at random · Transportability

Japanese Journal of Statistics and Data Science

* Jae Kwang Kim [email protected]
1 Department of Statistics, North Carolina State University, Raleigh, USA
2 Department of Statistics, Iowa State University, Ames, USA

1 Introduction

Probability sampling is regarded as the gold standard in survey statistics for finite population inference. Fundamentally, probability samples are selected under known sampling designs and, therefore, are representative of the target population. Because the selection probability is known, the subsequent inference from a probability sample is often design-based and respects the way in which the data were collected; see Särndal et al. (2003), Cochran (1977) and Fuller (2009) for textbook discussions. Kalton (2019) provided a comprehensive overview of the survey sampling research in the last 60 years. However, many practical challenges arise in collecting and analyzing probability sample data (Baker et al. 2013; Keiding and Louis 2016). Large-scale survey programs continually face heightened demands coupled with reduced resources. Demands include requests for estimates for domains with small sample sizes and desires for more timely estimates. Simultaneously, program budget cuts force reductions in sample sizes, and decreasing response rates make non-response bias an important concern.

Data integration is a new area of research to provide a timely solution to the above challenges. The goal is multi-fold: (1) minimize the cost associated with surveys, (2) minimize the respondent burden, and (3) maximize the statistical information or equivalently the efficiency of survey estimation. Narrowly speaking, survey integration means combining separate probability samples into one survey instrument (Bycroft 2010). Broadly speakin