A Post-randomization Method for Rigorous Identification Risk Control in Releasing Microdata


Xiaoyu Zhai¹ · Tapan K. Nayak²,³

Accepted: 15 October 2020
© This is a U.S. government work and its text is not subject to copyright protection in the United States; however, its text may be subject to foreign copyright protection 2020

Abstract
One significant concern in releasing survey microdata is the possibility of identifying the records of some survey units by matching the values of some of the variables, called key or pseudo-identifying variables, whose values can be obtained easily from other sources. For categorical key variables, Nayak et al. (Int Stat Rev 86(2): 300–321, 2018) developed a novel approach for measuring and controlling identification risks. For any ξ > 1/3, it can guarantee that any unit’s probability of correct identification would not exceed ξ. We present another post-randomization method for giving that guarantee more stringently, even for ξ ≤ 1/3. We use data partitioning and unbiased post-randomization as two effective tools for preserving data utility. We illustrate and assess the procedure by applying it to a data set publicly released by the U.S. Census Bureau.

Keywords: Identity disclosure · Key variable · Data partitioning · Post-randomization block · Data utility

The views expressed in this article are those of the authors and not those of the U.S. Census Bureau.
X. Zhai: Her work was completed while she was a doctoral student at The George Washington University.

* Tapan K. Nayak [email protected]

¹ Facebook AI Applied Research, Menlo Park, CA 94025, USA
² Center for Statistical Research and Methodology, U.S. Census Bureau, Washington, DC 20233, USA
³ Department of Statistics, George Washington University, Washington, DC 20052, USA




Journal of Statistical Theory and Practice (2021) 15:8

1 Introduction

One basic goal of most statistical agencies is to collect and release data to assist research and inform the public and policy makers. However, the original data may reveal private information about some of the survey participants or units, even if name, social security number and other direct identifiers are removed. In a microdata set that contains each unit’s values for many variables, one might be able to correctly identify the records of a target unit by matching the values of some of the variables, such as gender, race and occupation, which can be learned easily from other sources. Then, one can learn the identified unit’s values for all other variables. This is called identity disclosure and it is regarded as one of the most severe forms of exposing a respondent’s private information. In this paper, we focus on identity disclosure in microdata release and controlling identification risks. However, data confidentiality breaches may occur in many other forms and even when only data summaries are released. For discussions about other types of disclosure and various disclosure control methods, such as grouping, data swapping, cell suppression,