Disclosure Risk Assessment for Sample Microdata Through Probabilistic Modeling


Natalie Shlomo

Abstract Disclosure risk arises when there is a high probability that an intruder can identify an individual in released sample microdata, so that confidential information may be revealed. For some social surveys, the population from which the sample is drawn is generally not known or only partially known through marginal distributions. Identification is made possible through the use of a key, which is a combination of indirectly identifying variables, such as age, sex, and place of residence. Disclosure risk measures are based on the notion of population uniqueness in the key. To quantify the disclosure risk, probabilistic models are defined based on distributional assumptions about the population counts given the observed sample counts. The parameters of the distribution are estimated through log-linear models. The model selection criterion is based on a ‘minimum error’ test using a forward search algorithm. The methods are expanded to cover the case of complex survey designs and misclassification on the key variables, arising either from the survey process or from perturbative disclosure control techniques that may have been applied to the data. Variances and confidence intervals of estimated disclosure risk measures are also addressed. The methods are demonstrated on real data drawn from extracts of the 2001 UK Census. Possible extensions to the probabilistic modeling are presented based on a local polynomial regression smoothing technique in neighborhoods of the cells of the key.
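To make the notions of a key, sample uniqueness, and model-based risk concrete, the following is a minimal Python sketch, not the chapter's implementation. It cross-classifies a toy sample on a hypothetical key (age, sex, region), finds the sample-unique cells, and evaluates two risk measures under an assumed Poisson model in which the population count in cell k is Poisson with mean lambda_k and units are sampled independently with a common fraction pi. That model, the closed-form expressions, the function names, and all data values are assumptions made for illustration; in practice the lambda_k would be estimated, for example from a fitted log-linear model, rather than supplied by hand.

```python
from collections import Counter
from math import exp


def sample_cell_counts(records, key_vars):
    """Count the sample records f_k falling in each cell k of the key."""
    return Counter(tuple(r[v] for v in key_vars) for r in records)


def sample_uniques(counts):
    """Return the cells of the key that contain exactly one sample record."""
    return [cell for cell, f in counts.items() if f == 1]


def poisson_risk(uniques, lam, pi):
    """Risk measures over sample uniques under the assumed Poisson model.

    With F_k ~ Poisson(lam[k]) and sampling fraction pi, the unsampled part
    of a cell is Poisson with mean mu = lam[k] * (1 - pi). For a sample
    unique this gives
      P(F_k = 1 | f_k = 1)   = exp(-mu)
      E(1 / F_k | f_k = 1)   = (1 - exp(-mu)) / mu
    tau1 sums the first quantity (expected number of sample uniques that
    are population uniques); tau2 sums the second (expected number of
    correct matches).
    """
    tau1 = 0.0
    tau2 = 0.0
    for cell in uniques:
        mu = lam[cell] * (1.0 - pi)
        tau1 += exp(-mu)
        tau2 += (1.0 - exp(-mu)) / mu if mu > 0 else 1.0
    return tau1, tau2


# Hypothetical toy sample; values are invented for illustration only.
records = [
    {"age": "30-39", "sex": "F", "region": "North"},
    {"age": "30-39", "sex": "F", "region": "North"},
    {"age": "70-79", "sex": "M", "region": "South"},
    {"age": "20-29", "sex": "M", "region": "North"},
]
key_vars = ("age", "sex", "region")

counts = sample_cell_counts(records, key_vars)
uniques = sample_uniques(counts)

# Hypothetical population-mean parameters for the sample-unique cells.
lam = {
    ("70-79", "M", "South"): 2.0,
    ("20-29", "M", "North"): 50.0,
}
pi = 0.1  # assumed common sampling fraction

tau1, tau2 = poisson_risk(uniques, lam, pi)
print("sample uniques:", uniques)
print("tau1:", tau1, "tau2:", tau2)
```

In this sketch the sparsely populated cell (small lambda_k) contributes most of the risk, which reflects why sample uniques in rare combinations of the key are the records of concern.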

N. Shlomo, Southampton Statistical Sciences Research Institute, University of Southampton, Southampton, SO17 1BJ, UK

J. Nin, J. Herranz (eds.), Privacy and Anonymity in Information Management Systems, Advanced Information and Knowledge Processing, DOI 10.1007/978-1-84996-238-4_4, © Springer-Verlag London Limited 2010

4.1 Introduction

Statistical agencies face growing demands for the release of microdata while under legal, moral, and ethical obligations to preserve the confidentiality of respondents. The microdata released are generally based on samples arising from social surveys where the statistical unit is a household or an individual. Microdata from business surveys are typically not released because they are highly disclosive, owing to high sampling fractions and skewed distributions. Many statistical agencies have set up provisions for access to sample microdata from social surveys for research purposes under different modes of access, for example, public-use files, microdata under contract, special license agreements, on-site data labs, and data archives. Each of these modes of access might have different levels of disclosure risk protection depending on who is requesting the data. New developments in remote access and remote computation servers are possible solutions for data release and pose specific challenges for applying web-based dis