A firm foundation for statistical disclosure control



Theory and Practice of Surveys

Nobuaki Hoshino¹
[email protected]-u.ac.jp

¹ School of Economics, Kanazawa University, Kakuma-machi, Kanazawa 920-1192, Japan

Received: 1 April 2020 / Accepted: 29 July 2020
© Japanese Federation of Statistical Science Associations 2020

Abstract
The present article reviews the theory of data privacy and confidentiality in statistics and computer science in order to modernize the theory of anonymization. This effort results in mathematical definitions of identity disclosure and attribute disclosure that are applicable even to synthetic data. Differential privacy is also clarified as a method to bound the accuracy of population inference. This bound is derived from the Hammersley-Chapman-Robbins inequality, and it leads to an intuitive selection of the privacy budget ε of differential privacy.

Keywords: Differential privacy · Population unique · Privacy budget · Synthetic data
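For reference, the Hammersley-Chapman-Robbins inequality invoked here is the standard variance bound; in generic notation (the paper's own parameterization may differ), for any estimator $T(X)$ that is unbiased for $g(\theta)$ under a family of densities $p(x;\theta)$,

$$\operatorname{Var}_{\theta}\bigl(T(X)\bigr) \;\ge\; \sup_{\theta'} \frac{\bigl(g(\theta') - g(\theta)\bigr)^{2}}{\mathbb{E}_{\theta}\!\left[\bigl(p(X;\theta')/p(X;\theta)\bigr)^{2}\right] - 1}.$$

Unlike the Cramér-Rao inequality, this bound requires no differentiability in $\theta$, so it remains applicable in discrete parameter settings.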

1 Introduction

The current practice of publishing official statistics faces distrust about the protection of identity. The President's Council of Advisors on Science and Technology (2014, pp. 38–39) states that "anonymization of a data record might seem easy to implement," but "as the size and diversity of available data grows, the likelihood of being able to re-identify individuals grows substantially," and "(anonymization) is not robust against near-term future re-identification methods." The background of these statements appears to be the realized failures of anonymization in the private sector. Anonymization is not easy to implement at all; it requires artisanship to produce a future-proof data product, and apprentices have accordingly made errors. Statisticians should recognize the need for more effort to theorize the artisanship of anonymization. So far, the statistical theory of anonymization has lacked a firm definition of anonymity, which likely also contributes to the distrust. In computer science, Dwork (2006) proposes the notion of differential privacy (DP), a package of a clear definition of data protection and a method that is easy to implement. These two factors are what statisticians' artisanship lacks.
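For readers unfamiliar with DP, the clear definition mentioned above can be stated concisely. The following is the standard ε-DP formulation from Dwork (2006), given here for reference in generic notation, not necessarily that of this paper: a randomized mechanism $M$ is $\epsilon$-differentially private if, for every pair of datasets $D$ and $D'$ differing in a single record and every measurable set $S$ of outputs,

$$\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\epsilon}\,\Pr\bigl[M(D') \in S\bigr].$$

The easy-to-implement method is typified by the Laplace mechanism. The sketch below is an illustration under the textbook setting of a single counting query of sensitivity 1, not a procedure taken from this paper:

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace noise of scale sensitivity/epsilon;
    for a single query of the stated sensitivity, this satisfies
    epsilon-differential privacy."""
    rng = np.random.default_rng() if rng is None else rng
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A stringent budget such as epsilon = 0.1 yields noise with standard
# deviation sqrt(2) / 0.1 ~= 14.1, so small counts become unreliable.
print(laplace_count(true_count=120, epsilon=0.1))
```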



Accordingly, research on DP has exploded; Zhu et al. (2017) survey a part of this literature. Even the practice of official statistics has been affected by DP: Reiter (2019) favorably introduces the role that DP can and does play in official statistics. However, as Ruggles et al. (2019) state, DP "goes far beyond what is necessary to keep data safe under census law and precedent." The goal of DP is more stringent than that of traditional statistical practices, and hence DP tends to result in data that are useless for scientific purposes. Ruggles and others, in particular Bambauer et al. (2013), criticize this lack of concern for data users. Weak statistical protection combined with other institutional measures constitutes our wisdom for publishing usable data. However, Dwork's easy protection method