PCPs and the Hardness of Generating Synthetic Data
Jonathan Ullman, Khoury College of Computer Science, Northeastern University, Boston, MA, USA. [email protected]
Salil Vadhan, Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, MA, USA. [email protected]
Communicated by Ran Canetti. Received 8 August 2014 / Revised 1 July 2020.
Abstract. Assuming the existence of one-way functions, we show that there is no polynomial-time differentially private algorithm A that takes a database D ∈ ({0,1}^d)^n and outputs a “synthetic database” D̂ all of whose two-way marginals are approximately equal to those of D. (A two-way marginal is the fraction of database rows x ∈ {0,1}^d with a given pair of values in a given pair of columns.) This answers a question of Barak et al. (PODS ‘07), who gave an algorithm running in time poly(n, 2^d). Our proof combines a construction of hard-to-sanitize databases based on digital signatures (by Dwork et al., STOC ‘09) with encodings based on the PCP theorem. We also present both negative and positive results for generating “relaxed” synthetic data, where the fraction of rows in D satisfying a predicate c is estimated by applying c to each row of D̂ and aggregating the results in some way.
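To make the statistic concrete, the following is a minimal sketch (ours, not taken from the paper) of how a single two-way marginal of a database D ∈ ({0,1}^d)^n can be computed; the function name and the toy data are hypothetical.

from typing import Sequence

def two_way_marginal(D: Sequence[Sequence[int]], i: int, j: int, a: int, b: int) -> float:
    """Fraction of rows x of D with x[i] == a and x[j] == b."""
    return sum(1 for x in D if x[i] == a and x[j] == b) / len(D)

# Toy database with n = 4 rows and d = 3 binary columns.
D = [[0, 1, 1],
     [1, 1, 0],
     [0, 1, 0],
     [1, 0, 1]]
print(two_way_marginal(D, i=0, j=1, a=1, b=1))  # 0.25

A synthetic database D̂ is accurate for two-way marginals if, for every choice of columns (i, j) and values (a, b), this quantity computed on D̂ is within a small additive error of the same quantity computed on D.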
1. Introduction

There are many settings in which it is desirable to share information about a database that contains sensitive information about individuals. For example, doctors may want to share information about health records with medical researchers, the federal government may want to release census data for public information, and a company like Netflix may want to provide its movie rental database for a public competition to develop a better recommendation system. However, it is important to do this in a way that preserves the “privacy” of the individuals whose records are in the database. Differential privacy has emerged over the past decade as a robust and mathematically rigorous definition of individual privacy (see the textbook [17]).

∗ A preliminary version of this work appeared in the Theory of Cryptography Conference 2011. J. Ullman: This work was done while the author was in the Harvard John A. Paulson School of Engineering and Applied Sciences. Supported by NSF Grant CNS-0831289. S. Vadhan: Supported by NSF Grant CNS-0831289.
Differential Privacy. A randomized algorithm A is defined to be differentially private [13] if for every two databases D = (x_1, . . . , x_n), D′ = (x′_1, . . . , x′_n) that differ on exactly one row, the distributions A(D) and A(D′) are “close” to each other. Formally, we require that A(D) and A(D′) assign the same probability mass to every event, up to a multiplicative factor of e^ε ≈ 1 + ε, where ε is typically taken to be a small constant. (In addition to this multiplicative factor, the probabilities are often allowed to differ by a negligible additive term.) This captures the idea that no individual’s data have a significant influence on the output.
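For reference, the condition just described, with both the multiplicative factor and the additive term (commonly denoted δ) made explicit, can be written as

\[
  \Pr[\mathcal{A}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta
  \qquad \text{for every event } S,
\]

for all pairs of databases D, D′ differing on exactly one row; the case δ = 0 is pure ε-differential privacy. (The symbols S and δ are not named in the paragraph above; they stand for an arbitrary event and the negligible additive term, respectively.)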