Synthesizing Quality Open Data Assets from Private Health Research Studies
1 Rensselaer Polytechnic Institute, Troy, NY, USA [email protected]
2 BITS Pilani, Goa Campus, Goa, India
3 UPSud/INRIA University Paris-Saclay, Paris-Saclay, Paris, France
4 OptumLabs Visiting Fellow, San Francisco, USA
Abstract. Generating synthetic data represents an attractive solution for creating open data, enabling health research and education while preserving patient privacy. We reproduce the research outcomes obtained on two previously published studies, which used private health data, using synthetic data generated with a method that we developed, called HealthGAN. We demonstrate the value of our methodology for generating and evaluating the quality and privacy of synthetic health data. The datasets are from the OptumLabs® Data Warehouse (OLDW). The OLDW is accessed within a secure environment and does not allow exporting of patient-level data of any type, real or synthetic; therefore, HealthGAN exports a privacy-preserving generator model instead. The studies examine questions related to comorbidities of Autism Spectrum Disorder (ASD) using medical records of children with ASD and matched patients without ASD. HealthGAN generates high-quality synthetic data that produce similar results while preserving patient privacy. By creating synthetic versions of these datasets that maintain privacy and achieve a high level of resemblance and utility, we create valuable open health data assets for future research and education efforts.
1 Introduction
The inability to share private health data can stifle research and education activities. For example, studies based on unpublished electronic medical record (EMR) data cannot be reproduced, so future researchers cannot use them to develop and compare new research. This contributes to the reproducibility crisis in biomedical research [3]. Making open data available for research can spur innovation and research. The public Medical Information Mart for Intensive Care datasets, MIMIC-II and MIMIC-III, are widely used, with over 2000 citations reported in Google Scholar in March 2020 [7,10]. But since MIMIC-II and MIMIC-III focus on Intensive Care Unit patients in Boston hospitals, the resulting research may be biased and have limited generalization. Also, since MIMIC requires users to undergo a training/approval process, it is not well suited for classroom use. The cost and time required, along with re-identification risk concerns, make de-identification only a partial solution to this problem. Recent synthetic data generation methods provide an attractive alternative for making data available for research and education purposes without violating privacy. Deep learning approaches for synthetic data specifically show significant promise [1,6,8]. In the future, synthetic data generation methods combined with automatic machin
© Springer Nature Switzerland AG 2020. W. Abramowicz and G. Klein (Eds.): BIS 2020 Workshops, LNBIP 394, pp. 324–335, 2020. https://doi.org/10.1007/978-3-030-61146-0_26