Bayesian copy number detection and association in large-scale studies


RESEARCH ARTICLE

Open Access

Stephen Cristiano1†, David McKean2†, Jacob Carey1†, Paige Bracci3, Paul Brennan4, Michael Chou5, Mengmeng Du6, Steven Gallinger7, Michael G. Goggins8,9, Manal M. Hassan10, Rayjean J. Hung7, Robert C. Kurtz11, Donghui Li12, Lingeng Lu13, Rachel Neale14, Sara Olson6, Gloria Petersen15, Kari G. Rabe15, Jack Fu1, Harvey Risch13, Gary L. Rosner1,10, Ingo Ruczinski1, Alison P. Klein2,5,9* and Robert B. Scharpf1,2*

Abstract

Background: Germline copy number variants (CNVs) increase risk for many diseases, yet detecting CNVs and quantifying their contribution to disease risk in large-scale studies is challenging due to biological and technical sources of heterogeneity that vary across the genome within and between samples.

Methods: We developed an approach called CNPBayes to identify latent batch effects in genome-wide association studies involving copy number, to provide probabilistic estimates of integer copy number across the estimated batches, and to fully integrate the copy number uncertainty in the association model for disease.

Results: Applying a hidden Markov model (HMM) to identify CNVs in a large multi-site Pancreatic Cancer Case Control study (PanC4) of 7598 participants, we found that CNV inference was highly sensitive to technical noise that varied appreciably among participants. Applying CNPBayes to this dataset, we found that the major sources of technical variation were linked to sample processing by the centralized laboratory and not the individual study sites. Modeling the latent batch effects at each CNV region hierarchically, we developed probabilistic estimates of copy number that were directly incorporated in a Bayesian regression model for pancreatic cancer risk. Candidate associations aided by this approach include deletions of 8q24 near regulatory elements of the tumor oncogene MYC and of Tumor Suppressor Candidate 3 (TUSC3).

Conclusions: Laboratory effects may not account for the major sources of technical variation in genome-wide association studies. This study provides a robust Bayesian inferential framework for identifying latent batch effects, estimating copy number, and evaluating the role of copy number in heritable diseases.

Keywords: Pancreatic cancer, SNP array, Copy number variants, Genome-wide association, CNPBayes, Batch effects
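To make the idea of batch-specific probabilistic copy number concrete, the following is a minimal Python sketch. It is not the CNPBayes implementation. It assumes, purely for illustration, a Gaussian mixture over a per-sample summary intensity (a log ratio) in which each batch has its own component means, and it treats all parameters as known rather than estimated hierarchically. The batch labels, means, weights, standard deviation, and function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def copy_number_posterior(log_ratio, batch, means_by_batch, sd, weights):
    """Posterior probability of each copy number state for one sample.

    Each state's prior weight is multiplied by the likelihood of the
    observed log ratio under that state's batch-specific mean, and the
    products are normalised to sum to one.
    """
    means = means_by_batch[batch]
    unnormalised = [w * normal_pdf(log_ratio, m, sd)
                    for w, m in zip(weights, means)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def expected_copy_number(posterior, states=(0, 1, 2)):
    """Posterior mean copy number: a simple summary of the uncertainty."""
    return sum(p * s for p, s in zip(posterior, states))

# Hypothetical parameters (assumptions, not values from the study):
# component means for copy number 0, 1, 2 in two batches "A" and "B".
MEANS_BY_BATCH = {"A": (-2.0, -0.5, 0.0), "B": (-1.8, -0.3, 0.2)}
WEIGHTS = (0.01, 0.09, 0.90)  # prior frequency of copy number 0, 1, 2
SD = 0.1                      # shared within-component standard deviation

# The same observed log ratio is interpreted differently in each batch.
post_a = copy_number_posterior(-0.3, "A", MEANS_BY_BATCH, SD, WEIGHTS)
post_b = copy_number_posterior(-0.3, "B", MEANS_BY_BATCH, SD, WEIGHTS)
print("batch A:", [round(p, 3) for p in post_a], expected_copy_number(post_a))
print("batch B:", [round(p, 3) for p in post_b], expected_copy_number(post_b))
```

Under these assumed values, a log ratio of -0.3 is ambiguous between one and two copies in batch A but is almost certainly a single-copy deletion in batch B, which is why ignoring batch can distort copy number calls. The posterior mean is only a simplified summary; the approach described in the abstract integrates the full copy number uncertainty into the disease association model.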

*Correspondence: [email protected]; [email protected]
† Stephen Cristiano, David McKean, and Jacob Carey contributed equally to this work.
2 Department of Oncology, The Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD, USA
5 Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium