Mixed logistic regression in genome-wide association studies
METHODOLOGY ARTICLE
Open Access
Jacqueline Milet1, David Courtin1, André Garcia1 and Hervé Perdry2*
*Correspondence: [email protected]
2 Université Paris-Saclay, UVSQ, Inserm, CESP, 94807 Villejuif, France
Full list of author information is available at the end of the article
Abstract

Background: Mixed linear models (MLM) have been widely used to account for population structure in case-control genome-wide association studies, the status being analyzed as a quantitative phenotype. Chen et al. proved in 2016 that this method is inappropriate in some situations and proposed GMMAT, a score test for the mixed logistic regression (MLR). However, this test does not produce an estimate of the variants' effects. We propose two computationally efficient methods to estimate the variants' effects. Their properties and those of other methods (MLM, logistic regression) are evaluated using both simulated and real genomic data from a recent GWAS in two geographically close populations in West Africa.

Results: We show that, when the disease prevalence differs between population strata, MLM is inappropriate for analyzing binary traits. MLR performs best in all circumstances. The variants' effects are well estimated by our methods, with a moderate bias when the effect sizes are large. Additionally, we propose a stratified QQ-plot, which improves the diagnosis of p-value inflation or deflation when population strata are not clearly identified in the sample.

Conclusion: The two proposed methods are implemented in the R package milorGWAS, available on CRAN. Both methods scale up to at least 10,000 individuals. The same computational strategies could be applied to other models (e.g. mixed Cox model for survival analysis).

Keywords: GWAS, Mixed-models, Logistic regression
Background

Population stratification has long been known to cause spurious associations in genetic association studies [1]: if the frequency of the phenotype of interest varies across the population strata, the phenotype will be associated with any allele whose frequency varies accordingly. An early and elegant solution to this issue was the use of family data, notably in the Transmission Disequilibrium Test (TDT) [2] and in the Family Based Association Test (FBAT) [3]. However, these methods required the ascertainment and genotyping of affected individuals' relatives, which limited their practical feasibility. The advent of Genome-Wide Association Studies (GWAS), demanding increasingly large samples to detect weaker and weaker effects, made the problem even more acute.

© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
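The stratification mechanism described in the Background can be made concrete with a small worked example. The sketch below is illustrative only and is not taken from the article: the two strata, their sizes, prevalences and carrier frequencies are hypothetical values chosen for the arithmetic, and the Mantel-Haenszel odds ratio is a classical stratified estimator used here in place of the mixed models the article studies. Within each stratum, carrier status and disease are independent (odds ratio 1), yet pooling the strata produces an apparent association.

```python
from fractions import Fraction


def expected_table(n, prevalence, carrier_freq):
    """Expected 2x2 counts when disease and carrier status are independent.

    Rows: carrier, non-carrier. Columns: case, control.
    """
    return [
        [n * carrier_freq * prevalence, n * carrier_freq * (1 - prevalence)],
        [n * (1 - carrier_freq) * prevalence,
         n * (1 - carrier_freq) * (1 - prevalence)],
    ]


def odds_ratio(table):
    """Odds ratio (a*d)/(b*c) of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def pool(tables):
    """Cell-wise sum of 2x2 tables, i.e. an analysis that ignores the strata."""
    return [[sum(t[i][j] for t in tables) for j in range(2)] for i in range(2)]


def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds ratio, which combines the strata without pooling."""
    num = sum(t[0][0] * t[1][1] / sum(map(sum, t)) for t in tables)
    den = sum(t[0][1] * t[1][0] / sum(map(sum, t)) for t in tables)
    return num / den


# Hypothetical strata: both the prevalence and the carrier frequency are
# higher in stratum B, but the variant has no effect in either stratum.
stratum_a = expected_table(1000, Fraction(1, 10), Fraction(1, 5))
stratum_b = expected_table(1000, Fraction(3, 10), Fraction(3, 5))

or_a = odds_ratio(stratum_a)                            # 1 within stratum A
or_b = odds_ratio(stratum_b)                            # 1 within stratum B
or_pooled = odds_ratio(pool([stratum_a, stratum_b]))    # 5/3: spurious signal
or_mh = mantel_haenszel_or([stratum_a, stratum_b])      # 1 once strata are used

print(or_a, or_b, or_pooled, or_mh)
```

Exact fractions are used so that the odds ratios come out as exact values rather than floating-point approximations. The stratified estimator only works here because the strata are known; when they are not clearly identified in the sample, a model-based correction is needed instead.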