High dimensional model representation of log likelihood ratio: binary classification with SNP data


RESEARCH

Open Access

Ali Foroughi pour1,2, Maciej Pietrzak3,5, Lara E. Sucheston-Campbell6, Ezgi Karaesmen6, Lori A. Dalton1 and Grzegorz A. Rempała3,4*

From The International Conference on Intelligent Biology and Medicine (ICIBM) 2019, Columbus, OH, USA. 9-11 June 2019

Abstract

Background: Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values, which may be best understood as ordinal or categorical variables, but are treated as continuous by many algorithms.

Methods: We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the log-likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions.

Results: We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples LABS-HDMR-CO enjoys superior accuracy compared with several algorithms used for SNP classification, while also taking pairwise interactions into account. FPT declares very few significant interactions in the small sample GWAS datasets when bounding the false discovery rate (FDR) by 5%, due to the large number of tests performed. On the other hand, LABS-HDMR-CO utilizes a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO as significant.
Conclusion: LABS-HDMR-CO and FPT are interesting methods to design prediction rules and detect pairwise feature interactions for SNP data. Reliably detecting pairwise SNP interactions and taking advantage of potential interactions
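
To make the idea of a second order HDMR expansion concrete, the following is a minimal sketch, not the LABS-HDMR-CO regression procedure described in the abstract. It computes the constant, single-variable, and pairwise terms of a function tabulated on a full grid of categorical values, assuming a uniform product measure over genotypes. The function names `hdmr_second_order` and `evaluate`, the genotype coding 0/1/2, and the toy log-likelihood ratio table `llr` are all hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np


def hdmr_second_order(f_table):
    """Return HDMR terms up to order 2 for a function tabulated on a full grid.

    Assumes a uniform product measure over the grid (an assumption of this
    sketch, not something stated in the abstract).
    """
    d = f_table.ndim
    f0 = f_table.mean()  # constant term
    first = {}
    for i in range(d):
        other = tuple(k for k in range(d) if k != i)
        marginal = f_table.mean(axis=other) if other else f_table
        first[i] = marginal - f0  # effect of variable i alone
    second = {}
    for i, j in itertools.combinations(range(d), 2):
        other = tuple(k for k in range(d) if k not in (i, j))
        marginal = f_table.mean(axis=other) if other else f_table
        # pairwise term: what remains after removing lower-order effects
        second[(i, j)] = marginal - first[i][:, None] - first[j][None, :] - f0
    return f0, first, second


def evaluate(f0, first, second, x):
    """Evaluate the truncated (second order) expansion at grid point x."""
    total = f0
    for i, fi in first.items():
        total += fi[x[i]]
    for (i, j), fij in second.items():
        total += fij[x[i], x[j]]
    return total


# Hypothetical toy log-likelihood ratio over three SNPs coded 0, 1, 2:
# SNP 0 has an additive effect; SNPs 1 and 2 interact only when both equal 2.
llr = np.zeros((3, 3, 3))
for a, b, c in itertools.product(range(3), repeat=3):
    llr[a, b, c] = 0.5 * a + (1.0 if (b == 2 and c == 2) else 0.0)

f0, first, second = hdmr_second_order(llr)
approx = np.array(
    [[[evaluate(f0, first, second, (a, b, c)) for c in range(3)]
      for b in range(3)] for a in range(3)]
)
```

Because the toy table contains no three-way interaction, the second order expansion reproduces it on every grid point, and the pairwise terms involving SNP 0 vanish while the term for SNPs 1 and 2 does not. With real SNP data the log-likelihood ratio is unknown and the terms have to be estimated from samples, which is the role the abstract assigns to LABS-HDMR-CO.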

*Correspondence: [email protected]
3 Mathematical Biosciences Institute, 1735 Neil Ave, 43210 Columbus OH, USA
4 College of Public Health, The Ohio State University, 1841 Neil Ave, 43210 Columbus OH, USA
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction
