Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses


METHODOLOGY ARTICLE

Open Access

Elisabetta Manduchi1,2*, Weixuan Fu2, Joseph D. Romano1, Stefano Ruberto1 and Jason H. Moore1,2

*Correspondence: manduchi@pennmedicine.upenn.edu
1 Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Full list of author information is available at the end of the article

Abstract

Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids ‘leakage’ during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.

Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.

Keywords: AutoML, Covariate adjustment, Genetic programming, Pathways, Feature importance
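The idea of regressing out covariates without leakage can be illustrated with a minimal sketch. It is not the TPOT extension itself: the function names `residualize` and `cv_score_with_adjustment`, the use of ordinary least squares as the adjustment model, the ridge predictor, and the synthetic data are all assumptions made for illustration. The point it shows is that the adjustment model is fitted on the training fold only and then applied unchanged to the held-out fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold


def residualize(train_cov, train_vals, test_cov, test_vals):
    """Fit covariates -> values on the training rows only, then return
    residuals for both the training and the held-out rows."""
    adjuster = LinearRegression().fit(train_cov, train_vals)
    return (train_vals - adjuster.predict(train_cov),
            test_vals - adjuster.predict(test_cov))


def cv_score_with_adjustment(X, y, C, n_splits=5, seed=0):
    """Mean held-out R^2 of a ridge model trained on covariate-adjusted
    features and target (hypothetical helper, for illustration only)."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train, test in folds.split(X):
        # The adjustment never sees held-out rows while being fitted.
        X_tr, X_te = residualize(C[train], X[train], C[test], X[test])
        y_tr, y_te = residualize(C[train], y[train], C[test], y[test])
        model = Ridge(alpha=1.0).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.mean(scores))


# Synthetic example (assumed data): two covariates affect both the
# features and the target.
rng = np.random.default_rng(0)
n = 300
C = rng.normal(size=(n, 2))
X = rng.normal(size=(n, 5)) + C @ rng.normal(size=(2, 5))
y = 2.0 * X[:, 0] + 3.0 * C[:, 0] + rng.normal(scale=0.5, size=n)

print(cv_score_with_adjustment(X, y, C))
```

Fitting the adjustment on the full data set before splitting would let information from held-out samples influence the training features and target, which is the kind of leakage the approach is designed to avoid.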

Background

Automated machine learning (AutoML) refers to methods which assist (potentially nonexpert) users in the optimization of model construction steps such as data preprocessing, feature selection, feature transformations, model selection, and hyperparameter tuning. The Tree-based Pipeline Optimization Tool (TPOT) [1, 2] is a genetic programming (GP) based AutoML which has been successfully used in biomedical applications

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
