Embedding covariate adjustments in tree-based automated machine learning for biomedical big data analyses



METHODOLOGY ARTICLE

Open Access

Elisabetta Manduchi1,2*, Weixuan Fu2, Joseph D. Romano1, Stefano Ruberto1 and Jason H. Moore1,2

*Correspondence: manduchi@pennmedicine.upenn.edu
1 Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
Full list of author information is available at the end of the article

Abstract

Background: A typical task in bioinformatics consists of identifying which features are associated with a target outcome of interest and building a predictive model. Automated machine learning (AutoML) systems such as the Tree-based Pipeline Optimization Tool (TPOT) constitute an appealing approach to this end. However, in biomedical data, there are often baseline characteristics of the subjects in a study or batch effects that need to be adjusted for in order to better isolate the effects of the features of interest on the target. Thus, the ability to perform covariate adjustments becomes particularly important for applications of AutoML to biomedical big data analysis.

Results: We developed an approach to adjust for covariates affecting features and/or target in TPOT. Our approach is based on regressing out the covariates in a manner that avoids 'leakage' during the cross-validation training procedure. We describe applications of this approach to toxicogenomics and schizophrenia gene expression data sets. The TPOT extensions discussed in this work are available at https://github.com/EpistasisLab/tpot/tree/v0.11.1-resAdj.

Conclusions: In this work, we address an important need in the context of AutoML, which is particularly crucial for applications to bioinformatics and medical informatics, namely covariate adjustments. To this end we present a substantial extension of TPOT, a genetic programming based AutoML approach. We show the utility of this extension by applications to large toxicogenomics and differential gene expression data. The method is generally applicable in many other scenarios from the biomedical field.

Keywords: AutoML, Covariate adjustment, Genetic programming, Pathways, Feature importance
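The idea of regressing out covariates without leakage, as summarized in the Results above, can be illustrated with a minimal Python sketch. This is not the TPOT extension's code: the function names (`fit_adjustment`, `apply_adjustment`, `adjust_fold`) are hypothetical, and the sketch assumes a single numeric covariate and a single feature for brevity. The point it illustrates is that the adjustment regression is fitted on the training portion of a fold only, and the fitted coefficients are then reused, unchanged, on the held-out portion.

```python
from statistics import mean


def fit_adjustment(covariate, values):
    """Fit values ~ intercept + slope * covariate by ordinary least squares."""
    c_bar, v_bar = mean(covariate), mean(values)
    sxx = sum((c - c_bar) ** 2 for c in covariate)
    sxy = sum((c - c_bar) * (v - v_bar) for c, v in zip(covariate, values))
    slope = sxy / sxx if sxx else 0.0  # constant covariate: nothing to remove
    intercept = v_bar - slope * c_bar
    return intercept, slope


def apply_adjustment(covariate, values, intercept, slope):
    """Return residuals: values minus the part predicted by the covariate."""
    return [v - (intercept + slope * c) for c, v in zip(covariate, values)]


def adjust_fold(covariate, feature, train_idx, test_idx):
    """Adjust one cross-validation fold without leakage.

    The regression is fitted on training samples only; the held-out
    samples are adjusted with those same coefficients.
    """
    cov_train = [covariate[i] for i in train_idx]
    feat_train = [feature[i] for i in train_idx]
    cov_test = [covariate[i] for i in test_idx]
    feat_test = [feature[i] for i in test_idx]

    intercept, slope = fit_adjustment(cov_train, feat_train)
    train_adj = apply_adjustment(cov_train, feat_train, intercept, slope)
    test_adj = apply_adjustment(cov_test, feat_test, intercept, slope)
    return train_adj, test_adj


# Toy data (made up for illustration): the feature is fully explained
# by the covariate, so the adjusted values should be close to zero.
covariate = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
feature = [1.0 + 2.0 * c for c in covariate]
train_adj, test_adj = adjust_fold(covariate, feature, [0, 1, 2, 3], [4, 5])
```

Fitting the regression on all samples before splitting would let information from the held-out samples influence the adjusted training data, which is the kind of leakage the approach is designed to avoid. With several covariates, the same pattern would apply with a multiple regression in place of the single-covariate fit.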

Background

Automated machine learning (AutoML) refers to methods which assist (potentially nonexpert) users in the optimization of model construction steps such as data preprocessing, feature selection, feature transformations, model selection, and hyperparameter tuning. The Tree-based Pipeline Optimization Tool (TPOT) [1, 2] is a genetic programming (GP) based AutoML which has been successfully used in biomedical applications

© The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other thir