A pitfall for machine learning methods aiming to predict across cell types


SHORT REPORT

Open Access

Jacob Schreiber1, Ritambhara Singh2,3, Jeffrey Bilmes1,4 and William Stafford Noble1,2*

*Correspondence: [email protected]
1 Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
2 Department of Genome Science, University of Washington, Seattle, USA
Full list of author information is available at the end of the article

Abstract

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.

Keywords: Machine learning, Epigenomics, Genomics
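To make the memorization effect concrete, the following is a minimal sketch on synthetic data, not the analysis performed in this report. It assumes that each locus has a baseline activity shared across cell types plus a smaller cell type-specific deviation; the array shapes, noise scales, and function names (`average_activity_baseline`, `pearson`) are illustrative assumptions. The "model" simply predicts, for a held-out cell type, the average activity of each locus across the training cell types.

```python
import numpy as np


def average_activity_baseline(train):
    """Predict each locus as its mean activity across the training cell types.

    train: array of shape (n_loci, n_training_cell_types).
    """
    return train.mean(axis=1)


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


rng = np.random.default_rng(0)
n_loci, n_cell_types = 2000, 6

# Synthetic assumption: activity = shared per-locus effect
# + a smaller cell type-specific deviation.
locus_effect = rng.normal(0.0, 1.0, size=(n_loci, 1))
cell_specific = rng.normal(0.0, 0.3, size=(n_loci, n_cell_types))
activity = locus_effect + cell_specific

# Cross-cell type split over the SAME loci: the last cell type is held out.
train, test = activity[:, :-1], activity[:, -1]
prediction = average_activity_baseline(train)

# Agreement with the held-out cell type as a whole.
overall_r = pearson(prediction, test)
# Agreement with only the cell type-specific part of the held-out signal.
specific_r = pearson(prediction, cell_specific[:, -1])

print(f"overall correlation:            {overall_r:.3f}")
print(f"cell type-specific correlation: {specific_r:.3f}")
```

Under these assumptions, the overall correlation is expected to be high even though the baseline carries no information about what distinguishes the held-out cell type, which is why comparing a model against such an average-activity baseline can help diagnose the pitfall.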

Machine learning has been applied to a wide variety of genomic prediction problems, such as predicting transcription factor binding, identifying active cis-regulatory elements, constructing gene regulatory networks, and predicting the effects of single nucleotide polymorphisms. The inputs to these models typically include some combination of nucleotide sequence and signals from epigenomics assays.

Given such data, the most common approach to evaluating predictive models is a “cross-chromosomal” strategy, which involves training a separate model for each cell type and partitioning genomic loci into some number of folds for cross-validation (Fig. 1a). Typically, the genomic loci are split by chromosome. This strategy has been employed for models that predict gene expression [1–3], elements of chromatin architecture [4, 5], transcription factor binding [6, 7], and cis-regulatory elements [8–13]. Although the cross-chromosomal approach measures how well the model generalizes to new genomic loci, it does not measure how well the model generalizes to new cell types. As such, the cross-chromosomal approach is typically used when the primary goal is to obtain biological insights from the trained model.

An alternative, “cross-cell type” validation approach can be used to measure how well a model generalizes to a new cell type. This approach involves training a model in one or more cell types and then evaluating it in one or more other cell types (Fig. 1b). Note that

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other th