COCOA: coordinate covariation analysis of epigenetic heterogeneity


METHOD

Open Access

John T. Lawson1,2, Jason P. Smith2,3, Stefan Bekiranov2,3, Francine E. Garrett-Bakelman3,4,5 and Nathan C. Sheffield1,2,3,5*

* Correspondence: nsheffield@virginia.edu
1 Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
2 Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA
Full list of author information is available at the end of the article

Abstract

A key challenge in epigenetics is to determine the biological significance of epigenetic variation among individuals. We present Coordinate Covariation Analysis (COCOA), a computational framework that uses covariation of epigenetic signals across individuals and a database of region sets to annotate epigenetic heterogeneity. COCOA is the first such tool for DNA methylation data and can also analyze any epigenetic signal with genomic coordinates. We demonstrate COCOA’s utility by analyzing DNA methylation, ATAC-seq, and multi-omic data in supervised and unsupervised analyses, showing that COCOA provides new understanding of inter-sample epigenetic variation. COCOA is available on Bioconductor (http://bioconductor.org/packages/COCOA).

Keywords: Epigenetics, DNA methylation, Chromatin accessibility, Principal component analysis, Dimensionality reduction, Data integration, Cancer, EZH2, Multiomics

Introduction

Epigenetic data is inherently high-dimensional and often difficult to interpret. Because of the high dimensionality, it is common to group individual genomic loci into collections that share a functional annotation, such as binding of a particular transcription factor [1–3]. These genomic locus collections, or region sets, are analogous to the more common gene sets, but relax the constraint that data must be gene-centric. While gene set approaches may be applied to epigenetic data by linking regions to nearby genes [4], this linking process is ambiguous and loses information because a regulatory locus may affect the expression of multiple genes or more distant genes. Alternatively, a region-centric approach is often more appropriate for epigenetic data, and there are now many region-based databases and analytical approaches [1, 2, 5–7], such as using region set databases for enrichment analysis [1, 7, 8] or to aggregate epigenetic signals from individual samples across regions to assign scores of regulatory activity to individual samples or single cells [2, 3, 6, 9].

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the materi