The design and construction of reference pangenome graphs with minigraph


METHOD

Open Access

Heng Li1,2*, Xiaowen Feng1,2 and Chong Chu2

*Correspondence: [email protected]

1 Department of Data Sciences, Dana-Farber Cancer Institute, Boston 02215, MA, USA

2 Department of Biomedical Informatics, Harvard Medical School, Boston 02215, MA, USA

Abstract

Recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinates of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.

Keywords: Bioinformatics, Genomics, Pangenome

Background

The human reference genome is a fundamental resource for human genetics and biomedical research. The primary sequences of the reference genome GRCh38 [1] are a mosaic of haplotypes, with each haplotype segment derived from a single human individual. They cannot represent the genetic diversity in human populations, and as a result, each individual may carry thousands of large germline variants absent from the reference genome [2]. Some of these variants are likely associated with phenotypes [3] but are often missed or misinterpreted when we map sequence data to GRCh38, in particular with short reads [4]. This under-representation of genetic diversity may become a limiting factor in our understanding of genetic variation.

Meanwhile, advances in long-read sequencing technologies make it possible to assemble the genome of a human individual to a quality comparable to GRCh38 [1, 5]. There are already a dozen high-quality human assemblies available in GenBank [6]. Properly integrating these genomes into a reference pangenome, which refers to a collection of genomes [7], would potentially address the issues with a single linear reference.

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.o