Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs

  • PDF / 600,022 Bytes
  • 20 Pages / 595 x 794 pts Page_size
  • 87 Downloads / 199 Views

DOWNLOAD

REPORT


SOFTWARE

Open Access

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs Guillaume Holley* and Páll Melsted *Correspondence: [email protected] Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland

Abstract Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability: https://github.com/pmelsted/bifrost

Introduction The de Bruijn graph is an abstract data structure with a rich history in computational biology as a tool for genome assembly [1, 2]. With the advent of high throughput sequencing, the Overlap Layout Consensus (OLC) framework frequently used to assemble Sanger sequencing data [3] was progressively replaced in favor of de Bruijn graph-based methods. Since 2008, a wide range of genome assemblers based on the de Bruijn graph have been released [4–10]. Although single molecule sequencing technologies [11, 12] have reintroduced the OLC framework as the method of choice to assemble long and erroneous reads [13–16], de Bruijn graph-based methods are nonetheless used to assemble and correct long reads [17, 18]. Overall, de Bruijn graphs have found widespread use for a variety of problems such as de novo transcriptome assembly [19], variant calling [20], short read compression [21], short read correction [22], long read correction [17], and short read mapping [23] to name a few. The colored de Bruijn graph is a variant of the de Bruijn graph which keeps track of the source of each vertex in the graph [24]. The initial application was for assembly and genotyping, but it has also found use in pan-genomics [25], variant calling [26], and transcript quantification methods [27].

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly f