Superstring Graph: A New Approach for Genome Assembly

With the increasing impact of genomics in life sciences, the inference of high quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: indeed, high throughput sequencing apparatu

  • PDF / 445,599 Bytes
  • 14 Pages / 439.37 x 666.142 pts Page_size
  • 77 Downloads / 166 Views

DOWNLOAD

REPORT


LIRMM, Universit´e de Montpellier, CNRS UMR 5506, Montpellier, France {cazaux,rivals}@lirmm.fr 2 Institut Biologie Computationnelle, Montpellier, France 3 INRIA Rhˆ one-Alpes and Universit´e Lyon 1, CNRS, UMR 5558, 69000 Lyon, France [email protected] Abstract. With the increasing impact of genomics in life sciences, the inference of high quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: indeed, high throughput sequencing apparatus yield millions of short sequencing reads that need to be merged based on their overlaps. Overlap graph based algorithms were used with the first generation of sequencers, while de Bruijn graph (DBG) based methods were preferred for the second generation. Because the sequencing coverage varies locally along the molecule, state-of-the-art assembly programs now follow an iterative process that requires the construction of de Bruijn graphs of distinct orders (i.e., sizes of the overlaps). The set of resulting sequences, termed unitigs, provide an important improvement compared to single DBG approaches. Here, we present a novel approach based on a digraph, the Superstring Graph, that captures all desired sizes of overlaps at once and allows to discard unreliable overlaps. With a simple algorithm, the Superstring Graph delivers sequences that includes all the unitigs obtained from multiple DBG as substrings. In linear time and space, it combines the efficiency of a greedy approach to the advantages of using a single graph. In summary, we present a first and formal comparison of the output of state-of-the-art genome assemblers.

1

Introduction

Ongoing improvements in DNA sequencing technologies have dramatically increased the throughput of sequencers, thereby authorising the launch of very large genome projects: the 1000 human genomes for studying natural variations [15], the 10 K vertebrate genomes for phylogenomics issues [11] or the 10,000 rice genomes, which aims at getting a genomic overview of all wild and cultivated rice varieties. If getting the collections of raw sequencing reads becomes easier and cheaper, assembling complex eukaryotic genomes remains one of the major practical and theoretical challenges in bioinformatics. This work is supported by ANR Colib’read (ANR-12-BS02-0008), the Institut de Biologie Computationnelle (ANR-11-BINF-0002). c Springer International Publishing Switzerland 2016  R. Dondi et al. (Eds.): AAIM 2016, LNCS 9778, pp. 39–52, 2016. DOI: 10.1007/978-3-319-41168-2 4

40

B. Cazaux et al.

With the advent of Next Generation sequencing technologies, most of the sequencing performed yields huge numbers of short reads. For that reason, the de Bruijn Graph (DBG) approach, also termed as Eulerian sequence assembly, has been preferred to the Overlap-Layout-Consensus approach, which resorts to an overlap graph and was used with traditional Sanger sequencing. The DBG encodes each k-mer of the read set as a node and contains an arc from one node to another if the k + 1-mer obtained by me