Superstring Graph: A New Approach for Genome Assembly

With the increasing impact of genomics in life sciences, the inference of high quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: indeed, high throughput sequencing apparatu

PDF / 445,599 Bytes
14 Pages / 439.37 x 666.142 pts Page_size
77 Downloads / 176 Views

DOWNLOAD

REPORT

LIRMM, Universit´e de Montpellier, CNRS UMR 5506, Montpellier, France {cazaux,rivals}@lirmm.fr 2 Institut Biologie Computationnelle, Montpellier, France 3 INRIA Rhˆ one-Alpes and Universit´e Lyon 1, CNRS, UMR 5558, 69000 Lyon, France [email protected] Abstract. With the increasing impact of genomics in life sciences, the inference of high quality, reliable, and complete genome sequences is becoming critical. Genome assembly remains a major bottleneck in bioinformatics: indeed, high throughput sequencing apparatus yield millions of short sequencing reads that need to be merged based on their overlaps. Overlap graph based algorithms were used with the ﬁrst generation of sequencers, while de Bruijn graph (DBG) based methods were preferred for the second generation. Because the sequencing coverage varies locally along the molecule, state-of-the-art assembly programs now follow an iterative process that requires the construction of de Bruijn graphs of distinct orders (i.e., sizes of the overlaps). The set of resulting sequences, termed unitigs, provide an important improvement compared to single DBG approaches. Here, we present a novel approach based on a digraph, the Superstring Graph, that captures all desired sizes of overlaps at once and allows to discard unreliable overlaps. With a simple algorithm, the Superstring Graph delivers sequences that includes all the unitigs obtained from multiple DBG as substrings. In linear time and space, it combines the eﬃciency of a greedy approach to the advantages of using a single graph. In summary, we present a ﬁrst and formal comparison of the output of state-of-the-art genome assemblers.

1

Introduction

Ongoing improvements in DNA sequencing technologies have dramatically increased the throughput of sequencers, thereby authorising the launch of very large genome projects: the 1000 human genomes for studying natural variations [15], the 10 K vertebrate genomes for phylogenomics issues [11] or the 10,000 rice genomes, which aims at getting a genomic overview of all wild and cultivated rice varieties. If getting the collections of raw sequencing reads becomes easier and cheaper, assembling complex eukaryotic genomes remains one of the major practical and theoretical challenges in bioinformatics. This work is supported by ANR Colib’read (ANR-12-BS02-0008), the Institut de Biologie Computationnelle (ANR-11-BINF-0002). c Springer International Publishing Switzerland 2016 R. Dondi et al. (Eds.): AAIM 2016, LNCS 9778, pp. 39–52, 2016. DOI: 10.1007/978-3-319-41168-2 4

40

B. Cazaux et al.

With the advent of Next Generation sequencing technologies, most of the sequencing performed yields huge numbers of short reads. For that reason, the de Bruijn Graph (DBG) approach, also termed as Eulerian sequence assembly, has been preferred to the Overlap-Layout-Consensus approach, which resorts to an overlap graph and was used with traditional Sanger sequencing. The DBG encodes each k-mer of the read set as a node and contains an arc from one node to another if the k + 1-mer obtained by me

Data Loading...

Superstring Graph: A New Approach for Genome Assembly

Recommend Documents

From one linear genome to a graph-based pan-genome: a new era for genomics

Virus Graph and COVID-19 Pandemic: A Graph Theory Approach

GAML: genome assembly by maximum likelihood

Multiplex PCR for Gap Closing (Whole-genome Assembly)

A systematic comparison of chloroplast genome assembly tools

A Diagnostic Approach for Neurogenetic Disorders in the Genome Era

Assembly of TALEN and mTALE-Act for Plant Genome Engineering

Graph Based Approach for Heart Disease Prediction

A Knowledge Graph Based Approach for Mobile Application Recommendation

Using GraphQL with a New Graph Database

A graph database-based approach utilizing FAHP and directed bipartite graph for service composition

Astrophysics A New Approach