Near-optimal assembly for shotgun sequencing with noisy reads


PROCEEDINGS

Open Access

Ka-Kit Lam1, Asif Khalak2, David Tse1*

From RECOMB-Seq: Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing, Pittsburgh, PA, USA. 31 March - 05 April 2014

Abstract

Recent work identified the fundamental limits on the information requirements, in terms of read length and coverage depth, for successful de novo genome reconstruction from shotgun sequencing data, based on the idealized assumption of no errors in the reads (noiseless reads). In this work, we show that even when there is noise in the reads, one can successfully reconstruct the genome with information requirements close to the noiseless fundamental limit. A new assembly algorithm, X-phased Multibridging, is designed based on a probabilistic model of the genome. It is shown through analysis to perform well on the model, and through simulations to perform well on real genomes.

Background

Optimality in the acquisition and processing of DNA sequence data represents a serious technology challenge from various perspectives, including sample preparation, instrumentation and algorithm development. Despite scientific achievements such as the sequencing of the human genome and ambitious plans for the future [1,2], there is no single, overarching framework to identify the fundamental limits on the information required for successful output of the genome from the sequence data. Information theory has been successful in providing the foundation for such a framework in digital communication [3], and we believe that it can also provide insights into understanding the essential aspects of DNA sequencing. A first step in this direction has been taken in the recent work [4], where the fundamental limits on the minimum read length and coverage depth required for successful assembly are identified in terms of the statistics of various repeat patterns in the genome. Successful assembly is defined as the reconstruction of the underlying genome, i.e. genome finishing [5].
The genome finishing problem is particularly attractive for analysis because it is clearly and unambiguously defined and is arguably the ultimate goal in assembly. There is also a scientific need for finished genomes [6,7]. Until recently, automated genome finishing was beyond reach [8] in all but the simplest of genomes. New advances using ultra-long read single-molecule sequencing, however, have reported successful automated finishing [9,10]. Even in the case where finished assembly is not possible, the results in [4] provide insights on optimal use of read information since the heart of the problem lies in how one can optimally use the read information to resolve repeats. Figure 1a gives an example result for the repeat statistics of E. coli K12. The x-axis of the plot is

* Correspondence: [email protected]
1 Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, California, United States
Full list of author information is available at the end of the article
