Near-optimal assembly for shotgun sequencing with noisy reads


PROCEEDINGS

Open Access

Ka-Kit Lam1, Asif Khalak2, David Tse1*

From RECOMB-Seq: Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing
Pittsburgh, PA, USA. 31 March - 05 April 2014

* Correspondence: [email protected]
1 Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, California, United States
Full list of author information is available at the end of the article

Abstract

Recent work identified the fundamental limits on the information requirements, in terms of read length and coverage depth, for successful de novo genome reconstruction from shotgun sequencing data, based on the idealized assumption of no errors in the reads (noiseless reads). In this work, we show that even when there is noise in the reads, one can successfully reconstruct the genome with information requirements close to the noiseless fundamental limit. A new assembly algorithm, X-phased Multibridging, is designed based on a probabilistic model of the genome. It is shown through analysis to perform well on the model, and through simulations to perform well on real genomes.

Background

Optimality in the acquisition and processing of DNA sequence data represents a serious technology challenge from various perspectives, including sample preparation, instrumentation and algorithm development. Despite scientific achievements such as the sequencing of the human genome and ambitious plans for the future [1,2], there is no single, overarching framework to identify the fundamental limits on the information required for successful output of the genome from the sequence data. Information theory has been successful in providing the foundation for such a framework in digital communication [3], and we believe that it can also provide insight into the essential aspects of DNA sequencing. A first step in this direction has been taken in the recent work [4], where the fundamental limits on the minimum read length and coverage depth required for successful assembly are identified in terms of the statistics of various repeat patterns in the genome. Successful assembly is defined as the reconstruction of the underlying genome, i.e. genome finishing [5].

The genome finishing problem is particularly attractive for analysis because it is clearly and unambiguously defined and is arguably the ultimate goal in assembly. There is also a scientific need for finished genomes [6,7]. Until recently, automated genome finishing was beyond reach [8] for all but the simplest genomes. Recent work using ultra-long-read single-molecule sequencing, however, has reported successful automated finishing [9,10]. Even in the case where finished assembly is not possible, the results in [4] provide insight into the optimal use of read information, since the heart of the problem lies in how to use that information to resolve repeats. Figure 1a gives an example result for the repeat statistics of E. coli K12. The x-axis of the plot is