State of the art de novo assembly of human genomes from massively parallel sequencing data
Yingrui Li,1 Yujie Hu,1,2 Lars Bolund1,3 and Jun Wang1,2*
1 BGI-Shenzhen, Shenzhen, Guangdong 518083, China
2 The Graduate University of the Chinese Academy of Sciences, Beijing 100062, China
3 Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark; Danish Center for Translational Breast Cancer Research, Copenhagen, Denmark; Institute of Human Genetics, University of Aarhus, Denmark
*Correspondence to: E-mail: [email protected]
Date received (in revised form): 17th March 2010
Abstract
Recent studies of human genomes have demonstrated that de novo assemblies can identify genetic variations that are difficult to detect with mapping-based approaches. Massively parallel sequencing makes it possible to construct multiple human genome assemblies, but conventional bioinformatics solutions are costly and slow, creating bottlenecks in the process. This review describes two public short-read de novo assembly applications that can handle human genomes, ABySS and SOAPdenovo. It also discusses the technical aspects and future challenges of human genome de novo assembly from short reads.
Keywords: de novo assembly, de Bruijn graph, massively parallel sequencing
Introduction
One of the important goals of bioinformatics is to decipher the genomic DNA sequence of a species. The genome serves as the digital basis of any life science. Access to a reference genome sequence for a species significantly facilitates biological studies, as shown by the genomics-guided research that followed the Human Genome Project.1 It is conventionally assumed that once a reference genome is available, subsequent studies will take a mapping-based ‘re-sequencing’ approach aimed at variation detection, as seen in many human genomics projects.2,3 Recent studies, however, suggest that assembly-based approaches have greater potential to detect a more complete set of genetic variations, especially novel sequences4 and structural variations,5 even in relatively well-studied human genomes. Thus, assembly of individual genomes has again been brought to the frontier of bioinformatics. With multiple assembled individual genomes available, it would be very interesting to see how rearrangements of different length scales and individual-specific sequences are distributed in populations.
Because of costs, the size of the human genome has constrained individual human genome assembly by conventional Sanger sequencing. Second-generation sequencing technology produces large amounts of data more affordably, but its intrinsic high throughput and short read length present considerable challenges to bioinformatics, because of the difficulties in handling the data structure and in applying an appropriate assembly algorithm. Although many short-read de novo assemblers have been developed,6 only two of them, ABySS7 and SOAPdenovo,8 are said to be capable of assembling human genomes de novo. This paper presents a review of the two software packages and discusses the