ARYANA: Aligning Reads by Yet Another Approach


PROCEEDINGS

Open Access

Milad Gholami1†, Aryan Arbabi2†, Ali Sharifi-Zarchi3,4, Hamidreza Chitsaz5, Mehdi Sadeghi6*

From RECOMB-Seq: Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing. Pittsburgh, PA, USA. 31 March - 05 April 2014

Abstract

Motivation: Although there are many algorithms and software tools for aligning sequencing reads, fast gapped sequence search is far from solved. The strong interest in fast alignment is best reflected in the $10^6 prize for the Innocentive competition on aligning a collection of reads to a given database of reference genomes. In addition, de novo assembly of long next-generation sequencing reads requires fast overlap-layout-consensus algorithms, which depend on fast and accurate alignment.

Contribution: We introduce ARYANA, a fast gapped read aligner built on the BWA indexing infrastructure with a completely new alignment engine that makes it significantly faster than three other aligners (Bowtie2, BWA and SeqAlto) with comparable generality and accuracy. Instead of time-consuming backtracking procedures for handling mismatches, ARYANA uses the seed-and-extend algorithmic framework and achieves significantly improved efficiency by integrating novel algorithmic techniques, including dynamic seed selection, bidirectional seed extension, reset-free hash tables, and gap-filling dynamic programming. As read length increases, ARYANA's advantage in speed and alignment rate becomes more evident, which matches the trend toward longer reads as sequencing technologies evolve. The algorithmic platform of ARYANA makes it easy to develop mission-specific aligners for other applications using the ARYANA engine.

Availability: ARYANA with complete source code can be obtained from http://github.com/aryana-aligner
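The seed-and-extend framework named in the abstract can be illustrated with a toy sketch. This is not ARYANA's engine: it assumes a plain k-mer dictionary instead of the BWA index, fixed-length seeds instead of dynamic seed selection, and an ungapped mismatch count instead of gap-filling dynamic programming. The function names `build_kmer_index` and `seed_and_extend` and the parameter `max_mismatches` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def seed_and_extend(read, reference, index, k, max_mismatches=2):
    """Return (ref_start, mismatches) of the best ungapped hit, or None."""
    # Seed phase: each exact k-mer hit votes for the read start it implies.
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1

    # Extend phase: verify candidates over the full read, most votes first.
    best = None
    for start in sorted(votes, key=lambda s: (-votes[s], s)):
        if start < 0 or start + len(read) > len(reference):
            continue  # candidate would run off the reference
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best
```

Exact seeds narrow the search to a few candidate positions, so the costlier verification step runs only where a match is plausible. Backtracking-based aligners instead explore mismatches during the index search itself.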

* Correspondence: [email protected]
† Contributed equally
6 National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
Full list of author information is available at the end of the article

Introduction

Every living cell carries a book of life consisting of several thousand to billions of characters, with answers to many vital questions. Human efforts to decipher that book have gained increasing momentum since 1953, when the double helical structure of DNA was discovered. Twenty years later, W. Gilbert and A. Maxam read the first 24-character word of the book [1], while F. Sanger and his colleagues were developing another sequencing method based on labeled dideoxynucleotide triphosphates that act as chain terminators in a PCR reaction [2,3]. About three decades after the first DNA sequencing, the dream of reading the human book of life was realized with the completion of the Human Genome Project [4-6]. The International Human Genome Sequencing Consortium used a laborious hierarchical process to divide the genome into smaller covering tiles, while the Celera Genomics firm replaced that process with computational sequence-assembly software applied to the data