SPRITE: A Fast Parallel SNP Detection Pipeline
We present Sprite, a new high-performance data analysis pipeline for detecting single nucleotide polymorphisms (SNPs) in the human genome. A SNP detection pipeline for next-generation sequencing data uses several software tools, including tools for read
1 Introduction
In this work, we consider the pervasive genetic variation detection workflow in biomedical informatics. The goal of this workflow is to automatically determine the genetic variations present in the genome of an individual (called the donor) by comparing it to a reference genome. SNPs are nucleotide differences at a single position and account for nearly 90% of the total variations. Detecting SNPs with high accuracy plays an important role in applications such as identifying disease risk and studying drug efficacy [31]. SNP detection using current state-of-the-art tools can take more than a day of sequential compute time, and the pipeline is typically I/O bound.
In this paper, we focus on improving the end-to-end efficiency and parallel scaling of this pipeline, and we design new hybrid parallel algorithms and software (Sprite, comprising prune, sampa, and parsnip). For a realistic input data set, the end-to-end running time of Sprite on the Stampede supercomputer is 11.7 hours on a single compute node and 48 minutes on 16 nodes. In comparison, the end-to-end time using current state-of-the-art tools on a single compute node is 23 hours, so we achieve speedups of 1.97× on a single node and 28.7× on 16 nodes. We also show that the resulting SNP detection quality is comparable to that of two state-of-the-art variant detection pipelines. Further, we create Sprite+, an in-memory version that does not generate intermediate files. Sprite+ can be executed on just a few compute nodes (requiring about 105 GB of aggregate main memory for the human genome).
© Springer International Publishing Switzerland 2016. J.M. Kunkel et al. (Eds.): ISC High Performance 2016, LNCS 9697, pp. 159–177, 2016. DOI: 10.1007/978-3-319-41321-1_9
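The speedup figures quoted in the introduction can be checked directly from the reported running times. The short sketch below does that arithmetic. It also derives two quantities that the text does not state and that are shown only as an illustration: the scaling of Sprite relative to its own single-node time, and the corresponding parallel efficiency under the usual definition (speedup divided by node count). The variable and function names are hypothetical, and because the 11.7-hour figure is rounded, the derived values are approximate.

```python
# Running times reported in the text, converted to minutes.
BASELINE_1_NODE_MIN = 23 * 60      # state-of-the-art tools, single node
SPRITE_1_NODE_MIN = 11.7 * 60      # Sprite, single node (rounded in the text)
SPRITE_16_NODE_MIN = 48            # Sprite, 16 nodes
NODES = 16


def speedup(reference_min: float, measured_min: float) -> float:
    """Ratio of a reference running time to a measured running time."""
    return reference_min / measured_min


# Speedups relative to the state-of-the-art single-node baseline.
s1 = speedup(BASELINE_1_NODE_MIN, SPRITE_1_NODE_MIN)
s16 = speedup(BASELINE_1_NODE_MIN, SPRITE_16_NODE_MIN)

# Derived here for illustration only (not stated in the text):
# Sprite's scaling relative to its own single-node time, and the
# parallel efficiency that this implies on 16 nodes.
self_scaling = speedup(SPRITE_1_NODE_MIN, SPRITE_16_NODE_MIN)
efficiency = self_scaling / NODES

print(f"speedup vs. baseline, 1 node:   {s1:.2f}x")
print(f"speedup vs. baseline, 16 nodes: {s16:.2f}x")
print(f"Sprite 1 -> 16 node scaling:    {self_scaling:.2f}x")
print(f"parallel efficiency, 16 nodes:  {efficiency:.1%}")
```

Separating the two reference points makes it clear how much of the 16-node speedup comes from the faster single-node pipeline and how much comes from parallel scaling.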
V. Rengasamy and K. Madduri
2 Background: Variant Detection Pipelines
The genome sequences of any two (human) individuals are highly similar. However, the small percentage of genetic variation (variants) is believed to have important biological and medical implications. Identifying an individual’s single nucleotide genetic variants has become a standard first step in many biological and biomedical applications. In this section, we describe the three key steps in the workflow to detect genetic variants, shown in Fig. 1, and mention prior approaches to exploit parallelism.
Fig. 1. A simplified view of computational stages in a SNP detection pipeline.
Alignment. The output of a DNA sequencer is a set of reads. A read is a short segment of the genome whose sequence is known, but whose location in the genome is not known. The first step of this pipeline, Alignment, refers to identifying the location of the donor genome’s reads, by using an index built from the known reference genome. Alignment is the most computationally intensive task in the workflow. This step takes FASTQ (FQ) files containing the reads as input and produces output in the Sequence Alignment/Map (SAM) format [17]. There are several approaches to aligning reads against a reference genome. Usually, an alignment algorithm uses an index of the reference gen