SPRITE: A Fast Parallel SNP Detection Pipeline

We present Sprite, a new high-performance data analysis pipeline for detecting single nucleotide polymorphisms (SNPs) in the human genome. A SNP detection pipeline for next-generation sequencing data uses several software tools, including tools for read


1 Introduction

In this work, we consider the pervasive genetic variation detection workflow in biomedical informatics. The goal of this workflow is to automatically determine the genetic variations present in the genome of an individual (called the donor) by comparing it to a reference genome. SNPs are nucleotide differences at a single position and account for nearly 90% of the total variations. Detecting SNPs with high accuracy plays an important role in identifying disease risk, studying drug efficacy [31], and related applications. SNP detection using current state-of-the-art tools can take more than a day of sequential compute time, and the pipeline is typically I/O bound.

In this paper, we focus on improving the end-to-end efficiency and parallel scaling of this pipeline, and design new hybrid parallel algorithms and software (Sprite, comprising prune, sampa, and parsnip). The end-to-end running time of Sprite on the Stampede supercomputer is 11.7 hours on a single compute node and 48 min on 16 nodes, for a realistic input data set. In comparison, the end-to-end time using current state-of-the-art tools on a single compute node is 23 hours, so we achieve speedups of 1.97× on a single node and 28.7× on 16 nodes. We also show that the resulting SNP detection quality is comparable to that of two state-of-the-art variant detection pipelines. Further, we create Sprite+, an in-memory version that does not generate intermediate files. Sprite+ can be executed on just a few compute nodes (requiring about 105 GB of aggregate main memory for the human genome).

© Springer International Publishing Switzerland 2016. J.M. Kunkel et al. (Eds.): ISC High Performance 2016, LNCS 9697, pp. 159–177, 2016. DOI: 10.1007/978-3-319-41321-1_9
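The speedup figures quoted in the introduction follow directly from the reported running times. The short Python sketch below reproduces that arithmetic. The variable and function names are illustrative, and the self-relative speedup and parallel efficiency are derived quantities that the text above does not state explicitly.

```python
# Running times reported in the introduction, converted to hours.
baseline_hours = 23.0            # state-of-the-art tools, one compute node
sprite_1_node_hours = 11.7       # Sprite, one compute node
sprite_16_node_hours = 48 / 60   # Sprite, 16 compute nodes (48 min)


def speedup(reference_time, new_time):
    """Speedup of a run taking new_time relative to one taking reference_time."""
    return reference_time / new_time


# Speedups over the state-of-the-art baseline, as quoted in the text.
speedup_1_node = speedup(baseline_hours, sprite_1_node_hours)
speedup_16_nodes = speedup(baseline_hours, sprite_16_node_hours)

# Derived quantities (not stated in the text): Sprite compared with itself.
self_relative_speedup = speedup(sprite_1_node_hours, sprite_16_node_hours)
parallel_efficiency = self_relative_speedup / 16

print(f"speedup over baseline, 1 node:   {speedup_1_node:.2f}x")
print(f"speedup over baseline, 16 nodes: {speedup_16_nodes:.2f}x")
print(f"Sprite 1 node -> 16 nodes:       {self_relative_speedup:.2f}x")
print(f"parallel efficiency on 16 nodes: {parallel_efficiency:.2f}")
```

The computed values agree with the quoted 1.97× and 28.7× up to rounding of the reported timings.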

V. Rengasamy and K. Madduri

2 Background: Variant Detection Pipelines

The genome sequences of any two (human) individuals are highly similar. However, the small percentage of genetic variation (variants) is believed to have important biological and medical implications. Identifying an individual’s single nucleotide genetic variants has become a standard first step in many biological and biomedical applications. In this section, we describe the three key steps in the workflow to detect genetic variants, shown in Fig. 1, and mention prior approaches to exploit parallelism.

Fig. 1. A simplified view of computational stages in a SNP detection pipeline.

Alignment. The output of a DNA sequencer is a set of reads. A read is a short segment of the genome whose sequence is known, but whose location in the genome is not known. The first step of this pipeline, Alignment, refers to identifying the location of the donor genome’s reads, by using an index built from the known reference genome. Alignment is the most computationally intensive task in the workflow. This step takes FASTQ (FQ) files containing the reads as input and produces output in the Sequence Alignment/Map (SAM) format [17]. There are several approaches to aligning reads against a reference genome. Usually, an alignment algorithm uses an index of the reference gen
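To make the idea of index-based alignment concrete, the following Python sketch builds a simple k-mer hash index over a toy reference and uses it to locate reads, tolerating a small number of mismatches. This is an illustrative seed-and-verify scheme under stated assumptions, not the algorithm used by Sprite or by any particular aligner. The function names, the toy sequences, and the parameters `k` and `max_mismatches` are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return sorted 0-based reference positions where the read aligns.

    Each k-mer of the read is used as a seed, so a mismatch inside one
    seed does not hide a hit that another seed can still find.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for seed_pos in index.get(seed, []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # read would hang off either end of the reference
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


# Toy data (assumed for illustration only).
reference = "ACGTTGCATGACCTGAAGT"
index = build_kmer_index(reference, k=4)

exact_hits = align_read("GCATGACC", reference, index, k=4)  # exact copy of a reference segment
snp_hits = align_read("GCATGTCC", reference, index, k=4)    # same segment with one substituted base

print(exact_hits, snp_hits)
```

A read that differs from the reference at a single base still aligns to the same location, and such a mismatch is the kind of evidence that later pipeline stages examine when calling a SNP. Note that the positions here are 0-based, whereas the POS field in SAM records is 1-based.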
