Verification and validation of bioinformatics software without a gold standard: a case study of BWA and Bowtie


RESEARCH

Open Access

Eleni Giannoulatou1,2, Shin-Ho Park1,2, David T Humphreys1,2, Joshua WK Ho1,2*

From Asia Pacific Bioinformatics Network (APBioNet) Thirteenth International Conference on Bioinformatics (InCoB2014)
Sydney, Australia. 31 July - 2 August 2014

Abstract

Background: Bioinformatics software quality assurance is essential in genomic medicine. Systematic verification and validation of bioinformatics software is difficult because it is often not possible to obtain a realistic “gold standard” for systematic evaluation. Here we apply a technique that originates from the software testing literature, namely Metamorphic Testing (MT), to systematically test three widely used short-read sequence alignment programs.

Results: MT alleviates the problems associated with the lack of a gold standard by checking that the results from multiple executions of a program satisfy a set of expected or desirable properties that can be derived from the software specification or user expectations. We tested BWA, Bowtie and Bowtie2 using simulated data and one HapMap dataset. We observed that multiple executions of the same aligner on a slightly modified input FASTQ sequence file, for example after randomly re-ordering the reads, may produce different alignment results. Furthermore, we found that the list of variant calls can be affected unless strict quality control is applied during variant calling.

Conclusion: Thorough testing of bioinformatics software is important in delivering clinical genomic medicine. This paper demonstrates a different framework for testing a program, one that involves checking its properties, thus greatly expanding the number and repertoire of test cases we can apply in practice.
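The read re-ordering property mentioned in the abstract can be illustrated with a minimal sketch of one metamorphic relation: shuffling the input reads should not change where any individual read aligns. This is an illustration under stated assumptions, not the authors' test harness. The names `parse_fastq`, `toy_aligner` and `check_permutation_relation` are hypothetical, and `toy_aligner` is an exact-match stand-in rather than BWA, Bowtie or Bowtie2.

```python
import random
from typing import Callable, Dict, List, Tuple

Read = Tuple[str, str, str]                        # (name, sequence, quality)
Aligner = Callable[[List[Read]], Dict[str, int]]   # read name -> mapped position


def parse_fastq(text: str) -> List[Read]:
    """Parse 4-line FASTQ records into (name, sequence, quality) tuples."""
    lines = text.strip().splitlines()
    reads = []
    for i in range(0, len(lines), 4):
        name = lines[i][1:].split()[0]  # drop the leading '@'
        reads.append((name, lines[i + 1], lines[i + 3]))
    return reads


def toy_aligner(reference: str) -> Aligner:
    """Hypothetical stand-in for a real aligner: exact-match lookup only."""
    def align(reads: List[Read]) -> Dict[str, int]:
        return {name: reference.find(seq) for name, seq, _ in reads}
    return align


def check_permutation_relation(align: Aligner, reads: List[Read], seed: int = 0) -> List[str]:
    """Return the names of reads whose alignment changes after shuffling the input."""
    shuffled = reads[:]
    random.Random(seed).shuffle(shuffled)
    original = align(reads)        # source execution
    follow_up = align(shuffled)    # follow-up execution on the permuted input
    return sorted(n for n in original if original[n] != follow_up.get(n))


# Made-up reference and reads, for illustration only.
REFERENCE = "ACGTTGCAAGGCTTAACCGGTT"
FASTQ = "@r1\nACGTT\n+\nIIIII\n@r2\nAGGCT\n+\nIIIII\n@r3\nCCGGT\n+\nIIIII\n"

reads = parse_fastq(FASTQ)
violations = check_permutation_relation(toy_aligner(REFERENCE), reads)
print(violations)  # an empty list means the relation held for this input
```

No gold-standard alignment is needed here: the check only compares two executions of the same program against each other. With a real aligner, the `align` callable would wrap a command-line invocation and parse its output, and any read reported in `violations` would be a candidate for closer inspection.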

* Correspondence: [email protected]
1 Victor Chang Cardiac Research Institute, Darlinghurst, NSW, Australia
Full list of author information is available at the end of the article

Background

The advent of high-throughput Next Generation Sequencing (NGS) technologies has greatly accelerated the pace of disease gene discoveries and has revolutionised the diagnosis and management of human genetic diseases and cancer [1-5]. Being able to reconstruct the genetic make-up of an individual and accurately predict the effect of pathogenic genetic variants is essential for genetic counselling and making informed decisions regarding medical treatment. The age of personalised genomic medicine is upon us. New bioinformatics tools are being developed at a very rapid pace to analyse such datasets and to cope with the constant generation of new types of “omic” data [6]. Software quality assurance becomes especially critical if bioinformatics tools are to be used in a translational medical setting, such as analysis and interpretation of Whole Exome Sequencing (WES) or Whole Genome Sequencing (WGS) data. We must ensure that only validated algorithms are used, and that they are implemented correctly in the analysis pipeline. More im