A simple data-adaptive probabilistic variant calling model


RESEARCH

Open Access

Steve Hoffmann1,2,3*, Peter F Stadler2,3,4,5,6,7,8 and Korbinian Strimmer9,10

*Correspondence: [email protected]
1 Junior Research Group Transcriptome Bioinformatics, University Leipzig, Härtelstraße 16-18, Leipzig, Germany
2 Interdisciplinary Center for Bioinformatics and Bioinformatics Group, University Leipzig, Härtelstraße 16-18, Leipzig, Germany
Full list of author information is available at the end of the article

Abstract

Background: Several sources of noise obfuscate the identification of single nucleotide variation (SNV) in next-generation sequencing data. For instance, errors may be introduced during library construction and sequencing steps. In addition, the reference genome and the algorithms used for the alignment of the reads are further critical factors determining the efficacy of variant calling methods. It is crucial to account for these factors in individual sequencing experiments.

Results: We introduce a simple data-adaptive model for variant calling. This model automatically adjusts to specific factors such as alignment errors. To achieve this, several characteristics are sampled from sites with low mismatch rates, and these are used to estimate empirical log-likelihoods. The likelihoods are then combined into a score that typically gives rise to a mixture distribution. From this we determine a decision threshold to separate potentially variant sites from the noisy background.

Conclusions: In simulations we show that our simple model is competitive with frequently used, much more complex SNV calling algorithms in terms of sensitivity and specificity. It performs particularly well in cases with low allele frequencies. The application to next-generation sequencing data reveals stark differences in the score distributions, indicating a strong influence of data-specific sources of noise. The proposed model is specifically designed to adjust to these differences.

Background

Recent studies report a strikingly low concordance of currently available methods and pipelines for the identification of single nucleotide variation (SNV), both somatic and germline, indicating that computational methods as well as sequencing protocols have a major impact on the sensitivity and specificity of the variation calling tool [1]. Specifically, the allelic fraction as well as the coverage of the variant allele are crucial determinants for the statistical benchmarks [2,3]. Practical guidelines for SNV callers such as GATK [4] or SAMtools [5] suggest applying rigorous postprocessing filters to reduce the number of false positive calls. Other studies indicate that the application of these filters leads to a substantial improvement in the concordance of the callers [6]. Nevertheless, applying stringent thresholds for variables such as the strand bias, the coverage or read start variation bears the risk of losing important information [7]. These authors emphasize that the different algorithmic and statistical components of a variant caller ha