ANIMAL GENETICS • ORIGINAL PAPER
The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
Krzysztof Kotlarz 1, Joanna Szyda 1,2, Magda Mielczarek 1,2, Tomasz Suchocki 1,2, Bartosz Czech 1, Bernt Guldbrandtsen 3
Received: 26 August 2020 / Revised: 11 September 2020 / Accepted: 18 September 2020
© The Author(s) 2020
Abstract
A downside of next-generation sequencing technology is its high technical error rate. We built a tool that uses array-based genotype information to classify next-generation sequencing–based SNPs into correct and incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm; (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric; and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare event classification problem, like incorrect SNP detection in NGS data, the most parsimonious naïve model and a model with the weighting of SNP classes provided the best results for the classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct and resulted in an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specificity of the training data set and thus resulted in better classification of the independent validation data set than the other tested models.
Keywords Classification · Keras · Next-generation sequencing · Python · SNP calling · SNP microarray · TensorFlow
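The two imbalance-handling strategies named in the abstract, class weighting and random re-sampling with replacement, as well as the F1 score, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The label encoding, the function names, the inverse-frequency weighting scheme, the choice to keep all original incorrect SNPs and add only the extra draws, and the treatment of "incorrect" as the positive class for F1 are all assumptions made for illustration; the weights actually used are not given in this section.

```python
import random
from collections import Counter

INCORRECT, CORRECT = 0, 1  # hypothetical label encoding, not from the paper


def class_weights(labels):
    """Inverse-frequency weights n / (k * count).

    An assumed weighting scheme; a dict of this shape is what Keras
    Model.fit accepts through its class_weight argument.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def oversample_incorrect(labels, ratio, rng):
    """Return row indices in which incorrect SNPs are re-sampled with
    replacement until they number round(ratio * number of correct SNPs).

    Assumption: every original incorrect SNP is kept and only the
    additional rows are drawn at random.
    """
    incorrect = [i for i, y in enumerate(labels) if y == INCORRECT]
    correct = [i for i, y in enumerate(labels) if y == CORRECT]
    target = round(ratio * len(correct))
    extra = rng.choices(incorrect, k=max(0, target - len(incorrect)))
    indices = correct + incorrect + extra
    rng.shuffle(indices)
    return indices


def f1_from_rates(recall, specificity, n_incorrect, n_correct):
    """F1 for the 'incorrect' class (assumed to be the positive class)."""
    tp = recall * n_incorrect                # incorrect SNPs flagged as incorrect
    fn = n_incorrect - tp                    # incorrect SNPs that were missed
    fp = (1.0 - specificity) * n_correct     # correct SNPs flagged as incorrect
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels with roughly the 98% / 2% imbalance described in the abstract.
labels = [CORRECT] * 980 + [INCORRECT] * 20
print(class_weights(labels))

rng = random.Random(0)
for ratio in (0.3, 0.6, 1.0):
    idx = oversample_incorrect(labels, ratio, rng)
    n_incorrect = sum(labels[i] == INCORRECT for i in idx)
    print(ratio, n_incorrect, len(idx))

# Rounded validation figures quoted in the abstract.
print(round(f1_from_rates(0.19, 0.99, 14_908, 749_506), 2))
```

With the rounded percentages from the abstract, the last line gives an F1 of about 0.22, close to the reported 0.21; the small gap is plausibly due to rounding of the published rates. The calculation also shows why F1 stays low despite 99% of correct SNPs being recognized: because correct SNPs outnumber incorrect ones by roughly 50 to 1, even 1% of them misclassified outweighs the true detections.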
Communicated by: Maciej Szydlowski
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s13353-020-00586-0) contains supplementary material, which is available to authorized users.
* Joanna Szyda [email protected]
1 Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631 Wroclaw, Poland
2 Institute of Animal Breeding, Balice, Poland
3 Animal Breeding Group, Department of Animal Sciences, University of Bonn, Bonn, Germany
Introduction
Next-generation sequencing (NGS) technology has led to a tremendous increase in sequencing speed and a decrease in sequencing cost. It allows fast and cost-effective sequencing of whole genomes of many individuals. The downside of sequencing carried out by high-throughput processes is the significant technical (Pfeiffer et al. 2018; Ma et al. 2019) and bioinformatics (Abnizova et al. 2017) error rates. In particular, t