The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA seq


ANIMAL GENETICS • ORIGINAL PAPER

The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines

Krzysztof Kotlarz 1 & Magda Mielczarek 1,2 & Tomasz Suchocki 1,2 & Bartosz Czech 1 & Bernt Guldbrandtsen 3 & Joanna Szyda 1,2

Received: 26 August 2020 / Revised: 11 September 2020 / Accepted: 18 September 2020
© The Author(s) 2020

Abstract

A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNPs into correct and incorrect calls. The deep learning algorithms were implemented via Keras. Several algorithms were tested: (i) the basic, naïve algorithm; (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric; and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs to match 30%/60%/100% of the number of correct SNPs. The training data set was composed of data from three bulls and consisted of 2,227,995 correct (97.94%) and 46,920 incorrect SNPs, while the validation data set consisted of data from one bull with 749,506 correct (98.05%) and 14,908 incorrect SNPs. The results showed that for a rare event classification problem, such as incorrect SNP detection in NGS data, the most parsimonious naïve model and the model with weighting of SNP classes provided the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, and both achieved an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specificity of the training data set and thus classified the independent validation data set better than the other tested models.

Keywords Classification · Keras · Next-generation sequencing · Python · SNP calling · SNP microarray · TensorFlow
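The two ways of handling class imbalance named in the abstract, class weighting and re-sampling of the incorrect SNPs with replacement, can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names, the toy data, and the inverse-frequency weighting scheme are assumptions made here, since the abstract does not state which weights were used. A dictionary of this shape is the kind of value that could be passed to the `class_weight` argument of Keras `Model.fit`.

```python
import random


def oversample_minority(majority, minority, ratio, seed=0):
    """Draw from `minority` with replacement until the sample holds
    round(ratio * len(majority)) items (one reading of the abstract's
    30%/60%/100% re-sampling; an assumption, not the paper's code)."""
    rng = random.Random(seed)
    target = round(ratio * len(majority))
    return rng.choices(minority, k=target)


def inverse_frequency_weights(n_correct, n_incorrect):
    """Hypothetical weighting: each class gets a weight inversely
    proportional to its size, so both classes contribute equally to
    the total loss. Label 0 = correct SNP, label 1 = incorrect SNP."""
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}


def f1_score(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall for the rare class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for SNP records (hypothetical data, not from the paper).
correct = [("correct", i) for i in range(1000)]
incorrect = [("incorrect", i) for i in range(20)]

for ratio in (0.3, 0.6, 1.0):
    resampled = oversample_minority(correct, incorrect, ratio)
    print(f"ratio {ratio:.0%}: {len(resampled)} incorrect SNPs after re-sampling")

# Class sizes of the training set as reported in the abstract.
weights = inverse_frequency_weights(2_227_995, 46_920)
print(weights)
```

Re-sampling changes what the network sees in each epoch, whereas weighting leaves the data untouched and only rescales each class's contribution to the loss.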

Communicated by: Maciej Szydlowski

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s13353-020-00586-0) contains supplementary material, which is available to authorized users.

* Joanna Szyda [email protected]

1 Biostatistics Group, Department of Genetics, Wroclaw University of Environmental and Life Sciences, Kozuchowska 7, 51-631 Wroclaw, Poland
2 Institute of Animal Breeding, Balice, Poland
3 Animal Breeding Group, Department of Animal Sciences, University of Bonn, Bonn, Germany

Introduction

Next-generation sequencing (NGS) technology has led to a tremendous increase in sequencing speed and a decrease in sequencing cost. It allows fast and cost-effective sequencing of whole genomes of many individuals. The downside of sequencing carried out by high-throughput processes is the significant technical (Pfeiffer et al. 2018; Ma et al. 2019) and bioinformatics (Abnizova et al. 2017) error rates. In particular, t
