A machine learning framework for genotyping the structural variations with copy number variant
RESEARCH
Open Access
Tian Zheng1, Xiaoyan Zhu1*, Xuanping Zhang1, Zhongmeng Zhao1, Xin Yi2, Jiayin Wang1 and Hongle Li3*
From 15th International Symposium on Bioinformatics Research and Applications (ISBRA ’19), Barcelona, Spain. 3–6 June 2019
Abstract
Background: Genotyping of structural variations is an important computational problem in next-generation sequencing data analysis. However, in cancer genomes, copy number variants (CNVs) often coexist with other types of structural variation, which significantly reduces the accuracy of existing genotyping methods. A CNV region shows bias in sequencing coverage and variant allelic frequency, which leads genotyping approaches to misinterpret a heterozygote as a homozygote. Furthermore, other data signals, such as split-mapped reads and abnormal reads, are also misjudged because of the CNV. Therefore, genotyping structural variations with CNV is a complicated computational problem that should consider multiple features and their interactions.
Methods: Here we proposed a computational method for genotyping indels in CNV regions, which introduced a machine learning framework to comprehensively incorporate a set of data features and their interactions. We extracted fifteen kinds of classification features as input. Unlike in the traditional genotyping problem, the variant structure here may fall into one of five types: normal homozygote, homozygous variant, heterozygous variant without CNV, heterozygous variant with a CNV on the mutated haplotype, and heterozygous variant with a CNV on the wild haplotype. The Multiclass Relevance Vector Machine (M-RVM) was used as the machine learning framework, combined with the distribution characteristics of the features.
Results: We applied the proposed method to both simulated and real data and compared it with existing popular software, including Gindel, Facets and GATK, as well as with other machine learning cores: Support Vector Machine, Lagrange-SVM with OVO multiple classification, Naïve Bayes and BP Neural Network. The results demonstrated that the proposed method outperforms the others in accuracy, stability and efficiency.
Conclusion: This work shows that the genotyping of structural variations in CNV regions cannot be solved as a traditional genotyping problem. More features should be used to efficiently complete the five-category task. According to the results, the proposed method can be a practical algorithm to correctly genotype structural variations
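To make the allelic-frequency bias described in the Background concrete, the following is a minimal sketch, not the authors' method. It assumes a diploid locus, a CNV that is a pure copy gain of `extra_copies` on one haplotype, and reads sampled in proportion to copy number. The function names and the 0.2/0.8 cutoffs of the naive frequency-only caller are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def class_copies(extra_copies=1):
    """Return (mutated, wild-type) copy counts for the five classes.

    extra_copies is the assumed size of the copy gain (an illustration
    parameter, not a value taken from the paper).
    """
    return {
        "normal homozygote": (0, 2),
        "homozygous variant": (2, 0),
        "heterozygous variant without CNV": (1, 1),
        "heterozygous variant, CNV on mutated haplotype": (1 + extra_copies, 1),
        "heterozygous variant, CNV on wild haplotype": (1, 1 + extra_copies),
    }


def expected_signals(mutated, wild, ploidy=2):
    """Expected variant allelic frequency and coverage relative to diploid."""
    total = mutated + wild
    return Fraction(mutated, total), Fraction(total, ploidy)


def naive_call(vaf, low=Fraction(1, 5), high=Fraction(4, 5)):
    """Hypothetical caller that looks only at allelic frequency."""
    if vaf <= low:
        return "normal homozygote"
    if vaf >= high:
        return "homozygous variant"
    return "heterozygous variant"


if __name__ == "__main__":
    for extra in (1, 4):
        for name, (mut, wild) in class_copies(extra).items():
            vaf, cov = expected_signals(mut, wild)
            print(f"gain={extra} | {name}: VAF={vaf}, "
                  f"coverage x{float(cov):.1f}, naive call: {naive_call(vaf)}")
```

Under these assumptions, a single extra copy shifts the expected frequency of a heterozygote from 1/2 to 2/3 or 1/3, and a gain of four copies pushes it past the illustrative cutoffs, so the frequency-only caller reports a homozygote. This is the kind of misclassification that motivates combining several features rather than relying on allelic frequency alone.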
*Correspondence: [email protected]; [email protected]
1 School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
3 Department of Molecular Pathology, Henan Cancer Hospital, The Affiliated Cancer Hospital of Zhengzhou University, Zhengzhou 450003, China
Full list of author information is available at the end of the article
© The Author(s). 2020 Open Access This article is licensed