GEN2VCF: a converter for human genome imputation output format to VCF format



Genes & Genomics (2020) 42:1163–1168 https://doi.org/10.1007/s13258-020-00982-0

RESEARCH ARTICLE

GEN2VCF: a converter for human genome imputation output format to VCF format

Dong Mun Shin1,3 · Mi Yeong Hwang1 · Bong‑Jo Kim1 · Keun Ho Ryu2,3 · Young Jin Kim1

Received: 6 February 2020 / Accepted: 30 July 2020 / Published online: 16 August 2020
© The Author(s) 2020

Abstract

Background  For a genome-wide association study in humans, genotype imputation is an essential analysis tool for improving association mapping power. When IMPUTE software is used for imputation, its output (GEN format) must be converted to variant call format (VCF) with imputed genotype dosage for association analysis. However, the conversion requires a pipeline of multiple software packages and a large amount of processing time.

Objective  We developed GEN2VCF, a fast and convenient GEN format to VCF conversion tool with dosage support.

Methods  The performance of GEN2VCF was compared to BCFtools, QCTOOL, and Oncofunco. The test data set was a 1 Mb GEN-formatted file of 5000 samples. To assess performance across sample sizes, tests were run with 1000 to 5000 samples in steps of 1000. Runtime and memory usage were used as performance measures.

Results  GEN2VCF substantially outperformed the other methods in both runtime and memory usage. Its runtime was at least 1.4-fold lower, and its memory usage at least 7.4-fold lower, than those of the other methods.

Conclusions  GEN2VCF provides users with efficient conversion from GEN format to VCF with the best-guessed genotype, genotype posterior probabilities, and genotype dosage, as well as great flexibility in implementation with other software packages in a pipeline.

Keywords  Human genome · Imputation · SNP · Converter · Parsing
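To make the conversion concrete, the following is a minimal Python sketch of how one GEN record could be mapped to one VCF record carrying the best-guessed genotype (GT), genotype posterior probabilities (GP), and dosage (DS). It is an illustration under stated assumptions and not the GEN2VCF implementation: the function name `gen_line_to_vcf`, the `chrom` argument, and the 0.9 calling threshold are hypothetical, and allele A is assumed to be the VCF reference allele. The GEN layout assumed here is five leading columns (SNP ID, rsID, position, allele A, allele B) followed by three probabilities (AA, AB, BB) per sample.

```python
# Illustrative sketch only; not the GEN2VCF source code.
GENOTYPES = ("0/0", "0/1", "1/1")  # AA, AB, BB, assuming allele A is REF


def gen_line_to_vcf(line, chrom, threshold=0.9):
    """Convert one GEN record to one tab-separated VCF record.

    chrom is passed in because the first GEN column is not reliably a
    chromosome name. threshold is a hypothetical cutoff: a genotype is
    called only if its probability reaches it, otherwise GT is './.'.
    """
    fields = line.split()
    snp_id, rsid, pos, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    calls = []
    for i in range(0, len(probs), 3):
        triplet = probs[i:i + 3]
        p_aa, p_ab, p_bb = triplet
        # Expected count of allele B: 0*P(AA) + 1*P(AB) + 2*P(BB)
        dosage = p_ab + 2.0 * p_bb
        best = max(range(3), key=lambda k: triplet[k])
        gt = GENOTYPES[best] if triplet[best] >= threshold else "./."
        calls.append(f"{gt}:{p_aa:g},{p_ab:g},{p_bb:g}:{dosage:g}")

    # CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample columns
    fixed = [chrom, pos, rsid, allele_a, allele_b, ".", ".", ".", "GT:GP:DS"]
    return "\t".join(fixed + calls)


if __name__ == "__main__":
    record = "snp1 rs1 1000 A G 0.9 0.1 0 0 0.2 0.8"
    print(gen_line_to_vcf(record, chrom="1"))
```

In this sketch the dosage is the expected number of copies of allele B, so it ranges from 0 to 2 and retains the imputation uncertainty that a hard genotype call discards.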

* Keun Ho Ryu [email protected]; [email protected]
* Young Jin Kim [email protected]

1 Division of Genome Research, Center for Genome Science, National Institute of Health, Osong Health Technology Administration Complex, 187, Osongsaengmyeong 2‑ro, Osong‑eup, Heungdeok‑gu, Cheongju‑si, Chungcheongbuk‑do 28159, Republic of Korea

2 Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City 700000, Vietnam

3 Database and Bioinformatics Laboratory, Department of Computer Science, College of Electrical and Computer Engineering, Chungbuk National University, 28644 Cheongju, Republic of Korea

Introduction

A genome-wide association study (GWAS) is a well-known approach to identify genetic variations associated with complex traits (Visscher et al. 2012). The GWAS Catalog is a free online database that collects GWAS results. As of November 2019, the catalog contained 161,525 variant-trait associations from 4298 publications (https://www.ebi.ac.uk/gwas/) (Buniello et al. 2019). In a GWAS, genotype imputation has been regarded as an essential analysis tool to improve the power of association mapping by estimating tens of millions of variants that are not directly genotyped