Haplotyping a Diploid Single Individual with a Fast and Accurate Enumeration Algorithm

The minimum error correction (MEC) model is one of the important computational models for determining haplotype information from sequencing data, i.e., single individual single nucleotide polymorphism (SNP) haplotyping, haplotype reconstruction or haploty

  • PDF / 476,862 Bytes
  • 13 Pages / 439.37 x 666.142 pts Page_size
  • 13 Downloads / 141 Views

DOWNLOAD

REPORT


Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin 541004, China [email protected] 2 College of Computer Science and Information Technology, Guangxi Normal University, Guilin 541004, China

Abstract. The minimum error correction (MEC) model is one of the important computational models for determining haplotype information from sequencing data, i.e., single individual single nucleotide polymorphism (SNP) haplotyping, haplotype reconstruction or haplotype assembly. Due to the NP-hardness of the model, a fast and accurate enumeration algorithm is proposed for solving it. The presented algorithm reconstructs the SNP sites of a pair of haplotypes one after another. It enumerates two kinds of SNP values, i.e., (0 1)T and (1 0)T, for the SNP site being reconstructed, and chooses the one with more support coming from the SNP fragments that are covering the corresponding SNP site. The experimental comparisons were conducted among the presented algorithm, the FAHR, the Fast Hare and the DGS algorithms. The results prove that our algorithm can get higher reconstruction rate than the other three algorithms. Keywords: Sequence analysis  Haplotype (MEC)  Algorithm  Enumeration



Minimum error correction

1 Introduction Genetic difference has received increasingly extensive attention with the development of high-throughput DNA sequencing technologies. It is well known that all human are almost identical at DNA level (*99.5 % identical). Genomic differences located in *0.5 % nucleotide bases answer for the observed diversities in our phenotypes. Among various genetic variations, single nucleotide polymorphisms (SNPs) are believed to be the most widespread form of variation, and their significance cannot be overvalued for medical, drug-design, diagnostic and forensic applications [1]. Due to linkage disequilibrium (LD) and the absence of recombination event, adjacent SNPs tend to inherit together from ancestors to descendants. A sequence of related SNPs on each copy of a pair of chromosomes is referred to as a haplotype. Studies indicate that haplotypes generally carry more genetic information than individual SNP, and they play a crucial role in gene expression and disease association

© Springer International Publishing Switzerland 2016 D.-S. Huang et al. (Eds.): ICIC 2016, Part I, LNCS 9771, pp. 399–411, 2016. DOI: 10.1007/978-3-319-42291-6_40

400

X. Chen et al.

studies [2, 3]. Since it is both time consuming and expensive to obtain haplotypes through biological experiments directly, it is common to determine haplotypes through computational methods. Three categories of computational methods exist for computing haplotype [4]: haplotyping a population based on population genotype data [5, 6]; haplotyping a population based on sequenced fragment data [7]; haplotyping a single individual based on sequenced fragment data [8]. The third category is studied in this paper. The problem of haplotyping a single individual is also called haplotype reconstruction or haplotype