Haplotype threading: accurate polyploid phasing from long reads


METHOD

Open Access

Sven D. Schrinner1†, Rebecca Serra Mari2,3,4†, Jana Ebler2†, Mikko Rautiainen5,3,4, Lancelot Seillier7, Julia J. Reimer7, Björn Usadel6,7,8, Tobias Marschall2*† and Gunnar W. Klau1,8*†

*Correspondence: [email protected]; [email protected]
† Sven D. Schrinner, Rebecca Serra Mari and Jana Ebler are joint first authors.
† Tobias Marschall and Gunnar W. Klau are joint last authors.
2 Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Moorenstraße 5, 40225 Düsseldorf, Germany
1 Algorithmic Bioinformatics, Heinrich Heine University Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
Full list of author information is available at the end of the article

Abstract

Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. Polyploid phasing still presents considerable challenges, especially in regions of collapsing haplotypes. We present WhatsHap polyphase, a novel two-stage approach that addresses these challenges by (i) clustering reads and (ii) threading the haplotypes through the clusters. Our method outperforms the state-of-the-art in terms of phasing quality. Using a real tetraploid potato dataset, we demonstrate how to assemble local genomic regions of interest at the haplotype level. Our algorithm is implemented as part of the widely used open source tool WhatsHap.

Keywords: Polyploidy, Phasing, Haplotypes, Cluster editing, High-throughput nucleotide sequencing, Plant science, Sequence analysis

Background

Polyploid genomes have more than two homologous sets of chromosomes. Polyploidy is common in many plant species, including important food crops like potato (Solanum tuberosum), bread wheat (Triticum aestivum), and durum wheat (Triticum durum). Resolving polyploid genomes at the haplotype level, i.e., assembling the sequences of alleles residing on the same chromosome, is crucial for understanding the evolutionary history of polyploid species: evolutionary events, such as whole-genome duplications, can be traced back and reveal the ancestry of polyploid organisms [1]. Beyond that, knowledge of haplotypes is key for advanced breeding strategies or genome engineering, especially for improving yield quality in important crop species [1–3].

In this work, we focus on phasing from long read information. Plant genomes typically exhibit many highly repetitive regions and have frequently undergone structural variation events, which makes alignments from short reads alone problematic. Although long reads suffer from a higher number of sequencing errors, they align better to the reference

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the s