Long-read-based human genomic structural variation detection with cuteSV



METHOD

Open Access

Tao Jiang1†, Yongzhuang Liu1†, Yue Jiang2, Junyi Li3, Yan Gao1, Zhe Cui1, Yadong Liu1, Bo Liu1* and Yadong Wang1*

* Correspondence: [email protected]. cn; [email protected]
† Tao Jiang and Yongzhuang Liu contributed equally to this work.
1 Center for Bioinformatics, School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, Heilongjiang, China
Full list of author information is available at the end of the article

Abstract

Long-read sequencing is promising for the comprehensive discovery of structural variations (SVs). However, it is still non-trivial to achieve high yields and performance simultaneously due to the complex SV signatures implied by noisy long reads. We propose cuteSV, a sensitive, fast, and scalable long-read-based SV detection approach. cuteSV uses tailored methods to collect the signatures of various types of SVs and employs a clustering-and-refinement method to implement sensitive SV detection. Benchmarks on simulated and real long-read sequencing datasets demonstrate that cuteSV has higher yields and scaling performance than state-of-the-art tools. cuteSV is available at https://github.com/tjiangHIT/cuteSV.

Keywords: Structural variants detection, Long-read sequencing, Scaling performance
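To make the idea of collecting SV signatures and then clustering them more concrete, the following is a minimal sketch and not cuteSV's actual implementation. It assumes that each read alignment is given as a reference start position plus a SAM-style CIGAR string, extracts long insertions and deletions as signatures, and groups nearby signatures of the same type. The function names (`indel_signatures`, `cluster_signatures`) and the parameters `min_len`, `max_dist`, and `min_support` are hypothetical and chosen only for illustration; the refinement step mentioned above is not modelled.

```python
import re
from statistics import median
from typing import List, Tuple

# Assumption: a signature is (SV type, reference position, length).
Signature = Tuple[str, int, int]

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_signatures(ref_start: int, cigar: str, min_len: int = 50) -> List[Signature]:
    """Walk a CIGAR string and report insertions/deletions of at least min_len bp."""
    sigs: List[Signature] = []
    pos = ref_start
    for length_str, op in CIGAR_RE.findall(cigar):
        length = int(length_str)
        if op == "I" and length >= min_len:
            sigs.append(("INS", pos, length))
        elif op == "D" and length >= min_len:
            sigs.append(("DEL", pos, length))
        if op in "MDN=X":  # only these operations advance the reference position
            pos += length
    return sigs


def _emit(group: List[Signature], min_support: int, calls: list) -> None:
    """Summarise one cluster as (type, median position, median length, support)."""
    if len(group) >= min_support:
        pos = int(median(s[1] for s in group))
        length = int(median(s[2] for s in group))
        calls.append((group[0][0], pos, length, len(group)))


def cluster_signatures(sigs: List[Signature], max_dist: int = 200, min_support: int = 2) -> list:
    """Greedy clustering: same-type signatures within max_dist bp share a cluster."""
    calls: list = []
    group: List[Signature] = []
    for sig in sorted(sigs):  # sorts by type, then by position
        if group and (sig[0] != group[-1][0] or sig[1] - group[-1][1] > max_dist):
            _emit(group, min_support, calls)
            group = []
        group.append(sig)
    if group:
        _emit(group, min_support, calls)
    return calls


if __name__ == "__main__":
    # Hypothetical alignments: (reference start, CIGAR string).
    reads = [(1000, "500M60D400M"), (1200, "310M58D300M"), (5000, "50S200M120I300M")]
    signatures = [s for start, cigar in reads for s in indel_signatures(start, cigar)]
    print(cluster_signatures(signatures))
```

In this toy input, two reads support a deletion near the same position and are merged into one call, while the insertion seen in a single read is dropped by the support filter. Real long-read data would also need split-alignment signatures and handling of sequencing noise, which this sketch leaves out.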

Background

Structural variations (SVs) represent genomic rearrangements such as deletions, insertions, inversions, duplications, and translocations whose sizes are larger than 50 bp [1]. As the largest divergences across human genomes [2], SVs are closely related to human diseases (e.g., inherited diseases [3–5] and cancers [6]), evolution (e.g., gene losses and transposon activity [7, 8]), gene regulation (e.g., rearrangements of transcription factors [9]), and other phenotypes (e.g., mating and intrinsic reproductive isolation [10, 11]).

Efforts have been made to develop short-read-based SV calling approaches [12, 13]. Most of them use methods such as read depth [14], discordant read pairs [15], split-read alignments [16], local assembly [17], or combinations of these [18–20], and they have played important roles in large-scale genomics studies such as the 1000 Genomes Project [1]. However, the relatively short read length limits the sensitivity of SV detection with these tools [21], and false positives occur as well [22].

With the rapid development of long-read sequencing technologies, such as the Pacific Biosciences (PacBio) [23] and Oxford Nanopore Technologies (ONT) [24] platforms, long-range spanning information provides the opportunity to detect SVs more comprehensively and at a higher resolution [25]. However, novel computational approaches are required to handle the high sequencing error rates (typically 5–20%) and large

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original au