RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads
METHODOLOGY ARTICLE
Open Access
Xingyu Liao1*†, Xin Gao2†, Xiankai Zhang1, Fang‑Xiang Wu3 and Jianxin Wang1
*Correspondence: [email protected]
† Xingyu Liao and Xin Gao have contributed equally to this work.
1 School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, Changsha 410083, China. Full list of author information is available at the end of the article.
Abstract
Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools obtain repeats by assembling high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions in the genome with existing tools.
Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. First, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Second, RepAHR filters the high-frequency reads out of the whole set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled into repeats using SPAdes, which is regarded as an outstanding genome assembler for NGS sequences.
Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
Keywords: De novo repeat identification, NGS reads, The high-frequency k-mers, The high-frequency reads, Assembly
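The first two steps summarized in the Results paragraph (finding high-frequency k-mers, then selecting high-frequency reads) can be illustrated with a minimal Python sketch. This is not the RepAHR implementation: the abstract only says that reads are filtered "according to certain rules", so the rule used here (keep a read when at least `min_fraction` of its k-mers are high-frequency) and the names `high_frequency_kmers`, `high_frequency_reads`, `min_count` and `min_fraction` are assumptions made for illustration. Reverse complements and sequencing errors are ignored, and the final assembly step with SPAdes is not shown.

```python
from collections import Counter


def kmers(read, k):
    """Yield every substring of length k in a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Return the set of k-mers seen at least min_count times across all reads.

    min_count is a hypothetical threshold, not a value from the paper.
    """
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, count in counts.items() if count >= min_count}


def high_frequency_reads(reads, k, hf_kmers, min_fraction):
    """Keep reads whose share of high-frequency k-mers is >= min_fraction.

    This selection rule is an assumption; the abstract does not state
    the actual rules used by RepAHR.
    """
    selected = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k, nothing to score
        hits = sum(1 for km in read_kmers if km in hf_kmers)
        if hits / len(read_kmers) >= min_fraction:
            selected.append(read)
    return selected


# Toy input: three reads drawn from an "ACGT" repeat and one unrelated read.
reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TTGACCAGGA"]
hf = high_frequency_kmers(reads, k=4, min_count=3)
repeat_reads = high_frequency_reads(reads, k=4, hf_kmers=hf, min_fraction=0.8)
```

On this toy input the three reads from the repeated region are kept and the unrelated read is dropped. In a real pipeline the selected reads would then be written to a file and passed to an assembler, so that whole reads, not isolated k-mers, are the input to the assembly step.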
Background
Repetitive sequences are patterns of nucleic acids that occur multiple times in a genome in the same or an approximate form. Based on their structure and distribution in the genome, repetitive sequences are classified into several types, such as tandem repeats and interspersed repeats. Tandem repeats consist of repetitive elements adjacent to each other, and they are categorized into satellites, minisatellites and microsatellites based on the size of the repeated element and the level of repetition.
© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.