RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads

METHODOLOGY ARTICLE

Open Access

Xingyu Liao1*†, Xin Gao2†, Xiankai Zhang1, Fang-Xiang Wu3 and Jianxin Wang1

*Correspondence: [email protected]
† Xingyu Liao and Xin Gao have contributed equally to this work.
1 School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, ChangSha 410083, China
Full list of author information is available at the end of the article

Abstract

Background: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools obtain repeats by assembling the high-frequency k-mers. However, assemblers require a certain degree of sequence coverage to produce the desired assemblies. In addition, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For these reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools.

Results: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads out of the whole set of NGS reads according to certain rules based on the high-frequency k-mers. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered an outstanding genome assembler for NGS sequences.

Conclusions: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.

Keywords: De novo repeat identification, NGS reads, The high-frequency k-mers, The high-frequency reads, Assembly
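The first two steps described in the Results paragraph can be illustrated with a minimal Python sketch. It is not the RepAHR implementation: the abstract only says that reads are filtered "according to certain rules", so the rule used here (keep a read when at least `min_fraction` of its k-mers are high-frequency) is an assumption, as are the function names, the thresholds `min_count` and `min_fraction`, and the toy reads. Reverse complements and sequencing errors are ignored for brevity.

```python
from collections import Counter


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def high_frequency_kmers(reads, k, min_count):
    """Step 1: count k-mers over all reads, keep those seen >= min_count times."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}


def high_frequency_reads(reads, k, hf_kmers, min_fraction):
    """Step 2 (assumed rule): keep reads whose share of high-frequency
    k-mers is at least min_fraction."""
    selected = []
    for read in reads:
        total = len(read) - k + 1
        if total <= 0:
            continue  # read is shorter than k, so it has no k-mers
        hits = sum(1 for kmer in kmers(read, k) if kmer in hf_kmers)
        if hits / total >= min_fraction:
            selected.append(read)
    return selected


# Toy input: three reads drawn from an "ACGT" repeat and one unrelated read.
reads = [
    "ACGTACGTAC",
    "CGTACGTACG",
    "GTACGTACGT",
    "TTGACCAGGA",
]
hf = high_frequency_kmers(reads, k=4, min_count=3)
repeat_reads = high_frequency_reads(reads, 4, hf, min_fraction=0.5)
```

In this toy run the three reads taken from the repeat are retained and the unrelated read is dropped. In the third step, the retained reads would be written to a FASTA or FASTQ file and handed to an assembler such as SPAdes. Working with whole reads rather than isolated k-mers is the design point the abstract emphasizes, since it preserves more of the structure of the repetitive regions.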

Background

Repetitive sequences are patterns of nucleic acids that occur multiple times in the genome in the same or an approximate form. Based on their structure and distribution in the genome, repetitive sequences are classified into several types, such as tandem repeats and interspersed repeats. Tandem repeats consist of repetitive elements adjacent to each other, and they are categorized into satellites, minisatellites and microsatellites based on the size of the repetitive element and the level of repetition.

© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
