CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies


Open Access

SOFTWARE

Van‑Kien Bui1 and Chaochun Wei1,2*

*Correspondence: [email protected]

1 Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China

Full list of author information is available at the end of the article

Abstract

Background: Current taxonomic classification tools use exact string matching algorithms, which are effective for data from next-generation sequencing technology. However, the unique error patterns of third generation sequencing (TGS) technologies can reduce the accuracy of these programs.

Results: We developed a Classification tool using Discriminative K-mers and Approximate Matching algorithm (CDKAM). The approximate matching method is used to search for k-mers and has two phases: a quick mapping phase and a dynamic programming phase. Simulated datasets as well as real TGS datasets were used to compare the performance of CDKAM with existing methods. We showed that CDKAM performed better in many aspects, especially when classifying TGS data with an average length of 1000–1500 bases.

Conclusions: CDKAM is an effective program with higher accuracy and a lower memory requirement for TGS metagenome sequence classification. It achieves high species-level accuracy.

Keywords: Third generation sequencing, Taxonomic classification, Discriminative k-mer, Approximate matching
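The abstract names two phases for approximate k-mer search but does not specify them at this point. As a rough illustration of the general idea only, the Python sketch below pairs a cheap seed-based candidate filter with an edit-distance dynamic program. It is not CDKAM's actual implementation: the function names, the seed length, the edit threshold and the toy k-mer table are all assumptions made for this example.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def build_seed_index(kmer_to_taxon, seed_len):
    """Index reference k-mers by their prefix and suffix seeds (hypothetical scheme)."""
    index = defaultdict(set)
    for kmer in kmer_to_taxon:
        index[("p", kmer[:seed_len])].add(kmer)
        index[("s", kmer[-seed_len:])].add(kmer)
    return index


def approx_lookup(query, kmer_to_taxon, index, seed_len=4, max_edits=1):
    """Return the taxon of the closest reference k-mer within max_edits, else None."""
    # Quick phase: an exact hit needs no further work.
    if query in kmer_to_taxon:
        return kmer_to_taxon[query]
    # Quick phase, continued: keep only k-mers sharing a prefix or suffix seed.
    candidates = (index.get(("p", query[:seed_len]), set())
                  | index.get(("s", query[-seed_len:]), set()))
    # Dynamic programming phase: verify each candidate by edit distance.
    best, best_dist = None, max_edits + 1
    for kmer in sorted(candidates):
        dist = edit_distance(query, kmer)
        if dist < best_dist:
            best, best_dist = kmer, dist
    return kmer_to_taxon[best] if best is not None else None


# Toy reference table (illustrative only, not real discriminative k-mers).
kmers = {"ACGTACGT": "taxA", "TTGGCCAA": "taxB"}
index = build_seed_index(kmers, seed_len=4)
```

With this toy table, a query such as `approx_lookup("ACGTACCT", kmers, index)` differs from the first reference k-mer by one substitution, so it still resolves to that k-mer's taxon, whereas an exact-match lookup would miss it. The seed filter is a heuristic and can discard true matches whose errors fall inside both seeds.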

Background

Metagenome sequencing is a powerful approach to study microbial communities in natural environments [1]. In a metagenomics pipeline, taxonomic classification aims to accurately assign each fragment to its corresponding host organism and is one of the most important initial steps. With the progress of sequencing technology, modern metagenomics methods need to deal with vast sequence datasets. Identifying taxa for billions of reads against a reference database containing the many thousands of microbial genomes available today is becoming a time-consuming process. As the NCBI database continues to grow and become more complete, we have to consider the trade-off between the size of the reference database, the classification accuracy and the computational cost.

© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted us