CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies


Open Access

SOFTWARE

Van-Kien Bui1 and Chaochun Wei1,2*

*Correspondence: [email protected]

1 Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China

Full list of author information is available at the end of the article

Abstract

Background: Current taxonomic classification tools use exact string matching algorithms, which are effective for data from next-generation sequencing technology. However, the unique error patterns of third-generation sequencing (TGS) technologies could reduce the accuracy of these programs.

Results: We developed a Classification tool using Discriminative K-mers and Approximate Matching algorithm (CDKAM). The approximate matching method is used for searching k-mers and consists of two phases: a quick mapping phase and a dynamic programming phase. Simulated datasets as well as real TGS datasets were tested to compare the performance of CDKAM with existing methods. We showed that CDKAM performed better in many aspects, especially when classifying TGS data with an average length of 1000–1500 bases.

Conclusions: CDKAM is an effective program with higher accuracy and a lower memory requirement for TGS metagenome sequence classification. It achieves high species-level accuracy.

Keywords: Third generation sequencing, Taxonomic classification, Discriminative k-mer, Approximate matching
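To make the idea of a two-phase approximate k-mer search concrete, the following is a minimal Python sketch. It is not CDKAM's implementation: the exact hash lookup standing in for the "quick mapping phase", the linear scan over reference k-mers, the `max_errors` threshold, and the names `edit_distance`, `lookup_kmer`, and `index` are all assumptions made for illustration. Only the general pattern (a cheap first pass, then dynamic programming to tolerate sequencing errors) follows the abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def lookup_kmer(query: str, index: dict, max_errors: int = 1):
    """Return the taxon label for a query k-mer, or None if nothing is close.

    Phase 1 (assumed stand-in for a quick mapping step): exact hash lookup.
    Phase 2: dynamic programming against each reference k-mer, accepting
    the closest one within max_errors edits. The linear scan is for
    illustration only and would be far too slow for a real database.
    """
    if query in index:
        return index[query]

    best_label, best_dist = None, max_errors + 1
    for ref, label in index.items():
        dist = edit_distance(query, ref)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical toy index mapping discriminative k-mers to taxon labels.
index = {"ACGTACGT": "taxon_A", "TTGACCAA": "taxon_B"}

exact_hit = lookup_kmer("ACGTACGT", index)    # found in phase 1
substituted = lookup_kmer("ACGTTCGT", index)  # one substitution, phase 2
deleted = lookup_kmer("ACGACGT", index)       # one deletion, phase 2
no_hit = lookup_kmer("GGGGGGGG", index)       # too far from every k-mer
```

In this sketch, a k-mer carrying a single substitution or deletion still maps to its taxon, whereas an exact-match-only lookup would miss it. That is the kind of error tolerance the abstract motivates for TGS reads.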

Background

Metagenome sequencing is a powerful approach to study microbial communities in natural environments [1]. In a metagenomics pipeline, taxonomic classification aims to accurately assign each fragment to its corresponding host organism and is one of the most important initial steps. With the progress of sequencing technology, modern metagenomics methods need to deal with vast sequence datasets. Identifying taxa for billions of reads against a reference database containing the many thousands of microbial genomes available today is becoming a time-consuming process. As the NCBI database continues to grow and become more complete, we have to consider the trade-off between the size of the reference database, the classification accuracy, and the computational cost.

© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted us
