Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life

RESEARCH ARTICLE

Open Access

Zhengqiao Zhao1, Alexandru Cristian2 and Gail Rosen1*

*Correspondence: [email protected]
1 Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University, Market Street, Philadelphia, US
Full list of author information is available at the end of the article

Abstract

Background: Current metagenomic classifiers struggle computationally to keep pace with the training data generated by genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to the training data, statically trained classifiers must be retrained on all of the data, which is highly inefficient. The rich literature on "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier on all data.

Results: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, runs in one quarter of the non-incremental time with no accuracy loss.

Conclusions: Classification clearly improves when the classifier has the most current knowledge at its disposal. It is therefore of utmost importance to make classifiers computationally tractable so that they can keep up with the data deluge. The incremental learning classifier can be updated efficiently without reprocessing, or even accessing, the existing database, and therefore saves both storage and computation resources.

Keywords: Incremental learning, Naïve Bayes taxonomic classifier, RefSeq, Metagenomics
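The incremental update described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a multinomial naïve Bayes model over k-mer counts with add-one smoothing and a uniform prior over labels, and the class name `IncrementalNB`, the helper `kmer_counts`, and the k-mer length `K = 3` are hypothetical choices made only for illustration. The point it shows is that a new genome is absorbed by adding its k-mer counts to the stored per-label counts, so earlier genomes never need to be reprocessed or kept.

```python
import math
from collections import Counter, defaultdict

K = 3  # hypothetical k-mer length, chosen small for illustration


def kmer_counts(seq, k=K):
    """Count overlapping k-mers in a DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class IncrementalNB:
    """Toy multinomial naive Bayes over k-mer counts (illustrative only)."""

    def __init__(self, k=K):
        self.k = k
        self.counts = defaultdict(Counter)  # label -> k-mer counts
        self.totals = Counter()             # label -> total k-mers seen

    def update(self, label, genome):
        """Absorb one new genome by adding its counts to the stored ones."""
        new = kmer_counts(genome, self.k)
        self.counts[label].update(new)
        self.totals[label] += sum(new.values())

    def log_likelihood(self, label, read):
        """Log P(read | label) with add-one smoothing over 4**k k-mers."""
        denom = self.totals[label] + 4 ** self.k
        stored = self.counts[label]
        return sum(
            n * math.log((stored[kmer] + 1) / denom)
            for kmer, n in kmer_counts(read, self.k).items()
        )

    def classify(self, read):
        """Return the label with the highest likelihood (uniform prior)."""
        return max(self.counts, key=lambda lab: self.log_likelihood(lab, read))


# Usage: train on an initial "snapshot", then add a genome later.
model = IncrementalNB()
model.update("A", "ACGTACGTACGTACGT")
model.update("B", "GGGCCCGGGCCCGGGCCC")
model.update("C", "TTTTTTTTTT")  # later snapshot; A and B are untouched
print(model.classify("ACGTACG"))
```

Because the stored state is only the per-label count tables, the cost of an update in this sketch depends on the size of the new genome rather than on the size of everything seen so far.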

Background

Recent advances in genomics have resulted in exponential increases in the rate at which data is collected. Inspired by Zynda [1], we visualize the growth of the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) bacterial genome database [2, 3] in Fig. 1. Figure 1a shows the total number of complete genomes

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and
