Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets


METHODOLOGY

Open Access

Isabel F. Escapa1,2,3†, Yanmei Huang1,2†, Tsute Chen1,2, Maoxuan Lin1, Alexis Kokaras1, Floyd E. Dewhirst1,2 and Katherine P. Lemon1,3,4,5*

Abstract

Background: The low cost of 16S rRNA gene sequencing facilitates population-scale molecular epidemiological studies. Existing computational algorithms can resolve 16S rRNA gene sequences into high-resolution amplicon sequence variants (ASVs), which represent consistent labels comparable across studies. Assigning these ASVs to species-level taxonomy strengthens the ecological and/or clinical relevance of 16S rRNA gene-based microbiota studies and further facilitates data comparison across studies.

Results: To achieve this, we developed a broadly applicable method for constructing high-resolution training sets based on the phylogenetic relationships among microbes found in a habitat of interest. When used with the naïve Bayesian Ribosomal Database Project (RDP) Classifier, this training set achieved species/supraspecies-level taxonomic assignment of 16S rRNA gene-derived ASVs. The key steps for generating such a training set are (1) constructing an accurate and comprehensive phylogeny-based, habitat-specific database; (2) compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon in the database; (3) trimming the training set to match the sequenced regions, if necessary; and (4) placing species sharing closely related sequences into a training-set-specific supraspecies taxonomic level to preserve subgenus-level resolution. As proof of principle, we developed a V1–V3 region training set for the bacterial microbiota of the human aerodigestive tract using the full-length 16S rRNA gene reference sequences compiled in our expanded Human Oral Microbiome Database (eHOMD). We also overcame technical limitations to successfully use Illumina sequences for the 16S rRNA gene V1–V3 region, the most informative segment for classifying bacteria native to the human aerodigestive tract. Finally, we generated a full-length eHOMD 16S rRNA gene training set, which we used in conjunction with an independent PacBio single molecule, real-time (SMRT)-sequenced sinonasal dataset to validate the representation of species in our training set. This also established the effectiveness of a full-length training set for assigning taxonomy of long-read 16S rRNA gene datasets.
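Steps (3) and (4) of the Results can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the exact-match primer search, and the rule that species are merged when they share an identical trimmed sequence are all assumptions made for illustration. The abstract says only that species "sharing closely related sequences" are grouped, without giving the criterion.

```python
from collections import defaultdict


def trim_to_region(seq, fwd_primer, rev_primer_rc):
    """Return the stretch between two exact primer matches, or None.

    Assumption: primers match exactly; real data would need tolerant matching.
    """
    start = seq.find(fwd_primer)
    if start == -1:
        return None
    start += len(fwd_primer)
    end = seq.find(rev_primer_rc, start)
    if end == -1:
        return None
    return seq[start:end]


def group_supraspecies(trimmed_by_species):
    """Merge species that share at least one identical trimmed sequence.

    Assumption: exact identity stands in for "closely related".
    Input maps species name -> list of trimmed sequences.
    Output maps a joined group label -> sorted list of member species.
    """
    parent = {sp: sp for sp in trimmed_by_species}

    def find(sp):
        # Union-find lookup with path halving.
        while parent[sp] != sp:
            parent[sp] = parent[parent[sp]]
            sp = parent[sp]
        return sp

    first_owner = {}  # trimmed sequence -> first species seen carrying it
    for sp, seqs in trimmed_by_species.items():
        for s in seqs:
            if s in first_owner:
                parent[find(sp)] = find(first_owner[s])
            else:
                first_owner[s] = sp

    groups = defaultdict(list)
    for sp in trimmed_by_species:
        groups[find(sp)].append(sp)
    return {"_".join(sorted(m)): sorted(m) for m in groups.values()}


def build_training_groups(refs, fwd_primer, rev_primer_rc):
    """Trim every reference sequence, drop untrimmable ones, then group species."""
    trimmed = {}
    for sp, seqs in refs.items():
        kept = [t for t in (trim_to_region(s, fwd_primer, rev_primer_rc) for s in seqs) if t]
        if kept:
            trimmed[sp] = kept
    return group_supraspecies(trimmed)
```

Under these assumptions, two species that differ elsewhere in the gene but are identical within the trimmed region end up under one joined label, which mirrors why trimming to the sequenced region can make a supraspecies level necessary.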

* Correspondence: [email protected]
† Isabel F. Escapa and Yanmei Huang contributed equally to this work.
1 Forsyth Institute (Microbiology), Cambridge, MA, USA
3 Department of Molecular Virology & Microbiology, Alkek Center for Metagenomics & Microbiome Research, Baylor College of Medicine, Houston, TX, USA
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharin
