RESEARCH ARTICLE
Open Access
In silico benchmarking of metagenomic tools for coding sequence detection reveals the limits of sensitivity and precision

Jonathan Louis Golob1 and Samuel Schwartz Minot2*
*Correspondence: [email protected] 2 Microbiome Research Initiative, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, E4‑100, Seattle, WA 98109‑1024, USA Full list of author information is available at the end of the article
Abstract

Background: High-throughput sequencing can establish the functional capacity of a microbial community by cataloging the protein-coding sequences (CDS) present in the metagenome of the community. The relative performance of different computational methods for identifying CDS from whole-genome shotgun sequencing is not fully established.

Results: Here we present an automated benchmarking workflow, using synthetic shotgun sequencing reads for which we know the true CDS content of the underlying communities, to determine the relative performance (sensitivity, positive predictive value or PPV, and computational efficiency) of different metagenome analysis tools for extracting the CDS content of a microbial community. Assembly-based methods are limited by coverage depth, with poor sensitivity for CDS at low coverage depth.
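The benchmarking described above reduces to comparing each tool's predicted CDS catalog against the known CDS content of a simulated community. As a minimal illustration only (not the published workflow; the function name and CDS identifiers below are hypothetical), sensitivity and PPV for one community can be computed with simple set operations:

def benchmark_cds_calls(true_cds: set, predicted_cds: set) -> dict:
    """Compare a predicted CDS catalog against the known (simulated) truth."""
    tp = len(true_cds & predicted_cds)   # true CDS that were detected
    fn = len(true_cds - predicted_cds)   # true CDS that were missed
    fp = len(predicted_cds - true_cds)   # predictions with no true counterpart
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # TP / (TP + FN)
        "ppv": tp / (tp + fp) if (tp + fp) else 0.0,          # TP / (TP + FP)
    }

# Toy example: four true CDS, three predictions, two of them correct.
truth = {"cds_A", "cds_B", "cds_C", "cds_D"}
called = {"cds_A", "cds_B", "cds_X"}
print(benchmark_cds_calls(truth, called))  # sensitivity 0.5, PPV ~0.67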
References with uneven depth of coverage are pruned, defined as those for which

    STD / Mean > 1.0        (1)

where STD is the standard deviation and Mean is the mean of the per-base coverage values for that reference.
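A minimal sketch of this evenness filter is given below, assuming the per-base depths have already been tallied for each reference; the function name, threshold argument, and example depth vectors are hypothetical and are not taken from the FAMLI implementation.

import statistics

def passes_evenness_filter(per_base_depth, max_ratio=1.0):
    """Keep a reference only if its coverage is sufficiently even.

    per_base_depth: read depth at each position along the reference.
    The reference is pruned when STD / Mean of the depths exceeds max_ratio.
    """
    mean_depth = statistics.mean(per_base_depth)
    if mean_depth == 0:
        return False  # no coverage at all
    std_depth = statistics.pstdev(per_base_depth)
    return (std_depth / mean_depth) <= max_ratio

# Evenly covered reference: kept as a candidate.
print(passes_evenness_filter([9, 10, 11, 10, 10, 9, 11]))   # True
# Coverage piled onto one short region: pruned.
print(passes_evenness_filter([0, 0, 0, 0, 0, 60, 58, 0]))   # False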
4. Calculate an initial score for a given query coming from a subject, using the alignment bitscores to weight the relative possibilities for that query and normalizing the scores to total 1 for each query.

5. Iteratively, until no further references are pruned or a maximum number of iterations is reached:
   (1) WEIGHTING and RENORMALIZING: the score of each query being from a given subject is weighted by the sum of scores for that subject from the prior iteration, and the scores are then renormalized to sum to 1 for each query.
   (2) PRUNING: determine the maximum likelihood for each query, and prune away all other likelihoods for that query that fall below a threshold.

6. Repeat filtering steps 2–3 using the set of deduplicated alignments resulting from step 4.

Here are some examples. For reference A and reference B that both have some aligning query reads, if there is uneven depth for reference A but relatively even depth across reference B, then reference A is removed from the candidate list while reference B is kept as a candidate. If query read #1 aligns equally well to reference A and reference C, but there is 2× more query read depth for reference A than for reference C across the entire sample, then reference C's alignment is removed from the list of candidates for query read #1.

A more detailed description of the method is available in Additional file 1. An interactive demonstration of our algorithm is available as a Jupyter notebook at https://github.com/FredHutch/FAMLI/blob/master/schematic/FAMLI-schematic-figure-GB.ipynb.
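Steps 4 and 5 amount to an iterative, bitscore-seeded redistribution of each query's weight among its candidate subjects. The sketch below illustrates that loop; it is not the FAMLI source code, and the alignment tuples, pruning rule, prune_frac threshold, and iteration cap are assumptions made for the example.

from collections import defaultdict

def iteratively_assign(alignments, prune_frac=0.9, max_iter=100):
    """Sketch of bitscore-weighted, iterative query-to-subject assignment.

    alignments: list of (query, subject, bitscore) tuples.
    Returns {query: {subject: score}} after iterative reweighting and pruning.
    """
    # Step 4: initial scores from bitscores, normalized to sum to 1 per query.
    scores = defaultdict(dict)
    for query, subject, bitscore in alignments:
        scores[query][subject] = float(bitscore)
    for query in scores:
        total = sum(scores[query].values())
        for subject in scores[query]:
            scores[query][subject] /= total

    # Step 5: iterate until no candidates are pruned or max_iter is reached.
    for _ in range(max_iter):
        # WEIGHTING: total evidence for each subject across all queries.
        subject_weight = defaultdict(float)
        for query in scores:
            for subject, score in scores[query].items():
                subject_weight[subject] += score
        # Reweight each query's candidates, then RENORMALIZE to sum to 1.
        for query in scores:
            for subject in scores[query]:
                scores[query][subject] *= subject_weight[subject]
            total = sum(scores[query].values())
            for subject in scores[query]:
                scores[query][subject] /= total
        # PRUNING: drop candidates far below each query's best score.
        pruned = False
        for query in scores:
            best = max(scores[query].values())
            keep = {s: v for s, v in scores[query].items() if v >= prune_frac * best}
            if len(keep) < len(scores[query]):
                scores[query] = keep
                pruned = True
        if not pruned:
            break
    return dict(scores)

# Query q1 hits references A and C equally well, but A is supported by many
# other reads, so q1 ends up assigned to A (the reference A vs. C example above).
alns = [("q1", "A", 50), ("q1", "C", 50)] + [(f"q{i}", "A", 50) for i in range(2, 6)]
print(iteratively_assign(alns))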
Comparison of FAMLI to HUMAnN2, SPAdes, top hit, and top 20

Simulation of microbial communities

Synthetic microbial communities