Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC


SOFTWARE

Open Access

Paul Saary, Alex L. Mitchell and Robert D. Finn*

* Correspondence: [email protected]
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

Abstract
Microbial eukaryotes constitute a significant fraction of biodiversity and have recently gained more attention, but the recovery of high-quality metagenomic assembled eukaryotic genomes is limited by the current availability of tools. To help address this, we have developed EukCC, a tool for estimating the quality of eukaryotic genomes based on the automated dynamic selection of single copy marker gene sets. We demonstrate that our method outperforms current genome quality estimators, particularly for estimating contamination, and have applied EukCC to datasets derived from two different environments to enable the identification of novel eukaryote genomes, including one from the human skin.

Keywords: Metagenomics, Eukaryotes, Genome quality estimation, Metagenome assembled genomes, Malassezia, Bathycoccus

Background
The DNA of microorganisms is routinely extracted, sequenced and assembled into genomes, both from isolate cultures and within the context of metagenomic analyses. Estimating the quality of the recovered genome is crucial to prevent incomplete or contaminated genomes from being published. Single copy marker genes (SCMGs) are routinely used to estimate the quality of a newly assembled genome. As these genes are expected to occur only once within a genome, comparing the number of SCMGs found within a draft genome to the number of expected marker genes provides an estimation of completeness, while additional copies of a marker gene can be used as an indicator of contamination. This approach has been widely accepted for prokaryotes and eukaryotes alike [1–4]. For prokaryotic genomes, CheckM [4] is the most widely used tool for estimating completeness and contamination, although other approaches have also been used, and sets of prokaryotic SCMGs are also provided by BUSCO [2, 5] and anvi’o [6]. CheckM uses an initial set of universal SCMGs to identify the clade of a genome and subsequently uses clade-specific sets to estimate the quality.

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain pe