Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes


RESEARCH ARTICLE

Open Access

Corentin Meyer, Nicolas Scalzitti, Anne Jeannin-Girardon, Pierre Collet, Olivier Poch and Julie D. Thompson*

*Correspondence: [email protected]

Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France

Abstract

Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.

Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.

Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.

Keywords: Genome annotation, Primates, Gene prediction, Protein sequence errors, Error correction
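The three error categories named in the Results (deletions, insertions and mismatched segments) can be illustrated with a minimal sketch. It assumes a precomputed pairwise alignment between a human reference protein and a predicted primate protein, given as two equal-length strings with `-` marking gaps. The function names and the `min_len` threshold are hypothetical choices for illustration. This is not the detection pipeline used in the article, which is not described in this excerpt.

```python
from itertools import groupby


def column_state(ref_char, pred_char):
    """Label one alignment column relative to the reference sequence."""
    if ref_char == "-" and pred_char == "-":
        return "skip"        # gap in both rows: carries no information
    if pred_char == "-":
        return "deletion"    # residue present in reference, missing in prediction
    if ref_char == "-":
        return "insertion"   # residue present in prediction only
    return "match" if ref_char == pred_char else "mismatch"


def classify_segments(ref_aln, pred_aln, min_len=3):
    """Return (kind, start, end) for each run of non-matching columns.

    Coordinates are alignment columns, end-exclusive. Runs shorter than
    min_len (an assumed, illustrative threshold) are ignored so that
    isolated substitutions are not reported as segment errors.
    """
    if len(ref_aln) != len(pred_aln):
        raise ValueError("aligned sequences must have equal length")
    states = [column_state(r, p) for r, p in zip(ref_aln, pred_aln)]
    segments = []
    pos = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state in ("deletion", "insertion", "mismatch") and length >= min_len:
            segments.append((state, pos, pos + length))
        pos += length
    return segments


# Toy alignment (invented sequences, not data from the study).
reference = "MKTAYIAKQRQISFVKSH---"
predicted = "MKTA----QRQWWWWKSHAAA"
print(classify_segments(reference, predicted))
```

On this toy pair the sketch reports one deletion, one mismatched segment and one insertion. A real analysis would also need to separate natural sequence variation from prediction errors, which a simple length threshold cannot do reliably.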

Background

An unprecedented number of genomes are being sequenced, offering a unique view of the specific characteristics of individual organisms and new opportunities to analyze life on a larger scale. An essential first step in the genome annotation process is

© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Common
