Understanding the causes of errors in eukaryotic protein-coding gene prediction: a case study of primate proteomes
RESEARCH ARTICLE
Open Access
Corentin Meyer, Nicolas Scalzitti, Anne Jeannin-Girardon, Pierre Collet, Olivier Poch and Julie D. Thompson*
*Correspondence: [email protected]
Department of Computer Science, ICube, CNRS, University of Strasbourg, Strasbourg, France
Abstract
Background: Recent advances in sequencing technologies have led to an explosion in the number of genomes available, but accurate genome annotation remains a major challenge. The prediction of protein-coding genes in eukaryotic genomes is especially problematic, due to their complex exon–intron structures. Even the best eukaryotic gene prediction algorithms can make serious errors that will significantly affect subsequent analyses.
Results: We first investigated the prevalence of gene prediction errors in a large set of 176,478 proteins from ten primate proteomes available in public databases. Using the well-studied human proteins as a reference, a total of 82,305 potential errors were detected, including 44,001 deletions, 27,289 insertions and 11,015 mismatched segments where part of the correct protein sequence is replaced with an alternative erroneous sequence. We then focused on the mismatched sequence errors that cause particular problems for downstream applications. A detailed characterization allowed us to identify the potential causes for the gene misprediction in approximately half (5,446) of these cases. As a proof-of-concept, we also developed a simple method which allowed us to propose improved sequences for 603 primate proteins.
Conclusions: Gene prediction errors in primate proteomes affect up to 50% of the sequences. Major causes of errors include undetermined genome regions, genome sequencing or assembly issues, and limitations in the models used to represent gene exon–intron structures. Nevertheless, existing genome sequences can still be exploited to improve protein sequence quality. Perspectives of the work include the characterization of other types of gene prediction errors, as well as the development of a more comprehensive algorithm for protein sequence error correction.
Keywords: Genome annotation, Primates, Gene prediction, Protein sequence errors, Error correction
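The three error categories named in the Results (deletions, insertions and mismatched segments relative to a human reference) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the reference and predicted proteins have already been aligned by some external tool and are supplied as equal-length strings with `-` marking gaps. The function names, the `min_len` parameter and the toy sequences are hypothetical and chosen only for illustration.

```python
from itertools import groupby


def classify_column(ref: str, pred: str) -> str:
    """Label one alignment column relative to the reference residue."""
    if ref == "-" and pred == "-":
        return "skip"          # gap in both rows carries no information
    if pred == "-":
        return "deletion"      # reference residue missing from the prediction
    if ref == "-":
        return "insertion"     # predicted residue absent from the reference
    return "match" if ref == pred else "mismatch"


def find_segments(ref_aln: str, pred_aln: str, min_len: int = 1):
    """Return (kind, start, end) for each run of non-matching columns.

    Coordinates are 0-based alignment columns, end-exclusive.
    `min_len` is a hypothetical filter for ignoring very short runs.
    """
    if len(ref_aln) != len(pred_aln):
        raise ValueError("aligned sequences must have equal length")
    labels = [classify_column(r, p) for r, p in zip(ref_aln, pred_aln)]
    segments = []
    pos = 0
    for kind, run in groupby(labels):
        length = len(list(run))
        if kind not in ("match", "skip") and length >= min_len:
            segments.append((kind, pos, pos + length))
        pos += length
    return segments


# Toy, invented alignment: reference on top, predicted protein below.
ref_aln = "MKTAYIAKQRQISFVK--SHFSRQ"
pred_aln = "MKTA---KQRWWWWVKGGSHFSRQ"
print(find_segments(ref_aln, pred_aln))
# → [('deletion', 4, 7), ('mismatch', 10, 14), ('insertion', 16, 18)]
```

In real data, a mismatched segment would rarely be a clean run of substitutions, since chance identities interrupt it, and isolated substitutions often reflect genuine divergence between species rather than prediction errors. A practical detector would therefore need a windowed or score-based criterion rather than the exact-run grouping used here.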
Background
An unprecedented number of genomes are being sequenced, offering a unique view of the specific characteristics of individual organisms and new opportunities to analyze life on a larger scale. An essential first step in the genome annotation process is
© The Author(s) 2020. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Common