A Survey of Computational Methods for Protein Function Prediction


Abstract

Rapid advances in high-throughput genome sequencing technologies have resulted in millions of protein-encoding gene sequences with no functional characterization. Automated protein function annotation or prediction is a prime problem for computational methods to tackle in the post-genomic era of big molecular data. While recent community-driven experiments demonstrate that the accuracy of function prediction methods has significantly improved, challenges remain. These challenges relate to the different sources of data exploited to predict function, as well as to the different choices made in representing and integrating heterogeneous data. Current methods predict function from a protein’s sequence, often in the context of evolutionary relationships; from a protein’s three-dimensional structure or specific patterns in the structure; from neighbors in a protein–protein interaction network; from microarray data; or from a combination of these different types of data. Here we review these methods and the state of protein function prediction, emphasizing recent algorithmic developments, remaining challenges, and prospects for future research.

Keywords Computational biology • Protein function prediction • Algorithms • Machine learning • Homology

A. Shehu
Department of Computer Science, Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA
e-mail: [email protected]

D. Barbará
Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
e-mail: [email protected]

K. Molloy
LAAS-CNRS, 7, avenue du Colonel Roche, 31077 Toulouse, France
e-mail: [email protected]

© Springer International Publishing Switzerland 2016
K.-C. Wong (ed.), Big Data Analytics in Genomics, DOI 10.1007/978-3-319-41279-5_7


1 Introduction

Molecular biology now finds itself in the era of big data. The field's focus on high-throughput, automated wet-laboratory protocols has resulted in a vast amount of gene sequence, expression, interaction, and protein structure data [212]. In particular, because whole genomes can be sequenced at an increasingly fast pace, we are now faced with millions of protein products for which no functional information is readily available [39, 198]. The December 2015 release of the Universal Protein (UniProt) database [68] contains a little over 55.2 million sequences, less than 1% of which have reliable and detailed annotations. The gap between unannotated and annotated gene/protein sequences has exceeded two orders of magnitude. Fundamental information is currently missing for 40% of the protein sequences deposited in the National Center for Biotechnology Information (NCBI) database; around 32% of the protein sequences in the comprehensive UniProtKB database are currently labeled “unknown.” The missing information includes coarse-grained, low-resolution information, such as where protein products are expressed, meta-resolution information, such as which chemical pathways proteins participate in within the living cell, and high-resolution information, such as what