A New Subcellular Localization Predictor for Human Proteins Considering the Correlation of Annotation Features and Prote



1 Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Ministry of Education of China, Shanghai, China
{zhouhang2,hbshen}@sjtu.edu.cn
Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China
[email protected]

Abstract. Identifying a protein’s subcellular localization is important for understanding the protein’s function. Because experimental methods for identifying subcellular localization are time-consuming, computational approaches are needed to handle the large number of proteins whose locations are unknown. Most current predictors use annotation-based features, but few of them take the correlation among those features into account. Moreover, most predictors can only deal with single-location proteins, although many proteins have multiple locations, a property that plays an important role in many biological processes. In this paper, we propose a novel prediction method that extracts features from prior biological knowledge by considering the correlation between annotation terms. The new method can also handle the multi-localization problem. We compared the performance of the proposed method with that of other predictors on four datasets. The results show that our method outperforms the others.

Keywords: Subcellular localization · Multi-label · Correlation · Gene Ontology

1 Introduction

Information on protein subcellular localization is crucial for understanding the molecular function and related biological processes of proteins. Since identifying a protein’s cellular compartment by biological experiments is labor-intensive and time-consuming, in-silico tools for predicting locations are needed to handle large-scale data sets of proteins with unknown locations. According to the SWISS-PROT knowledgebase [1] released in January 2012, among the total of 534242 proteins, only 66203 have defined subcellular localization annotations, while 247504 have uncertain location annotations. Machine learning-based computational tools, which automatically predict the locations of unannotated proteins by exploiting the available subcellular location annotations, have been developed extensively over the last decade. Moreover, as protein sequences and various annotation data grow rapidly in public databases, more information is available to computational tools for more precise predictions, especially for difficult cases such as locations with very few known examples or proteins with multiple locations.

H. Zhou et al.
© Springer Nature Singapore Pte Ltd. 2016. T. Tan et al. (Eds.): CCPR 2016, Part II, CCIS 663, pp. 499–512, 2016. DOI: 10.1007/978-981-10-3005-5_41

Computational prediction methods mainly rely on two types of features: annotation-based and sequence-based. Sequence-based features include amino acid composition [2, 3], amino acid pair [4, 5], pseudo-amino