ORIGINAL ARTICLE

Predicting subcellular localization of multi-location proteins by improving support vector machines with an adaptive-decision scheme

Shibiao Wan¹ • Man-Wai Mak¹

Received: 25 March 2015 / Accepted: 3 November 2015
© Springer-Verlag Berlin Heidelberg 2015

Abstract From the perspective of machine learning, predicting subcellular localization of multi-location proteins is a multi-label classification problem. Conventional multi-label classifiers typically compare pattern-matching scores with a fixed decision threshold to determine the number of subcellular locations in which a protein will reside. This simple strategy, however, can easily lead to over-prediction because it admits a large number of false positives. To address this problem, this paper proposes a more powerful multi-label predictor, namely AD–SVM, which incorporates an adaptive-decision (AD) scheme into multi-label support vector machine (SVM) classifiers. Specifically, given a query protein, a term-frequency based gene ontology vector is constructed by successively searching the gene ontology annotation database. Subsequently, the feature vector is classified by AD–SVM, which extends the binary relevance method with an adaptive-decision scheme that essentially converts the linear SVMs to piecewise linear SVMs. Experimental results suggest that AD–SVM outperforms existing state-of-the-art multi-location predictors by at least 4% (absolute) for a stringent virus dataset and 1% (absolute) for a stringent plant dataset. Results also show that the adaptive-decision scheme can effectively reduce over-prediction while having an insignificant effect on the correctly predicted ones.
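To make the contrast between a fixed threshold and an adaptive decision concrete, the sketch below compares the two on one set of per-location SVM scores. It is only an illustration under stated assumptions: this part of the text does not give the exact decision rule, so the cutoff used here (a fraction `theta` of the largest score, capped at 1.0) and the names `fixed_threshold_decision`, `adaptive_decision` and `theta` are hypothetical choices, not necessarily the formulation used in AD–SVM.

```python
def fixed_threshold_decision(scores, threshold=0.0):
    """Binary relevance with a fixed threshold: keep every location
    whose score exceeds `threshold`."""
    predicted = [m for m, s in enumerate(scores) if s > threshold]
    if not predicted:
        # Fall back to the top-scoring location so that every
        # protein is assigned at least one location.
        predicted = [scores.index(max(scores))]
    return sorted(predicted)


def adaptive_decision(scores, theta=0.5):
    """Assumed adaptive rule: the cutoff depends on the query's own
    maximum score instead of being the same for all queries."""
    s_max = max(scores)
    if s_max <= 0:
        # No positive score: predict only the top-scoring location.
        return [scores.index(s_max)]
    # Hypothetical cutoff: a fraction of the best score, capped at 1.0.
    cutoff = min(1.0, theta * s_max)
    return sorted(m for m, s in enumerate(scores) if s >= cutoff)


# Made-up scores of one query protein for four candidate locations.
scores = [2.0, 0.3, -0.5, 1.2]
print(fixed_threshold_decision(scores))      # [0, 1, 3]
print(adaptive_decision(scores, theta=0.5))  # [0, 3]
```

In this toy case the weakly positive score 0.3 passes the fixed threshold but not the adaptive cutoff, which is the kind of over-prediction the abstract describes. Setting `theta` to 0 makes the assumed rule behave almost like the fixed-threshold one (it differs only for scores exactly equal to zero).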

Shibiao Wan [email protected]
Man-Wai Mak [email protected]

¹ Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR, China

Keywords Adaptive decisions · Multi-label classification · Protein subcellular localization · Support vector machines

1 Introduction

Conventionally, predicting where a protein resides within a cell is treated as a single-label classification problem, in which each protein is assumed to be associated with only one of the known subcellular locations. These approaches are generally divided into two categories: (1) sequence-based methods, such as amino-acid composition methods [6, 40, 74], sorting-signal methods [15, 39, 41] and homology-based methods [31, 35]; and (2) knowledge-based methods, such as gene ontology (GO)¹ based methods [8, 9, 57, 58], PubMed-abstract-based methods [5, 18] and Swiss-Prot keyword-based methods [30, 38]. The focus on predicting single-location proteins is driven by the large amount of data available in public databases such as UniProt, where a majority of proteins are assigned to a single location. However, it is untenable to exclude the multi-location proteins or assume that multi-location proteins do not exist, because recent studies [16, 33, 37, 73] show that there exist multi-location proteins that can simultaneously reside at, or move between, two o