Improving protein fold recognition by random forest


PROCEEDINGS

Open Access

Taeho Jo, Jianlin Cheng*

From 11th Annual MCBIOS Conference, Stillwater, OK, USA. 6-8 March 2014

Abstract

Background: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be framed as a binary classification problem, namely predicting whether or not the unknown fold of a target protein is similar to a known template protein structure in a library, machine learning methods have been applied effectively to it. In this work, we developed RF-Fold, which uses random forest, one of the most powerful and scalable machine learning classification methods, to recognize protein folds.

Results: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold through cross-validation on the standard Lindahl’s benchmark dataset, which comprises 976 × 975 target-template protein pairs. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance at every level of difficulty: the easiest family level, the medium-hard superfamily level, and the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at family, superfamily, and fold levels, respectively.

Conclusions: The good performance achieved by RF-Fold demonstrates the effectiveness of random forest for protein fold recognition.
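The top-one and top-five recognition rates quoted in the abstract can be illustrated with a minimal sketch. This is not the RF-Fold evaluation code: the function name `top_k_recognition_rate`, the tuple layout, and the toy scores are all assumptions made for illustration. The sketch also assumes that only targets with at least one correct template at the given level are counted.

```python
from collections import defaultdict


def top_k_recognition_rate(pair_scores, k):
    """Fraction of targets with a correct template among their k top-scored templates.

    pair_scores: iterable of (target, template, score, is_match) tuples, where
    score is a classifier's confidence that the pair shares a fold (hypothetical
    layout) and is_match is the true label.
    """
    by_target = defaultdict(list)
    for target, _template, score, is_match in pair_scores:
        by_target[target].append((score, is_match))

    evaluable = 0  # targets that have at least one correct template
    hits = 0       # evaluable targets with a correct template in the top k
    for scored in by_target.values():
        if not any(is_match for _, is_match in scored):
            continue  # assumption: skip targets with no correct template
        evaluable += 1
        ranked = sorted(scored, key=lambda s: s[0], reverse=True)
        if any(is_match for _, is_match in ranked[:k]):
            hits += 1
    return hits / evaluable if evaluable else 0.0


# Toy data (made up): two targets scored against three templates each.
pairs = [
    ("t1", "a", 0.9, True), ("t1", "b", 0.4, False), ("t1", "c", 0.1, False),
    ("t2", "a", 0.8, False), ("t2", "b", 0.7, True), ("t2", "c", 0.2, False),
]
print(top_k_recognition_rate(pairs, 1))  # t1 is a hit, t2 is a miss
print(top_k_recognition_rate(pairs, 2))  # both targets are hits
```

In this toy case the rate rises from 0.5 at k = 1 to 1.0 at k = 2, mirroring how the reported rates rise from top-one to top-five ranking.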

Background

Proteins are the fundamental functional units in living systems. Protein tertiary (three-dimensional) structures at the molecular level are necessary to understand the functions of proteins. However, due to the significant cost of experimentally determining the tertiary structures of proteins, the number of known 3D protein structures is about 200 times smaller than the number of known protein sequences [1,2]. Therefore, it is important to develop computational methods to predict protein structures from protein sequences [3]. Recognizing a known structure that is similar to the unknown structure (i.e. fold recognition) is an important step of the template-based protein structure modeling approach that uses the known structure as a template to construct a structural model for the target protein [4,5]. Since the number of unique protein structures appears to be limited (e.g., several thousand) according to the structural analysis on all the tertiary protein

* Correspondence: [email protected]
Department of Computer Science, Informatics Institute, C. Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA
