Deep learning model with ensemble techniques to compute the secondary structure of proteins


Rayed AlGhamdi1 · Azra Aziz2 · Mohammed Alshehri3 · Kamal Raj Pardasani4 · Tarique Aziz5
Accepted: 18 October 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Protein secondary structure is the local conformation assigned to a protein sequence with the help of its three-dimensional structure. Assigning the local conformation to protein sequences is computationally demanding. There is a vast literature on protein secondary structure prediction (more than 20 techniques), but to date none of the existing techniques is entirely accurate. There is therefore considerable room for new models that improve prediction accuracy. In the present study, ensemble techniques, namely AdaBoost- and Bagging-based deep learning models, are proposed to predict protein secondary structure. Data from the standard datasets CB513, RS126, PTOP742, PSA472, and MANESH have been used for training and testing. These standard datasets possess less than 25% redundancy. The model is evaluated using the following performance measures: Q8 and Q3 cross-validation accuracy, class precision, class recall, kappa factor, and a blind test, i.e., testing on a dataset that is not used for training. The ensembling technique, used along with variability in datasets, can remove the bias of each dataset by balancing it and making the features more distinguishable, leading to improved accuracy compared with conventional and existing techniques. The proposed model shows an average improvement of ~2% and ~3% in Q8 and Q3 accuracy over the existing methods in a blind test.

Keywords  Protein secondary structure prediction · Ensemble techniques · AdaBoost · Bagging · Machine learning · Deep learning · Supervised learning
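The performance measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It rests on two assumptions that the text above does not confirm: the 8-state labels follow the DSSP alphabet and are reduced to 3 states by the common H/G/I → H, E/B → E, rest → C convention, and the "kappa factor" is Cohen's kappa. All function names are hypothetical.

```python
from collections import Counter

# ASSUMPTION: a common DSSP 8-state to 3-state reduction; the paper's own
# mapping is not stated in this section.
Q8_TO_Q3 = {"H": "H", "G": "H", "I": "H",
            "E": "E", "B": "E",
            "T": "C", "S": "C", "L": "C"}


def reduce_to_q3(labels):
    """Map a string of 8-state labels to 3-state labels."""
    return "".join(Q8_TO_Q3[s] for s in labels)


def accuracy(true, pred):
    """Fraction of residues whose predicted state equals the true state."""
    if len(true) != len(pred):
        raise ValueError("sequences must have equal length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def precision_recall(true, pred, cls):
    """Per-class precision and recall for the state `cls`."""
    tp = sum(t == cls and p == cls for t, p in zip(true, pred))
    predicted = sum(p == cls for p in pred)
    actual = sum(t == cls for t in true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def cohen_kappa(true, pred):
    """Cohen's kappa: agreement corrected for chance agreement.

    ASSUMPTION: this is what the abstract calls the kappa factor.
    """
    n = len(true)
    p_observed = accuracy(true, pred)
    true_counts, pred_counts = Counter(true), Counter(pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:  # both sequences consist of one identical state
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    true8 = "HHHGEEBTSL"
    pred8 = "HHGGEEETTL"
    print("Q8 accuracy:", accuracy(true8, pred8))
    print("Q3 accuracy:", accuracy(reduce_to_q3(true8), reduce_to_q3(pred8)))
    print("Q8 kappa:", cohen_kappa(true8, pred8))
    print("H precision/recall:", precision_recall(true8, pred8, "H"))
```

In the toy example every Q8 error falls inside a single 3-state class, so the Q3 accuracy is higher than the Q8 accuracy. This is why the two figures are reported separately.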

* Rayed AlGhamdi [email protected] Extended author information available on the last page of the article


1 Introduction
Knowledge of the secondary structure of proteins is crucial for understanding and assessing the function of genes, protein–protein interactions, and various other molecular mechanisms involved in the health and clinical state of an organism. Predicting the secondary structure directly from the primary protein sequence is still a tedious task. However, using the available knowledge of the three-dimensional protein structure obtained by X-ray crystallography, the local conformation of proteins can be defined [1]. The features obtained for a protein can then be fed into machine learning algorithms to learn a rule that estimates its secondary structure. Various types of parameters, such as physicochemical, topological, and geometrical parameters, have been employed in past work and in the proposed model to predict the secondary structure of proteins. Every parameter has an individual contribution in the structure of protein