Protein class prediction based on Count Vectorizer and long short term memory



ORIGINAL RESEARCH

S. R. Mani Sekhar1 • G. M. Siddesh1 • Mithun Raj1 • Sunilkumar S. Manvi2

Received: 24 December 2019 / Accepted: 28 September 2020 © Bharati Vidyapeeth’s Institute of Computer Applications and Management 2020

Abstract Protein class and function prediction is one of the most significant tasks in computational bioinformatics. Information about protein function and class plays a vital role in understanding biological cells and has a great impact on human life in areas such as personalized medicine. Technical advances in biology and in the understanding of biological processes have revealed the features and characteristics of important proteins. Prediction from an amino acid sequence involves predicting how the sequence folds and which structures it forms from the primary sequence. In this work, machine learning algorithms are applied to protein class prediction. The method considers biologically significant macromolecules, focuses on understanding the different protein families, and then classifies each sequence by protein family type. This is achieved with Naive Bayes (NB) and Random Forest (RF) classifiers on count-vectorized features, and with a long short-term memory (LSTM) network. These algorithms classify the protein family from the protein sequence. The results show that LSTM predicts the protein class more accurately than RF and NB: LSTM achieves an accuracy of 96%, whereas RF and NB achieve 91% and 86%, respectively.

Corresponding author: S. R. Mani Sekhar [email protected]

1 Department of Information Science and Engineering, Ramaiah Institute of Technology, Bengaluru, India

2 School of Computing and Information Technology, REVA University, Bengaluru, Karnataka, India

Keywords Protein · Protein–protein interactions · Naïve Bayes · Features · Random forest · Machine learning · LSTM
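The count-vectorized Naive Bayes baseline named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the overlapping k-mer tokenization with `k=3`, the Laplace smoothing parameter `alpha`, the `kmer_counts` helper, the `MultinomialNB` class, and the toy sequences and family labels are all assumptions made for the example. The RF and LSTM models are not shown.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a sequence (a count-vectorized feature map)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


class MultinomialNB:
    """Small multinomial Naive Bayes over k-mer counts with Laplace smoothing."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k          # assumed k-mer length, not taken from the paper
        self.alpha = alpha  # assumed smoothing strength

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for seq, label in zip(sequences, labels):
            counts = kmer_counts(seq, self.k)
            self.feature_counts[label].update(counts)
            self.vocab.update(counts)
        # Total number of k-mer occurrences observed per class.
        self.totals = {c: sum(fc.values()) for c, fc in self.feature_counts.items()}
        return self

    def predict(self, sequence):
        counts = kmer_counts(sequence, self.k)
        n_samples = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, class_count in self.class_counts.items():
            score = math.log(class_count / n_samples)  # log prior
            denom = self.totals[label] + self.alpha * vocab_size
            for kmer, count in counts.items():
                if kmer not in self.vocab:
                    continue  # ignore k-mers never seen during training
                num = self.feature_counts[label][kmer] + self.alpha
                score += count * math.log(num / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: made-up strings, not real protein families.
train_sequences = ["AAAAGAAAA", "AAGAAAAGA", "LLLLKLLLL", "LKLLLLKLL"]
train_labels = ["family_A", "family_A", "family_L", "family_L"]

model = MultinomialNB(k=3).fit(train_sequences, train_labels)
prediction = model.predict("AAAGAAA")
print(prediction)
```

In practice a library vectorizer and classifier would likely replace the hand-written classes; the sketch only shows how sequence fragments become counts and how those counts drive a class decision.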

1 Introduction

All living organisms are composed of cells, and proteins play a major role in how cells function because of their importance in biological activity, so it is important to understand protein functionality. Knowledge of proteins and their functions helps explain how biological activities are activated at the molecular level. This understanding helps in the development of personalized medicine, crop improvement, and therapeutic interventions, and it also supports understanding the technical aspects of biological entities and computer systems. At the same time, the number of proteins with unidentified functions is growing overwhelmingly. Under these circumstances, it is very difficult to manually identify and predict protein functionality and to group proteins into protein families. Many methods have been proposed to characterize protein functionality and to predict essential proteins. These techniques are based on fundamental information about proteins that may depend on their amino acid sequence and also using too