METHODOLOGY ARTICLE
Open Access
Variable selection from a feature representing protein sequences: a case of classification on bacterial type IV secreted effectors
Jian Zhang1, Lixin Lv1, Donglei Lu1, Denan Kong2, Mohammed Abdoh Ali Al-Alashaari2 and Xudong Zhao2*
*Correspondence: [email protected]
2 College of Information and Computer Engineering, Northeast Forestry University, No. 26 Hexing Road, Harbin 150040, China
Full list of author information is available at the end of the article
Abstract
Background: Classification of proteins with specific functions is important for biological research. Encoding approaches that extract features from protein sequences play an important role in protein classification, and many computational methods (namely classifiers) are used to classify protein sequences under various encoding approaches. Commonly, protein sequences carry labels corresponding to different categories of biological function (e.g., bacterial type IV secreted effector or not), which makes protein prediction unrealistic. For protein prediction, a kernel set of protein sequences whose labels are certified by biological experiments should exist in advance. However, such a set is hardly ever seen in prevailing research. Therefore, unsupervised learning rather than supervised learning (e.g., classification) should be considered. For protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.
Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset of 1947 protein sequences as a case, we run experiments to identify bacterial type IV secreted effectors (T4SE); the dataset comprises 399 T4SE and 1548 non-T4SE sequences. Comparable, quantified results are obtained using only certain components of the encoded feature, i.e., the position-specific scoring matrix, which indicates the effectiveness of our method.
Conclusions: Certain variables, rather than the whole encoded feature they belong to, are effective for discriminating between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers achieve a better classification result.
Keywords: Feature selection, Variable importance, Accumulated scoring, Classification, Bacterial type IV secreted effectors
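The abstract does not spell out the proposed selection procedure, so the following is only a minimal sketch of the general idea of ranking the variables of an encoded feature by an accumulated score and keeping the top few. It assumes a Fisher-style score summed over bootstrap resamples as a stand-in, which is not the authors' method. The function names (`fisher_score`, `accumulated_scores`, `select_variables`, `make_toy_data`) and the synthetic data are hypothetical; the toy class ratio of 40 to 160 only loosely mimics the imbalance of 399 T4SE to 1548 non-T4SE.

```python
import random
import statistics


def fisher_score(pos, neg):
    """Squared difference of class means over the sum of class variances."""
    var_sum = statistics.pvariance(pos) + statistics.pvariance(neg)
    if var_sum == 0:
        return 0.0
    return (statistics.fmean(pos) - statistics.fmean(neg)) ** 2 / var_sum


def accumulated_scores(X, y, n_rounds=20, seed=0):
    """Sum per-variable Fisher scores over bootstrap resamples of the rows."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    totals = [0.0] * n_vars
    for _ in range(n_rounds):
        sample = rng.choices(range(len(X)), k=len(X))
        pos_rows = [X[i] for i in sample if y[i] == 1]
        neg_rows = [X[i] for i in sample if y[i] == 0]
        if len(pos_rows) < 2 or len(neg_rows) < 2:
            continue  # skip degenerate resamples
        for j in range(n_vars):
            totals[j] += fisher_score([r[j] for r in pos_rows],
                                      [r[j] for r in neg_rows])
    return totals


def select_variables(X, y, k, **kwargs):
    """Return the indices of the k variables with the largest accumulated score."""
    totals = accumulated_scores(X, y, **kwargs)
    ranked = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:k])


def make_toy_data(n_pos=40, n_neg=160, n_vars=10, informative=(2, 7), seed=1):
    """Hypothetical data: Gaussian noise, with a mean shift on the informative
    variables for the positive class only."""
    rng = random.Random(seed)
    X, y = [], []
    for label, count in ((1, n_pos), (0, n_neg)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
            if label == 1:
                for j in informative:
                    row[j] += 2.0
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
selected = select_variables(X, y, k=2)
print("selected variable indices:", selected)
```

In a real setting, each row of `X` would be the encoded feature of one protein sequence (for example, a flattened position-specific scoring matrix), and the selected columns would then be passed to whichever classifier is being evaluated.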
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article