Improvement the performance of the classification models of Cyclooxygenase‑2 inhibitors using undersampling methods based on the rivality and reliability indexes
Irene Luque Ruiz1 · Miguel Ángel Gómez‑Nieto1
Received: 4 August 2020 / Accepted: 19 September 2020
© Springer Nature Switzerland AG 2020
Abstract
Undersampling, prototype selection and instance selection techniques remove redundant and noisy molecules from datasets in order to reduce the computational cost and memory requirements of building QSAR models while maintaining model performance. In this paper, we describe an undersampling technique based on the rivality and reliability indexes and apply it to the building of classification models for two Cyclooxygenase-2 inhibitor datasets. In a preprocessing stage, the datasets are analyzed and curated, and classification models are built using the Support Vector Machine, Random Forest and Rivality Index Neighborhood algorithms. The results clearly improve on those reported in the literature for these datasets, for both the training models and the external validations carried out. Matthews Correlation Coefficient values higher than 0.9 for the training models and the external validations demonstrate the high robustness of the models generated with the undersampling technique, which is capable of reducing the datasets by more than 80%.
Keywords Undersampling technique · QSAR · Classification algorithms · RINH algorithm · Rivality index · Reliability index
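For readers unfamiliar with the metric quoted in the abstract, the Matthews Correlation Coefficient (MCC) of a binary classifier can be computed from the four cells of its confusion matrix. The following is a minimal sketch of the standard definition, not code from the paper; the function names and the toy labels are illustrative assumptions.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary problem (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard MCC: (TP*TN - FP*FN) / sqrt of the product of the marginals."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is taken as 0 when any marginal sum is zero.
    return numerator / denominator if denominator else 0.0


# Toy labels (assumed for illustration): 1 = active, 0 = inactive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))
print(mcc)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it informative when the active and inactive classes are unbalanced.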
* Irene Luque Ruiz [email protected]
Miguel Ángel Gómez‑Nieto [email protected]
1 Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, 14071 Córdoba, Spain
Journal of Mathematical Chemistry
1 Introduction
The development of quantitative structure–activity relationship (QSAR) classification models requires a preprocessing stage that prepares the dataset for the correct, reliable, interpretable and reproducible construction of a mathematical model capable of predicting the activity of the molecules of a dataset [1–3]. This preprocessing stage is part of the curation of the dataset, in which the correctness and the dimensionality of the data representation are analyzed.
Data dimensionality can be seen from two points of view: (1) a large number of features used to describe each molecule, and (2) a large number of molecules in the training set [4–9]. High data dimensionality leads to models that are difficult to interpret. Consequently, feature selection techniques have been proposed to reduce the number of variables representing the structural characteristics of the molecules of the dataset while maintaining the performance of the classification models.
Correctness of the data concerns the values of the features and the activity of the molecules of the dataset. Variables or molecules that are erroneous, nonexistent, redundant, etc., should be removed from the representation space of the dataset. In addition, a large numb
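The curation step described in the introduction (removing erroneous, nonexistent or redundant variables and molecules) can be illustrated with a small sketch. This is a generic example under stated assumptions, not the authors' procedure: the function name `curate` is hypothetical, a nonexistent value is assumed to be encoded as `None`, and redundancy is assumed to mean exact duplication of a molecule or of a descriptor column.

```python
def curate(rows, labels):
    """Generic curation sketch (hypothetical helper, not the paper's method).

    rows   : list of descriptor vectors, one per molecule (None = missing value)
    labels : list of class labels, one per molecule (None = missing activity)
    Returns the curated rows, their labels, and the indices of kept columns.
    """
    # Step 1: discard molecules with a missing descriptor value or activity.
    complete = [(tuple(r), y) for r, y in zip(rows, labels)
                if y is not None and all(v is not None for v in r)]

    # Step 2: discard molecules that repeat an earlier (descriptors, label) pair.
    seen, unique = set(), []
    for item in complete:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    if not unique:
        return [], [], []

    # Step 3: keep a descriptor only if it varies across molecules and is
    # not an exact copy of a descriptor that was already kept.
    columns = list(zip(*(r for r, _ in unique)))
    kept_cols, seen_cols = [], set()
    for j, col in enumerate(columns):
        if len(set(col)) > 1 and col not in seen_cols:
            seen_cols.add(col)
            kept_cols.append(j)

    new_rows = [[r[j] for j in kept_cols] for r, _ in unique]
    new_labels = [y for _, y in unique]
    return new_rows, new_labels, kept_cols


# Toy data (assumed for illustration): 5 molecules, 4 descriptors.
rows = [
    [1.0, 0.5, 1.0, 7.0],
    [2.0, 0.5, 2.0, 3.0],
    [1.0, 0.5, 1.0, 7.0],   # exact duplicate of the first molecule
    [3.0, 0.5, None, 1.0],  # missing descriptor value
    [4.0, 0.5, 4.0, 2.0],
]
labels = [1, 0, 1, 0, 1]
curated_rows, curated_labels, kept = curate(rows, labels)
print(curated_rows, curated_labels, kept)
```

The sketch deliberately leaves out cases that need domain judgment, such as identical descriptor vectors with conflicting activities or highly correlated but non-identical descriptors.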