Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions



Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions

Raquel Rodríguez‑Pérez1 · Jürgen Bajorath1

Received: 6 March 2020 / Accepted: 24 April 2020
© The Author(s) 2020

Abstract

Difficulties in interpreting machine learning (ML) models and their predictions limit the practical applicability of and confidence in ML in pharmaceutical research. There is a need for agnostic approaches that aid in the interpretation of ML models regardless of their complexity and that are also applicable to deep neural network (DNN) architectures and model ensembles. To these ends, the SHapley Additive exPlanations (SHAP) methodology has recently been introduced. The SHAP approach enables the identification and prioritization of features that determine compound classification and activity prediction using any ML model. Herein, we further extend the evaluation of the SHAP methodology by investigating a variant for the exact calculation of Shapley values for decision tree methods and systematically comparing this variant with the model-independent SHAP method in compound activity and potency value predictions. Moreover, new applications of the SHAP analysis approach are presented, including the interpretation of DNN models for the generation of multi-target activity profiles and of ensemble regression models for potency prediction.

Keywords: Machine learning · Black box character · Structure–activity relationships · Compound activity · Compound potency prediction · Multi-target modeling · Model interpretation · Feature importance · Shapley values
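To illustrate the comparison between exact Shapley value calculation for tree methods and the model-agnostic SHAP variant described above, a minimal sketch using the open-source shap Python package is shown below. The random forest regressor, the synthetic fingerprint-like data, and all variable names are illustrative assumptions and not part of the original study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import shap  # SHapley Additive exPlanations library

# Hypothetical data set: 200 compounds encoded as 64-bit binary
# fingerprint-like vectors with synthetic potency values.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 64)).astype(float)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Exact Shapley values for tree ensembles (tree-based SHAP variant).
tree_explainer = shap.TreeExplainer(model)
tree_shap = tree_explainer.shap_values(X[:10])

# Model-agnostic approximation (kernel SHAP) with a sampled background set.
kernel_explainer = shap.KernelExplainer(model.predict, shap.sample(X, 50))
kernel_shap = kernel_explainer.shap_values(X[:10])

# Mean absolute difference between the two sets of feature contributions.
print(np.abs(tree_shap - kernel_shap).mean())
```

The tree-based variant computes Shapley values exactly by exploiting the tree structure, whereas the kernel-based variant approximates them from sampled feature coalitions, which motivates comparing the two for decision tree methods.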

* Jürgen Bajorath
[email protected]‑bonn.de

1 Department of Life Science Informatics, B‑IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, 53115 Bonn, Germany

Introduction

Major tasks for machine learning (ML) in chemoinformatics and medicinal chemistry include predicting new bioactive small molecules or the potency of active compounds [1–4]. Typically, such predictions are carried out on the basis of molecular structure, more specifically, using computational descriptors calculated from molecular graph representations or conformations. For activity prediction, ML models are trained to systematically associate structural patterns, represented in more or less abstract forms, with known biological activities of small molecules. Classification models are derived for predicting class labels of test compounds (e.g., active/inactive or highly/weakly potent), whereas regression models predict numerical potency values. Supervised ML can also be applied to predict other molecular properties.
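As a concrete illustration of this workflow, the sketch below encodes a few compounds as Morgan fingerprints with RDKit and trains both a classification and a regression model with scikit-learn. The SMILES strings, activity labels, and potency values are toy placeholders chosen only to keep the example self-contained; they do not correspond to any data set used in this work.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Toy compounds (placeholder SMILES) with invented labels and potencies.
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1",
          "CCN(CC)CC", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "c1ccncc1"]
labels = [0, 1, 1, 0, 1, 0]               # active (1) / inactive (0)
potency = [4.2, 6.8, 7.1, 4.5, 7.9, 5.0]  # e.g., pIC50-like values

def fingerprint(smi, n_bits=1024):
    """Encode a molecule as a binary Morgan fingerprint (radius 2)."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return np.array(list(fp), dtype=float)

X = np.array([fingerprint(s) for s in smiles])

# Classification model: predicts class labels (active/inactive).
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# Regression model: predicts numerical potency values.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, potency)

print(clf.predict(X[:2]), reg.predict(X[:2]))
```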

Understanding model decisions is generally relevant for assessing the consistency of predictions and for detecting potential sources of model bias. Interpretability is also crucial for extracting knowledge from modeling efforts. Accordingly, there is high interest in better understanding the basis of correct ML predictions or failures [5–9]. For example, in structure–activity relat