Visualizing the decision rules behind the ROC curves: understanding the classification process
Sonia Pérez-Fernández¹ · Pablo Martínez-Camblor² · Peter Filzmoser³ · Norberto Corral¹
Received: 30 September 2019 / Accepted: 14 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
The receiver operating characteristic (ROC) curve is a graphical method commonly used to study the capacity of continuous variables (markers) to properly classify subjects into one of two groups. The decision made is ultimately endorsed by a classification subset on the space where the marker is defined. In this paper, we study graphical representations and propose visual forms to reflect the classification rules that give rise to the construction of the ROC curve. On the one hand, we use static pictures to display the classification regions for univariate markers, which are especially convenient when there is no monotone relationship between the marker and the likelihood of belonging to one group. In those cases, there are two options to improve the classification accuracy: to allow for more flexibility in the classification rules (for example, considering two cutoff points instead of one) or to transform the marker by using a function whose resulting ROC curve is optimal. On the other hand, we propose to build videos for visualizing the collection of subsets when several markers are considered simultaneously. A compilation of techniques for finding a rule that maximizes the area under the ROC curve is included, with a focus on linear combinations. We present a tool for the R software which generates those graphics, and we apply it to one real dataset. The R code is provided as Supplementary Material.

Keywords
Area under the curve · Classification regions · Graphical animations · Multivariate marker · Receiver operating characteristic curve

The authors gratefully acknowledge support by Grant MTM2015-63971-P from the Spanish Ministerio de Economía y Competitividad, by FC-15-GRUPIN14-101 and Severo Ochoa Grant BP16118 from the Principado de Asturias, and by a grant from the Campus of International Excellence of the University of Oviedo (the last two for Pérez-Fernández).
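The abstract's remark about non-monotone markers can be made concrete with a small sketch. It is written in Python rather than with the authors' R tool, the toy data are hypothetical, and the Youden index (TPR minus FPR) is used here only as a convenient summary of a rule's quality; none of the function names below come from the paper. The sketch compares the best rule of the form "positive if marker > c" with the best rule that uses two cutoff points.

```python
from itertools import combinations

# Hypothetical toy data: cases sit in both tails, controls in the middle,
# so the marker is not monotonically related to group membership.
controls = [-1.0, -0.5, 0.0, 0.5, 1.0]
cases = [-3.0, -2.5, -2.0, 2.0, 2.5, 3.0]


def youden(rule, controls, cases):
    """Youden index (TPR - FPR) of a rule mapping a marker value to True/False."""
    tpr = sum(rule(x) for x in cases) / len(cases)
    fpr = sum(rule(x) for x in controls) / len(controls)
    return tpr - fpr


def best_one_cutoff(controls, cases):
    """Best Youden index over rules of the form 'positive if marker > c'."""
    grid = sorted(set(controls) | set(cases))
    return max(youden(lambda x, c=c: x > c, controls, cases) for c in grid)


def best_two_cutoffs(controls, cases):
    """Best Youden index over rules 'positive if marker <= lo or marker > hi'."""
    grid = sorted(set(controls) | set(cases))
    return max(
        youden(lambda x, lo=lo, hi=hi: x <= lo or x > hi, controls, cases)
        for lo, hi in combinations(grid, 2)  # sorted grid guarantees lo < hi
    )


one = best_one_cutoff(controls, cases)
two = best_two_cutoffs(controls, cases)
print(one, two)
```

On this toy sample, a single cutoff can capture only the cases in one tail, whereas the two-cutoff rule with lo = -2 and hi = 1 separates the groups completely. Real markers are rarely this clean; the sketch only illustrates why a more flexible family of classification regions can help.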
Electronic supplementary material
The online version of this article (https://doi.org/10.1007/s10182-020-00385-2) contains supplementary material, which is available to authorized users.

Corresponding author: Sonia Pérez-Fernández
1 Introduction
As a supervised learning technique, classification is a statistical method whose final objective is to build a grouping rule based on one or several markers collected in a training dataset where the response variable is also known. With that rule, new subjects can be classified on the basis of their marker values (Nielsen et al. 2009). Achieving good classification is important in many fields such as medical diagnosis, machine learning, data mining or