Evaluation of Diagnostic Tests: Measuring Degree of Agreement and Beyond
0092-8615/2001 Copyright © 2001 Drug Information Association Inc.

EVALUATION OF DIAGNOSTIC TESTS: MEASURING DEGREE OF AGREEMENT AND BEYOND*

T. S. WENG, PhD

Food and Drug Administration, Center for Devices and Radiological Health, Office of Surveillance and Biometrics, Division of Biostatistics, Rockville, Maryland
When evaluating a new diagnostic test against a less than perfect "gold standard," the kappa coefficient of agreement κ is often inappropriately used as a measure of "diagnostic accuracy," which frequently leads to paradoxical findings. In this paper, κ is expressed as a function of disease prevalence and diagnostic accuracy (subject to Youden's index > 0), whereby necessary and sufficient conditions, given the accuracy rates, are derived to aid in locating the maximizer of κ. Paradoxical behavior of κ can thus be detected in the light of diagnostic accuracy. Attempts are made to clarify the subtle difference between "diagnostic accuracy" and "diagnostic reliability." The implication of this difference is then assessed from a regulatory perspective. In order to extend the idea of κ beyond its originally intended use, the maximum likelihood method, coupled with the Expectation-Maximization algorithm, is proposed as a remedial option, not for measuring diagnostic agreement or reliability but, rather, for evaluating diagnostic accuracy. Some illustrative examples adapted from published data are provided.

Key Words: Diagnosis; Kappa; Agreement; Prevalence; Accuracy
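As a rough, non-authoritative sketch of the estimation strategy named above (maximum likelihood via the Expectation-Maximization algorithm, used to estimate diagnostic accuracy without a gold standard), the Python snippet below fits a simple two-test latent-class model under conditional independence, using a two-population (Hui-Walter-type) design so that the parameters are identifiable. The function name, the two-population setup, the starting values, and the example counts are assumptions introduced here for illustration; the paper's own model and data may differ.

```python
def em_two_tests(counts, n_iter=1000, tol=1e-10):
    """Hedged illustration only: EM for a latent-class model of two binary
    diagnostic tests with no gold standard (Hui-Walter-type design).

    counts[g][i][j] = number of subjects in population g whose Test 1 result
    is i and Test 2 result is j (1 = positive, 0 = negative).
    Assumes the two tests are conditionally independent given true disease
    status; returns per-population prevalences and the two tests'
    sensitivities and specificities.
    """
    G = len(counts)
    prev = [0.2 + 0.5 * g / max(G - 1, 1) for g in range(G)]  # arbitrary starts
    se = [0.8, 0.7]   # starting sensitivities (Test 1, Test 2)
    sp = [0.8, 0.7]   # starting specificities (Test 1, Test 2)

    for _ in range(n_iter):
        dis = [0.0] * G          # expected number diseased, by population
        tot = [0.0] * G          # number of subjects, by population
        se_num, sp_num = [0.0, 0.0], [0.0, 0.0]
        se_den = sp_den = 0.0

        for g in range(G):
            for i in (0, 1):
                for j in (0, 1):
                    n = counts[g][i][j]
                    if n == 0:
                        continue
                    # E-step: posterior probability of disease given (i, j, g)
                    p_d = (prev[g]
                           * (se[0] if i else 1 - se[0])
                           * (se[1] if j else 1 - se[1]))
                    p_h = ((1 - prev[g])
                           * ((1 - sp[0]) if i else sp[0])
                           * ((1 - sp[1]) if j else sp[1]))
                    w = p_d / (p_d + p_h)

                    tot[g] += n
                    dis[g] += n * w
                    se_den += n * w
                    sp_den += n * (1 - w)
                    se_num[0] += n * w * i
                    se_num[1] += n * w * j
                    sp_num[0] += n * (1 - w) * (1 - i)
                    sp_num[1] += n * (1 - w) * (1 - j)

        # M-step: closed-form updates from the expected complete-data counts
        new_prev = [dis[g] / tot[g] for g in range(G)]
        new_se = [se_num[k] / se_den for k in (0, 1)]
        new_sp = [sp_num[k] / sp_den for k in (0, 1)]

        delta = max(abs(x - y) for x, y in
                    zip(prev + se + sp, new_prev + new_se + new_sp))
        prev, se, sp = new_prev, new_se, new_sp
        if delta < tol:
            break

    return prev, se, sp


# Made-up example data: counts[g][i][j], i = Test 1, j = Test 2, 1 = positive.
counts = [
    [[300, 20], [15, 40]],    # hypothetical low-prevalence population
    [[ 60, 25], [20, 160]],   # hypothetical high-prevalence population
]
print(em_two_tests(counts))
```

Note that with a single population and only two tests this latent-class model is not identifiable (five parameters against three degrees of freedom), which is why the sketch assumes two populations; any real analysis along these lines would need to verify such design assumptions.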
*This investigation was supported by the United States Food and Drug Administration's OSB/ODE Review Science Research Program. The views expressed here are those of the writer and not necessarily those of the United States Food and Drug Administration. Reprint address: T. S. Weng, PhD, Division of Biostatistics, Office of Surveillance & Biometrics, Center for Devices and Radiological Health, Food and Drug Administration, 1350 Piccard Drive, Rockville, MD 20850.

THE PROBLEM

In the absence of a perfect gold standard, the kappa statistic κ is extensively used as an omnibus, chance-adjusted, scalar index summarizing a 2 x 2 table of binary agreement between two diagnostic tests made on the same subject. Specifically, let us consider testing for a disease D by applying two diagnostic tests, Test 1 and Test 2, to n subjects. Suppose each testee will respond to each test with either a "positive (+)" or a "negative (−)" outcome. The frequency distribution of the subjects' response patterns can be summarized in a 2 x 2 contingency table, as in Table 1. Intuitively, if the degree of agreement between the two tests is high, then the diagonal entries a and d would be greater than the off-diagonal entries b and c. As a measure of chance-adjusted agreement, Cohen (1) proposed the kappa statistic
κ = (p_o − p_e) / (1 − p_e),   (1)

where

p_o = (a + d) / n   (2)

is the proportion of times the two tests agree, and

p_e = [(a + b)(a + c) + (b + d)(c + d)] / n²   (3)
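For concreteness, a minimal Python sketch of Eqs. (1)-(3) is given below; the function name and the example counts are made up for illustration.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for the 2 x 2 agreement table of Table 1,
    with cells a (+/+), b (+/-), c (-/+), and d (-/-)."""
    n = a + b + c + d
    p_o = (a + d) / n                                      # Eq. (2): observed agreement
    p_e = ((a + b) * (a + c) + (b + d) * (c + d)) / n**2   # Eq. (3): chance-expected agreement
    return (p_o - p_e) / (1 - p_e)                         # Eq. (1)

# Hypothetical counts: 85% observed agreement, 50% expected by chance.
print(cohen_kappa(a=40, b=10, c=5, d=45))   # 0.70
```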
TABLE 1
Frequency Distribution of Subjects' Responses to the Two Tests

                 Test 2 (+)   Test 2 (−)   Total
Test 1 (+)           a            b        a + b
Test 1 (−)           c            d        c + d
Total              a + c        b + d        n