Assessing the Goodness-of-Fit of Statistical Distributions When Data Are Grouped


Judith K. Haschenburger² and John J. Spinelli³

Modeling statistical distributions of phenomena can be compromised by the choice of goodness-of-fit statistics. The Pearson chi-square test is the most commonly used test in the geosciences, but the lesser-known empirical distribution function (EDF) statistics should be preferred in many test situations. Using a data set from geomorphology, the Anderson–Darling test for grouped exponential distributions is employed to illustrate ease of use and statistical advantages of this EDF test. Attention to the issues discussed will result in more informed statistic selection and increased rigor in the identification of distribution functions that describe random variables.

KEY WORDS: Cramér-von Mises, Anderson–Darling, EDF statistics, Pearson chi-square, grouped exponential distribution.

¹ Received 8 January 2002; accepted 7 September 2004.
² School of Geography and Environmental Science, University of Auckland, Auckland, New Zealand; e-mail: [email protected]
³ British Columbia Cancer Agency, Vancouver, British Columbia, Canada V5Z 4E6; e-mail: [email protected]

© 2005 International Association for Mathematical Geology, 0882-8121/05/0400-0261/1

INTRODUCTION

Estimating statistical distributions of geologic phenomena constitutes a common, long-standing pursuit of geoscientists (e.g., Einstein, 1937; Griffiths, 1960; Olson, 1957; Richardson, 1923; Todorovic and Zelenhasic, 1970). As an important first step, the examination of frequency distributions constructed from observed data and the computation of moments provide descriptive information for substantive interpretation, such as the environmental conditions of sediment deposits (Fieller, Gilbertson, and Olbricht, 1984; Krumbein, 1936). However, there is a fundamental need to move beyond description to identify specific probability distributions that permit generalized analysis, modeling, and inference. The establishment of the underlying distributions of random variables involves selecting plausible theoretical distribution functions, estimating the parameters that define these distributions, and objectively evaluating the similarity between empirical and theoretical distributions. Statistical distributions for continuous and discrete data are well documented (e.g., Johnson, Kotz, and Kemp, 1992). Selection of plausible theoretical distributions must be based on the nature of data (continuous or discrete) and consideration of existing statistical theory, previous research findings or, in the case of initial exploratory work, educated guesses that may favor a simple distribution over a more complex one. In some cases, parameter values may be known on theoretical grounds but, more commonly, parameters must be estimated from empirical data. Methods to generate unbiased, efficient estimates of unknown parameters are readily available (e.g., Kendall and Stuart, 1963). Goodness-of-fit statistic