Analysis of type I and II error rates of Bayesian and frequentist parametric and nonparametric two-sample hypothesis tests under preliminary assessment of normality

Riko Kelter

Received: 26 January 2020 / Accepted: 8 September 2020
© The Author(s) 2020

Abstract

Testing for differences between two groups is among the most frequently performed statistical procedures in empirical research. The traditional frequentist approach relies on null hypothesis significance tests, which use p values to reject a null hypothesis. Recently, much research has emerged that proposes Bayesian versions of the most common parametric and nonparametric frequentist two-sample tests. These proposals include Student’s two-sample t-test and its nonparametric counterpart, the Mann–Whitney U test. In this paper, the underlying assumptions and models of recently proposed Bayesian two-sample tests, together with their implications for practical research, are explored and contrasted with the frequentist solutions. An extensive simulation study is provided, the results of which demonstrate that the proposed Bayesian tests achieve better type I error control at slightly increased type II error rates. These results are important because balancing type I and II errors is a crucial goal in a variety of research, and shifting towards the Bayesian two-sample tests while simultaneously increasing the sample size yields smaller type I error rates. Moreover, the results highlight that the differences in type II error rates between frequentist and Bayesian two-sample tests depend on the magnitude of the underlying effect.

Keywords Bayesian hypothesis testing · Two-sample hypothesis tests · Null hypothesis significance testing · Parametric and non-parametric two-sample tests · Type I and II error rates
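To make the notions of type I and type II error rates concrete, the following Python sketch estimates both by Monte Carlo simulation for a frequentist Mann–Whitney U test. It is an illustration only and not the simulation design used in this paper: the normally distributed data, the sample size of 30 per group, the shift of 0.8, the number of repetitions, the significance level and the use of the large-sample normal approximation to the U statistic (without tie or continuity correction) are all assumptions, and the function names are hypothetical. Bayesian tests are not covered by the sketch.

```python
import random
from statistics import NormalDist


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p value using the large-sample normal
    approximation (assumption: no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    # U counts the pairs (xi, yj) with xi > yj; a tie counts one half.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(shift, n=30, reps=1000, alpha=0.05, seed=1):
    """Fraction of simulated data sets in which H0 is rejected.

    Both groups are drawn from unit-variance normal distributions whose
    means differ by `shift` (all settings are illustrative assumptions).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [rng.gauss(shift, 1.0) for _ in range(n)]
        if mann_whitney_p(x, y) < alpha:
            rejections += 1
    return rejections / reps


# H0 true (no shift): every rejection is a type I error.
type_i = rejection_rate(shift=0.0)
# H0 false (shift of 0.8): every non-rejection is a type II error.
type_ii = 1 - rejection_rate(shift=0.8)

print(f"estimated type I error rate:  {type_i:.3f}")
print(f"estimated type II error rate: {type_ii:.3f}")
```

With these settings the estimated type I error rate should lie near the nominal level alpha, while the estimated type II error rate depends on the chosen shift and sample size, which mirrors the dependence on effect magnitude noted in the abstract.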

Riko Kelter
[email protected]
Department of Mathematics, University of Siegen, Walter-Flex-Street 3, 57072 Siegen, Germany

1 Introduction

In much quantitative research, for example in the medical and social sciences, two-sample tests such as Student’s t-test are among the most widely used statistical procedures (Nuijten et al. 2016). In randomized controlled trials (RCTs), the goal is often to test the efficacy of a new treatment or drug and to estimate the size of an effect. In typical study designs, a treatment group and a control group are used, and differences between the two groups in a response variable such as blood pressure or cholesterol level are recorded. The gold standard for deciding whether the new treatment or drug was effective compared to the status quo treatment or drug is the p value, which is the probability, under the null hypothesis H0, of obtaining a difference equal to or more extreme than the difference observed. The dominance of p values when comparing two groups in medical (and other) research is striking: for example, Nuijten et al. (2016) showed in a large-scale meta-analysis that of 258,105 p values reported in journals between 1985 and 2013, 26% belonged to a t-statistic. Besides the importance of two-sample tests