Challenges of Evaluating the Quality of Software Engineering Experiments



Abstract

Good-quality experiments are free of bias. Bias is considered to be related to internal validity (e.g., how well experiments are planned, designed, executed, and analysed). Quality scales and expert opinion are two approaches for assessing the quality of experiments. Aim: Identify whether there is a relationship between bias and the quality predicted by quality scales and expert opinion in SE experiments. Method: We used a quality scale to determine the quality of 35 experiments from three systematic literature reviews. We used two different procedures (effect size and response ratio) to calculate the bias in diverse response variables for these experiments. Experienced researchers assessed the quality of the same experiments. We analysed the correlations between the quality scores, bias and expert opinion. Results: The relationship between quality scales, expert opinion and bias depends on the technology exercised in the experiments. Quality scales and expert opinion predict bias correctly only when the technologies can be subjected to acceptable experimental control. Both correct and incorrect expert ratings are more extreme than the quality scale scores. Conclusions: A quality scale based on formal internal quality criteria will predict bias satisfactorily, provided that the technology can be properly controlled in the laboratory.
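The abstract mentions two bias-measurement procedures, effect size and response ratio, without spelling out the formulas. The sketch below shows common textbook definitions (Cohen's d for the standardized effect size and the raw ratio of treatment to control means for the response ratio); the chapter may use variants (e.g., Hedges' g or a log response ratio), so treat the exact formulas here as illustrative assumptions rather than the authors' method.

```python
import math

def cohens_d(treatment, control):
    """Standardized mean difference (Cohen's d) between two samples."""
    n1, n2 = len(treatment), len(control)
    m1 = sum(treatment) / n1
    m2 = sum(control) / n2
    # Unbiased sample variances (denominator n - 1).
    v1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

def response_ratio(treatment, control):
    """Ratio of the treatment mean to the control mean."""
    m1 = sum(treatment) / len(treatment)
    m2 = sum(control) / len(control)
    return m1 / m2
```

For example, with `treatment = [10, 12, 14]` and `control = [8, 9, 10]`, `cohens_d` returns roughly 1.90 and `response_ratio` returns about 1.33, both indicating a sizeable treatment effect.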

1 Introduction

According to Kitchenham [1], the SLR process involves: (1) identifying experiments about a particular research topic, (2) selecting the studies relevant to the research, (3) including/excluding studies based on their quality, (4) extracting the data from

O. Dieste () • N. Juristo, Universidad Politécnica de Madrid, 28660 Boadilla del Monte, Madrid, Spain. e-mail: [email protected]; [email protected]. In: J. Münch and K. Schmid (eds.), Perspectives on the Future of Software Engineering, DOI 10.1007/978-3-642-37395-4_11, © Springer-Verlag Berlin Heidelberg 2013.


the included studies, and (5) aggregating the data to generate pieces of knowledge. The quality assessment (QA) step acts like a filter: the quality of primary studies is assessed, and poor-quality experiments are blocked from passing to the data extraction and synthesis phases. QA aims to make the review process more efficient and less error-prone. It is generally accepted that a good-quality experiment is free of bias. Freedom from bias is the result of careful planning and appropriate control during design and operation, which maximises the experiment's internal validity [2]. As bias cannot be measured directly, QA instruments are designed to assess the internal validity of experiments and infer the quality of the experiment from this assessment [2]. Checklists and quality scales are generally used for this purpose. Following the guidelines for other disciplines, Kitchenham [1] and Biolchini et al. [3] recommend a detailed QA of SE studies during the SLR process. These papers were followed by Dybå and Dingsøyr's proposal [4], which they applied
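The QA filter described above can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the checklist items, the study fields, and the 0.5 threshold are invented for the example, not taken from the chapter or from any of the cited QA instruments.

```python
def quality_score(study, checklist):
    """Fraction of quality-checklist items a study satisfies (0.0 to 1.0)."""
    return sum(1 for item in checklist if item(study)) / len(checklist)

def qa_filter(studies, checklist, threshold=0.5):
    """Keep only studies whose quality score reaches the threshold,
    blocking poor-quality experiments from extraction and synthesis."""
    return [s for s in studies if quality_score(s, checklist) >= threshold]

# Illustrative checklist: each item is a predicate over a study record.
checklist = [
    lambda s: s.get("randomised", False),       # random allocation reported?
    lambda s: s.get("controls_defined", False), # control group defined?
]

studies = [
    {"id": "E1", "randomised": True, "controls_defined": True},
    {"id": "E2", "randomised": False, "controls_defined": False},
]

included = qa_filter(studies, checklist)  # only "E1" passes the filter
```

The design point the paragraph makes is that this filter's usefulness hinges entirely on whether the checklist items actually track bias, which is the question the chapter investigates.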