On the time-based conclusion stability of cross-project defect prediction models
Abdul Ali Bangash1 · Hareem Sahar1 · Abram Hindle1 · Karim Ali1

© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

Researchers in empirical software engineering often make claims based on observable data such as defect reports. Unfortunately, in many cases, these claims are generalized beyond the data sets that have been evaluated. Will the researcher's conclusions hold a year from now for the same software projects? Perhaps not. Recent studies show that, in the area of Software Analytics, conclusions over different data sets are usually inconsistent. In this article, we empirically investigate whether conclusions in the area of cross-project defect prediction truly exhibit stability throughout time. Our investigation applies a time-aware evaluation approach in which models are trained only on past data and evaluated only on future data. Through this time-aware evaluation, we show that, depending on the time period in which we evaluate defect predictors, their performance, in terms of F-Score, area under the curve (AUC), and Matthews Correlation Coefficient (MCC), varies, and their results are not consistent. A new release of a product that differs significantly from its prior release may drastically change defect prediction performance. Therefore, without knowing about conclusion stability, empirical software engineering researchers should limit their claims of performance to the contexts of evaluation, because broad claims about defect prediction performance might be contradicted by the next release of a product under analysis.

Keywords Conclusion stability · Defect prediction · Time-aware evaluation
Communicated by: Romain Robbes

Abdul Ali Bangash
[email protected]

Hareem Sahar
[email protected]

Abram Hindle
[email protected]

Karim Ali
[email protected]

1 Department of Computing Science, University of Alberta, Edmonton, AB, Canada
1 Introduction

Defect prediction models are trained to predict future software bugs by relating historical defect data available in software archives to predictors such as structural metrics (Chidamber and Kemerer 1994; Martin 1994; Tang et al. 1999), change entropy metrics (Hassan 2009), or process metrics (Mockus and Weiss 2000). The accuracy of defect prediction models is estimated using defect data from a specific time period in the evolution of software, but the models do not necessarily generalize across other time periods. Conclusion stability is the property that a conclusion, i.e., the estimate of performance, remains stable as contexts, such as the time of evaluation, change. For example, if the conclusion of a current evaluation of a model on a software product is the same as that of an evaluation done a year ago, then we consider that conclusion to be stable. Conversely, conclusion stability is lacking when the model's performance is inconsistent with itself across time.
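To make the time-aware evaluation setting concrete, the sketch below illustrates one way such an evaluation could be set up: instances are ordered chronologically, a model is trained only on data before a cut-off date, and it is tested only on data after that date, with F-Score, AUC, and MCC computed for each cut-off. This is an illustrative sketch under assumptions, not the experimental pipeline used in this article; the column names (commit_date, bug) and the choice of scikit-learn with a random forest classifier are hypothetical stand-ins for the example.

```python
# Illustrative sketch of a time-aware evaluation (not this article's actual pipeline).
# Assumptions: a pandas DataFrame with a 'commit_date' column, metric feature
# columns, and a binary 'bug' label; RandomForestClassifier stands in for any
# defect prediction model.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score, matthews_corrcoef


def time_aware_evaluation(data: pd.DataFrame, cutoff: pd.Timestamp,
                          feature_cols: list[str]) -> dict:
    """Train only on instances before `cutoff`, test only on instances after it."""
    data = data.sort_values("commit_date")
    train = data[data["commit_date"] < cutoff]
    test = data[data["commit_date"] >= cutoff]

    model = RandomForestClassifier(random_state=0)
    model.fit(train[feature_cols], train["bug"])

    predictions = model.predict(test[feature_cols])
    scores = model.predict_proba(test[feature_cols])[:, 1]

    return {
        "f_score": f1_score(test["bug"], predictions),
        "auc": roc_auc_score(test["bug"], scores),
        "mcc": matthews_corrcoef(test["bug"], predictions),
    }
```

Repeating such an evaluation with different cut-off dates shows how the performance estimate, and hence the conclusion drawn from it, can drift over time.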