Assessing and mitigating the effects of class imbalance in machine learning with application to X-ray imaging


ORIGINAL ARTICLE

Wendi Qu1 · Indranil Balki1 · Mauro Mendez1 · John Valen1 · Jacob Levman2,3 · Pascal N. Tyrrell1,4,5

Received: 8 May 2020 / Accepted: 4 September 2020
© CARS 2020

Abstract

Purpose Machine learning (ML) algorithms are well known to vary in prediction accuracy when trained on imbalanced datasets. Such datasets are typical in medical imaging (MI), where pathological and normal cases occur in unequal proportions. This paper presents a thorough investigation of the effects of class imbalance, and of methods for mitigating it, in ML algorithms applied to MI.

Methods We first selected five classes from the Image Retrieval in Medical Applications (IRMA) dataset and performed multiclass classification using the random forest model (RFM). We then performed binary classification using a convolutional neural network (CNN) on a chest X-ray dataset. An imbalanced class was created in the training set by varying the number of images in that class. The methods tested to mitigate class imbalance were oversampling, undersampling, and changing the class weights of the RFM. Model performance was assessed by overall classification accuracy, overall F1 score, and the specificity, recall, and precision of the imbalanced class.

Results A close-to-balanced training set resulted in the best model performance, and a large imbalance with overrepresentation was more detrimental to model performance than underrepresentation. Oversampling and undersampling methods were both effective in mitigating class imbalance, and the efficacy of oversampling techniques was class specific.

Conclusion This study systematically demonstrates the effect of class imbalance on two public X-ray datasets using RFM and CNN, making these findings widely applicable as a reference. Furthermore, the methods employed here can guide researchers in assessing and addressing the effects of class imbalance, while considering data-specific characteristics to optimize imbalance-mitigating methods.

Keywords Machine learning · Medical imaging · Class imbalance · Radiology · X-ray
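The mitigation strategies and per-class metrics named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the function names, the class labels "chest" and "hand", and the inverse-frequency weighting formula are assumptions made for illustration only, since this section does not specify the resampling variants or the weighting scheme applied to the RFM.

```python
import random
from collections import Counter


def random_oversample(labels, target_class, target_count, seed=0):
    """Return sample indices in which target_class is grown to target_count
    by duplicating its members at random (sampling with replacement)."""
    rng = random.Random(seed)
    members = [i for i, y in enumerate(labels) if y == target_class]
    if not members:
        raise ValueError("target_class has no samples to duplicate")
    extra = max(0, target_count - len(members))
    return list(range(len(labels))) + [rng.choice(members) for _ in range(extra)]


def random_undersample(labels, target_class, target_count, seed=0):
    """Return sample indices in which target_class is cut down to at most
    target_count members, chosen at random without replacement."""
    rng = random.Random(seed)
    members = [i for i, y in enumerate(labels) if y == target_class]
    others = [i for i, y in enumerate(labels) if y != target_class]
    kept = rng.sample(members, min(target_count, len(members)))
    return sorted(others + kept)


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count),
    so rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Precision, recall, specificity, and F1 for one class of interest,
    treating every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Report 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical toy training labels: 8 "chest" images and 2 "hand" images.
labels = ["chest"] * 8 + ["hand"] * 2
oversampled = random_oversample(labels, "hand", target_count=8)
undersampled = random_undersample(labels, "chest", target_count=2)
weights = inverse_frequency_weights(labels)
```

The resampling helpers return indices rather than copies of the data, so the same index list can select both images and labels. Resampling would normally be applied to the training set only, leaving the test set untouched so that the reported metrics reflect the original class distribution.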

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11548-020-02260-6) contains supplementary material, which is available to authorized users.

Pascal N. Tyrrell [email protected]

1 Department of Medical Imaging, University of Toronto, Toronto, ON M5T 1W7, Canada
2 Department of Mathematics, Statistics and Computer Science, St Francis Xavier University, Antigonish, NS, Canada
3 Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA
4 Department of Statistical Sciences, University of Toronto, Toronto, ON, Canada
5 Institute of Medical Science, University of Toronto, Toronto, ON, Canada

Introduction

Machine learning (ML) models rely on training datasets to learn from labelled data in order to make predictions on new unlabelled data. Class imbalance refers to differences in the number of samples represen