ORIGINAL ARTICLE

A comparison of regularized logistic regression and random forest machine learning models for daytime diagnosis of obstructive sleep apnea

Farahnaz Hajipour 1 & Mohammad Jafari Jozani 2 & Zahra Moussavi 1,3

Received: 26 October 2019 / Accepted: 23 May 2020
© International Federation for Medical and Biological Engineering 2020

Abstract A major challenge in the analysis of big, high-dimensional data is classifying and predicting the variables of interest by characterizing the relationships between the characteristic factors and predictors. This study assesses the utility of two important machine-learning techniques for classifying subjects with obstructive sleep apnea (OSA) using their daytime tracheal breathing sounds. We evaluate and compare the performance of the random forest (RF) and regularized logistic regression (LR) as feature selection tools and classification approaches for wakefulness OSA screening. Results show that the RF, which is a low-variance committee-based approach, outperforms the regularized LR in terms of blind-testing accuracy, specificity, and sensitivity, with improvements of 3.5%, 2.4%, and 3.7%, respectively. However, the regularized LR was found to be faster than the RF and resulted in a more parsimonious model. Consequently, both the RF and the regularized LR feature reduction and classification approaches are suitable for daytime OSA screening studies, depending on the nature of the data and the purpose of the application.
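To illustrate why an L1-regularized (LASSO) LR tends to yield a parsimonious model, the following is a minimal sketch in plain Python. It is not the authors' implementation: the proximal gradient solver, the function names (`soft_threshold`, `fit_lasso_logistic`), the penalty values, and the synthetic data are all assumptions chosen for illustration. The soft-thresholding step sets small coefficients to exactly zero, which is what performs the feature selection.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything inside the band becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=300):
    """Proximal gradient descent for L1-penalized logistic regression.

    Minimizes mean log-loss + lam * sum(|w_j|); the intercept b is unpenalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step followed by soft-thresholding (the L1 proximal operator).
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Hypothetical synthetic data: only feature 0 carries signal, 1-4 are noise.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(200)]
y = [1 if row[0] > 0 else 0 for row in X]

w_sparse, _ = fit_lasso_logistic(X, y, lam=0.05)  # moderate penalty
w_null, _ = fit_lasso_logistic(X, y, lam=1.0)     # penalty large enough to zero all weights
selected = [j for j, wj in enumerate(w_sparse) if wj != 0.0]
print("selected features:", selected)
```

With the moderate penalty the informative feature keeps a nonzero weight, while the large penalty drives every coefficient to zero. In practice the penalty strength would be tuned, for example by cross-validation, rather than fixed as it is here.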

Keywords Feature selection · Classification · Regularized logistic regression · LASSO · Random forest · Obstructive sleep apnea

Abbreviations
AHI Apnea-Hypopnea Index
ANOVA Analysis of variance
AUC Area under the curve
CI Confidence interval
LASSO Least absolute shrinkage and selection operator
LR Logistic regression
MANOVA Multivariate analysis of variance
NC Neck circumference
OSA Obstructive sleep apnea
OOB Out-of-bag
PSD Power spectrum density
PSG Polysomnography
ROC Receiver operating characteristics
RF Random forest
TBS Tracheal breathing sounds

* Farahnaz Hajipour [email protected]

1 Biomedical Engineering Program, University of Manitoba, Winnipeg, Canada
2 Department of Statistics, University of Manitoba, Winnipeg, Canada
3 Electrical and Computer Engineering Department, University of Manitoba, Winnipeg, Canada

1 Introduction

Nowadays, the world is in the “Big Data” era, as most available data are stored [1]. For example, in medical fields, the stored data include patients’ personal, family, and demographic information; the history of their diseases; and the results of their various medical tests. Big Data analysis requires a fair knowledge of the data being processed and the proper use of intelligent algorithms, both to extract knowledge about the relationships between predictors and variables of interest and to perform classification and prediction. When dealing with large and high-dimensional datasets, it is possible to extract a considerable number of features from the data. To build parsimonious models tha