A machine learning-based clinical tool for diagnosing myopathy using multi-cohort microarray expression profiles


Journal of Translational Medicine Open Access

RESEARCH

A machine learning‑based clinical tool for diagnosing myopathy using multi‑cohort microarray expression profiles

Andrew Tran1†, Chris J. Walsh2,3†, Jane Batt2,4, Claudia C. dos Santos2,4* and Pingzhao Hu1,5,6*

Abstract

Background: Myopathies are a heterogeneous collection of disorders characterized by dysfunction of skeletal muscle. In practice, myopathies are frequently encountered by physicians and precise diagnosis remains a challenge in primary care. Molecular expression profiles show promise for disease diagnosis in various pathologies. We propose a novel machine learning-based clinical tool for predicting muscle disease subtypes using multi-cohort microarray expression data.

Materials and methods: The data comprise muscle tissue samples originating from 1260 patients with muscle weakness. Data were curated from 42 independent cohorts with expression profiles in public microarray gene expression repositories, which represent a broad range of patient ages and peripheral muscles. Cohorts were categorized into five muscle disease subtypes: immobility, inflammatory myopathies, intensive care unit acquired weakness (ICUAW), congenital, and chronic systemic disease. The data contain expression values for 34,099 genes. Data augmentation techniques were used to address class imbalances in the muscle disease subtypes. Support vector machine (SVM) models were trained on two-thirds of the 1260 samples, using the top gene signature selected by analysis of variance (ANOVA). The model was validated in the remaining samples using the area under the receiver operating characteristic curve (AUC). Gene enrichment analysis was used to identify enriched biological functions in the gene signature.

Results: The AUC ranges from 0.611 to 0.649 in the observed imbalanced data. Overall, using the augmented data, chronic systemic disease was the best predicted class with AUC 0.872 (95% confidence interval (CI): 0.824–0.920). The least discriminated classes were ICUAW with AUC 0.777 (95% CI: 0.668–0.887) and immobility with AUC 0.789 (95% CI: 0.716–0.861).
Disease-specific gene set enrichment results showed that the gene signature was enriched in biological processes including neural precursor cell proliferation for ICUAW and aerobic respiration for congenital (false discovery rate q-value […]

[…] 10,000 genes, (3) the probe-to-gene mapping annotations were clear, (4) there were at least 5 cases and at least 5 controls in each dataset, and (5) the controls were derived from healthy muscle tissue. Samples taken after intervention (e.g. after cancer resection) were excluded. The datasets were classified into 5 muscle disease categories: immobility, inflammatory myopathies, ICU acquired weakness (ICUAW), congenital, and chronic systemic disease. All analyses carried out in this study were performed using R version 3.6.3.
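The dataset inclusion criteria (3) to (5) and the exclusion of post-intervention samples can be illustrated with a minimal sketch. The analyses in this study were performed in R; the Python below is only an illustration, and every name in it (`Sample`, `Cohort`, `usable_samples`, `is_eligible`, `MIN_PER_GROUP`) is hypothetical. Criteria (1) and (2) are not modelled because their wording is incomplete in the text above. The sketch also assumes that cases and controls are counted after post-intervention samples are dropped, which the text does not state.

```python
from dataclasses import dataclass
from typing import List

MIN_PER_GROUP = 5  # criterion (4): at least 5 cases and at least 5 controls


@dataclass
class Sample:
    sample_id: str            # hypothetical identifier
    is_case: bool             # True for a patient sample, False for a control
    after_intervention: bool  # e.g. taken after cancer resection


@dataclass
class Cohort:
    name: str                           # hypothetical label, not a real accession
    samples: List[Sample]
    clear_probe_annotation: bool        # criterion (3)
    controls_from_healthy_muscle: bool  # criterion (5)


def usable_samples(cohort: Cohort) -> List[Sample]:
    """Drop samples taken after an intervention."""
    return [s for s in cohort.samples if not s.after_intervention]


def is_eligible(cohort: Cohort) -> bool:
    """Apply criteria (3) to (5) to the samples that remain.

    Assumption: group sizes are counted after the post-intervention
    samples have been removed.
    """
    kept = usable_samples(cohort)
    n_cases = sum(s.is_case for s in kept)
    n_controls = len(kept) - n_cases
    return (
        cohort.clear_probe_annotation
        and cohort.controls_from_healthy_muscle
        and n_cases >= MIN_PER_GROUP
        and n_controls >= MIN_PER_GROUP
    )
```

Under this assumption, a cohort with exactly 5 cases, one of which was sampled after an intervention, would fall below the threshold and be excluded.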

Tran et al. J Transl Med (2020) 18:454


Fig. 1 Model training and validation workflow. The original, augmented, and combined expression profile data are referred to as T0, T1, and T2, respectively.
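Three steps of the workflow described in the abstract can be sketched in a few lines: ranking genes by a one-way ANOVA F statistic, holding out one-third of the samples, and scoring one class against the rest with AUC. This is a minimal illustration in Python, not the authors' R implementation. The function names (`anova_f`, `top_genes`, `split_two_thirds`, `auc`) and the fixed random seed are assumptions. The SVM training and data augmentation steps are deliberately left out, so `auc` accepts decision scores from any classifier.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic for one gene across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        # Degenerate case: no spread inside any class.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_genes(X, y, k):
    """Indices of the k columns of X (genes) with the largest F statistic."""
    n_genes = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: scores[j], reverse=True)[:k]


def split_two_thirds(n_samples, seed=0):
    """Shuffle sample indices and return (training two-thirds, held-out rest)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = (2 * n_samples) // 3
    return idx[:cut], idx[cut:]


def auc(scores, is_positive):
    """One-vs-rest AUC: the chance that a positive sample outscores a
    negative one, with ties counted as one half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In a leakage-free setup, `top_genes` would be applied to the training indices returned by `split_two_thirds` only, and `auc` would be computed once per disease subtype on the held-out samples.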