RESEARCH

Open Access

Improving lung cancer risk stratification leveraging whole transcriptome RNA sequencing and machine learning across multiple cohorts

Yoonha Choi1†, Jianghan Qu1†, Shuyang Wu1, Yangyang Hao1, Jiarui Zhang2, Jianchang Ning1, Xinwu Yang1, Lori Lofaro1, Daniel G. Pankratz1, Joshua Babiarz1, P. Sean Walsh1, Ehab Billatos2, Marc E. Lenburg2, Giulia C. Kennedy1, Jon McAuliffe3 and Jing Huang1*

From The 18th Asia Pacific Bioinformatics Conference, Seoul, Korea. 18-20 August 2020

Abstract

Background: Bronchoscopy for suspected lung cancer has low diagnostic sensitivity, leaving many results inconclusive. The Bronchial Genomic Classifier (BGC) was developed to aid patient management by identifying patients at low risk of lung cancer when bronchoscopy is inconclusive. The BGC was trained and validated on patients in the Airway Epithelial Gene Expression in the Diagnosis of Lung Cancer (AEGIS) trials. A modern patient cohort, the BGC Registry, differed from the AEGIS cohorts in key clinical factors: its patients had less smoking history, had smaller nodules, and were older. Additionally, we discovered interfering factors (inhaled medication and sample collection timing) that affected gene expression and potentially masked genomic cancer signals.

Methods: In this study, we leveraged multiple cohorts and next-generation sequencing technology to develop a robust Genomic Sequencing Classifier (GSC). To address the shift in demographic composition and the interfering factors, we combined three algorithmic strategies: 1) an ensemble of clinical-dominant and genomic-dominant models; 2) hierarchical regression models in which the main effects of clinical variables were regressed out before the genomic effects were fitted; and 3) targeted placement of genomic-by-clinical interaction terms to stabilize the effect of interfering factors. The final GSC model uses 1232 genes and four clinical covariates: age, pack-years, inhaled medication use, and specimen collection timing.
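The three strategies named in the Methods can be illustrated with a minimal sketch. This is not the authors' GSC implementation: it assumes synthetic toy data, a plain gradient-descent logistic regression written in NumPy, and one reading of "regressed out" in which the stage-1 clinical score is held fixed as an offset while the genomic terms are fitted. All function names, the toy effect sizes, and the equal ensemble weight are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_logistic(X, y, offset=None, l2=1.0, lr=0.1, n_iter=500):
    """Ridge-penalized logistic regression by gradient descent.
    `offset` is a fixed term added to the linear predictor (not fitted)."""
    n, p = X.shape
    offset = np.zeros(n) if offset is None else offset
    w, b = np.zeros(p), 0.0
    for _ in range(n_iter):
        resid = sigmoid(X @ w + b + offset) - y
        w -= lr * (X.T @ resid + l2 * w) / n
        b -= lr * resid.mean()
    return w, b

def make_toy_data(n=400, n_genes=20, seed=0):
    """Synthetic stand-in data (assumption): two standardized clinical
    covariates, one binary interfering factor, and Gaussian 'genes'."""
    rng = np.random.default_rng(seed)
    clinical = rng.normal(size=(n, 2))          # e.g. age, pack-years
    inhaled = rng.integers(0, 2, size=n).astype(float)
    genes = rng.normal(size=(n, n_genes))
    # The interfering factor dampens the signal carried by gene 0.
    logit = (0.8 * clinical[:, 0] + 0.6 * clinical[:, 1]
             + genes[:, 0] * (1.0 - 0.5 * inhaled))
    y = (rng.random(n) < sigmoid(logit)).astype(float)
    return clinical, genes, inhaled, y

def genomic_design(genes, inhaled):
    # Strategy 3: gene main effects plus gene x interfering-factor terms.
    return np.column_stack([genes, genes * inhaled[:, None]])

def fit_hierarchical(clinical, genes, inhaled, y):
    # Strategy 2, stage 1: fit clinical main effects alone.
    w_c, b_c = fit_logistic(clinical, y, l2=0.0)
    offset = clinical @ w_c + b_c
    # Stage 2: fit genomic terms with the clinical score held fixed.
    w_g, b_g = fit_logistic(genomic_design(genes, inhaled), y, offset=offset)
    return {"w_c": w_c, "b_c": b_c, "w_g": w_g, "b_g": b_g}

def predict_clinical(m, clinical):
    return sigmoid(clinical @ m["w_c"] + m["b_c"])

def predict_hierarchical(m, clinical, genes, inhaled):
    offset = clinical @ m["w_c"] + m["b_c"]
    return sigmoid(genomic_design(genes, inhaled) @ m["w_g"] + m["b_g"] + offset)

def predict_ensemble(m, clinical, genes, inhaled, weight=0.5):
    # Strategy 1: average a clinical-only score with the hierarchical score.
    return (weight * predict_clinical(m, clinical)
            + (1 - weight) * predict_hierarchical(m, clinical, genes, inhaled))

if __name__ == "__main__":
    clinical, genes, inhaled, y = make_toy_data()
    model = fit_hierarchical(clinical, genes, inhaled, y)
    scores = predict_ensemble(model, clinical, genes, inhaled)
    print("training log loss (ensemble):", log_loss(y, scores))
```

In this sketch the clinical-only model stands in for a "clinical-dominant" ensemble member and the two-stage model for a "genomic-dominant" one; how the published classifier defines and weights its members is not stated in the abstract.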

* Correspondence: [email protected]
† Yoonha Choi and Jianghan Qu contributed equally to this work.
1 Veracyte, Inc., South San Francisco, CA 94080, USA
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.