Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogen


RESEARCH

Open Access

Considerations for feature selection using gene pairs and applications in large-scale dataset integration, novel oncogene discovery, and interpretable cancer screening

Laura Moody1, Hong Chen1,2 and Yuan-Xiang Pan1,2,3*

From The 18th Asia Pacific Bioinformatics Conference, Seoul, Korea. 18-20 August 2020

Abstract

Background: Advancements in transcriptomic profiling have led to the emergence of new challenges regarding data integration and interpretability. Variability between measurement platforms makes it difficult to compare cohorts, and large numbers of gene features have encouraged the use of black-box methods that are not easily translated into biologically and clinically meaningful findings. We propose that gene rankings and algorithms that rely on relative expression within gene pairs can address such obstacles.

Methods: We implemented an innovative process to evaluate the performance of five feature selection methods on simulated gene-pair data. Along with top-scoring pairs (TSP), we consider other methods that retain more information in their score calculations, including the magnitude of gene expression change as well as within-class variation. Tree-based rule extraction was also applied to serum microRNA (miRNA) pairs in order to devise a noninvasive screening tool for pancreatic and ovarian cancer.

Results: Gene pair data were simulated using different types of signal and noise. Pairs were filtered using feature selection approaches, including TSP, absolute differences between gene ranks, and Fisher scores. Methods that retain more information, such as the magnitude of expression change and within-class variance, yielded higher classification accuracy using a random forest model. We then demonstrate two powerful applications of gene pairs by first performing large-scale integration of 52 breast cancer datasets consisting of 10,350 patients. Not only did we confirm known oncogenes, but we also propose novel tumorigenic genes, such as BSDC1 and U2AF1, that could distinguish between tumor subtypes. Finally, circulating miRNA pairs were filtered and salient rules were extracted to build simplified tree ensemble learners (STELs) for four types of cancer. These accessible clinical frameworks detected pancreatic and ovarian cancer with 84.8% and 93.6% accuracy, respectively.
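To make the contrast between pair-scoring methods concrete, the following is a minimal Python sketch of two of the scores named in the abstract. It is an illustration under stated assumptions rather than the authors' implementation: the TSP score is taken in its commonly used form, the absolute difference between classes in the frequency with which one gene is expressed below the other, and the Fisher score is taken as the squared difference of class means divided by the sum of class variances, applied here to the within-pair expression difference. The function names, the two-class 0/1 labels, and the toy values are all hypothetical.

```python
from statistics import mean, pvariance


def tsp_score(gene_a, gene_b, labels):
    """Assumed TSP form: |P(a < b | class 0) - P(a < b | class 1)|."""
    freq = {}
    for cls in (0, 1):
        # Flag, for every sample in this class, whether gene a is below gene b.
        flags = [a < b for a, b, y in zip(gene_a, gene_b, labels) if y == cls]
        freq[cls] = sum(flags) / len(flags)
    return abs(freq[0] - freq[1])


def fisher_score(values, labels):
    """Assumed Fisher form: (mean0 - mean1)^2 / (var0 + var1)."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    denom = pvariance(g0) + pvariance(g1)
    # Guard against zero within-class variance in both classes.
    return (mean(g0) - mean(g1)) ** 2 / denom if denom > 0 else float("inf")


# Hypothetical toy data: six samples, three per class.
gene_a = [1.0, 2.0, 1.5, 5.0, 6.0, 5.5]
gene_b = [3.0, 4.0, 3.5, 2.0, 1.0, 2.5]
labels = [0, 0, 0, 1, 1, 1]

# The within-pair difference keeps the magnitude of change, which the
# order-only TSP score discards.
diff = [a - b for a, b in zip(gene_a, gene_b)]

tsp = tsp_score(gene_a, gene_b, labels)
fisher = fisher_score(diff, labels)
print(f"TSP score: {tsp:.3f}, Fisher score: {fisher:.3f}")
```

The TSP score depends only on the ordering within each pair, so it is unchanged by any monotone rescaling of a sample, whereas the Fisher score also reflects how large and how consistent the within-pair difference is. That is the sense in which the abstract describes some methods as retaining more information.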

* Correspondence: [email protected]
1 Division of Nutritional Sciences, University of Illinois Urbana-Champaign, 461 Bevier Hall, 905 South Goodwin Avenue, Urbana, IL 61801, USA
2 Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL, USA
Full list of author information is available at the end of the article

© The Author(s). 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons l
