Effective Disease Prediction on Gene Family Abundance Using Feature Selection and Binning Approach


Abstract Metagenomics is now a novel data source for supporting the diagnosis and prognosis of human diseases. Numerous studies have pointed to the crucial role of metagenomics in personalized medicine. In recent years, machine learning has been widely deployed in metagenomic research. Gene family data are usually very high-dimensional, with up to millions of features, while the number of samples is small compared to the number of attributes. As a result, models often perform poorly on validation sets even though they reach high accuracy during the training phase. Moreover, the very large number of features in each gene family dataset makes processing and learning time-consuming. In this study, we propose feature selection methods using Ridge Regression on gene family datasets; the selected features are then binned with an equal-width binning approach and fed into either a Linear Regression or a One-Dimensional Convolutional Neural Network (CNN1D) for prediction. The experiments are conducted on more than 1000 samples from gene family abundance datasets related to Liver Cirrhosis, Colorectal Cancer, Inflammatory Bowel Disease, Obesity and Type 2 Diabetes. The proposed combination of feature selection and binning shows significant improvements in both prediction performance and execution time compared to state-of-the-art methods.

Keywords Gene family abundance · Disease prediction · Metagenomic · Feature selection
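The two preprocessing steps named in the abstract, Ridge-based feature selection followed by equal-width binning, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the choice to rank features by absolute Ridge coefficient, the regularization strength `alpha`, the number of bins, and the use of a single global value range for binning are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def ridge_coefficients(X, y, alpha=1.0):
    """Ridge weights via the dual form, which is cheap when samples << features.

    Solves w = X^T (X X^T + alpha * I)^(-1) y on centered data.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_samples = Xc.shape[0]
    dual = np.linalg.solve(Xc @ Xc.T + alpha * np.eye(n_samples), yc)
    return Xc.T @ dual


def select_top_k(X, y, k, alpha=1.0):
    """Indices of the k features with the largest |Ridge coefficient| (assumed criterion)."""
    weights = ridge_coefficients(X, y, alpha)
    top = np.argsort(-np.abs(weights))[:k]
    return np.sort(top)


def equal_width_bin(X, n_bins=10):
    """Map every value to a bin index in [0, n_bins - 1] using equal-width bins.

    Bin edges span the global min..max of X (an assumption; per-feature
    ranges would be another reasonable choice).
    """
    edges = np.linspace(X.min(), X.max(), n_bins + 1)
    # Only the interior edges are passed, so labels run from 0 to n_bins - 1.
    return np.digitize(X, edges[1:-1])


# Synthetic stand-in for a gene family abundance table: few samples, many features.
rng = np.random.default_rng(0)
X = rng.random((40, 500))
y = (X[:, 3] + X[:, 7] > 1.0).astype(float)  # hypothetical disease label

selected = select_top_k(X, y, k=50)
binned = equal_width_bin(X[:, selected], n_bins=10)
print(selected.shape, binned.shape)
```

The binned matrix would then be passed to the downstream classifier. Ranking by coefficient magnitude is only meaningful when features are on comparable scales, so real abundance data may need normalization first.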

T.-H. Nguyen (B) · T.-T. Phan · C.-T. Dao · D.-V.-P. Ta · T.-N.-C. Nguyen · N.-M.-T. Phan · H.-N. Pham
College of Information Communication of Technology, Can Tho University, Can Tho, Vietnam
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
H. Kim et al. (eds.), IT Convergence and Security, Lecture Notes in Electrical Engineering 712, https://doi.org/10.1007/978-981-15-9354-3_2


1 Introduction

Metagenomics (also called Environmental Genomics, Ecogenomics or Community Genomics) is the direct study of communities of microbial organisms in their natural environments by applying modern genomic techniques [1]. Applying metagenomic sequence information will facilitate the design of better culturing strategies that link genomic analysis with pure culture studies. Over the past 20 years, the development of information technology has supported metagenomic analysis, human genome research and genome analysis of pathogenic microorganisms, leading to antibiotic research worldwide. At the same time, with the Next Generation Sequencing (NGS) [2] technique, the human genome has been decoded, making it possible to detect rare Crohn's disease mutations, which researchers have identified and sought to prevent. Currently, metagenomics is a promising new data source for supporting primary care and diagnosis in human health. As described in Ehrlich, this data source can assist in diagnosing diseases, for
