Effective Disease Prediction on Gene Family Abundance Using Feature Selection and Binning Approach


Abstract

Metagenomics is now a novel data source for supporting the diagnosis and prognosis of human diseases. Numerous studies have pointed to the crucial role of metagenomics in personalized medicine approaches. In recent years, machine learning has been widely deployed in a vast amount of metagenomic research. Gene family data are usually characterized by very high dimensionality, which can reach millions of features, while the number of samples obtained is rather small compared to the number of attributes. As a result, models often perform poorly on validation sets even though they reach high accuracy during the training phase. Moreover, the very large number of features in each gene family dataset consumes considerable time in processing and learning. In this study, we propose feature selection methods using Ridge Regression on gene family datasets. The resulting set of selected features is then binned with an equal width binning approach and fed into either a Linear Regression or a One-Dimensional Convolutional Neural Network (CNN1D) to perform the prediction tasks. The experiments are conducted on more than 1000 samples of gene family abundance datasets related to Liver Cirrhosis, Colorectal Cancer, Inflammatory Bowel Disease, Obesity and Type 2 Diabetes. The proposed method, which combines feature selection algorithms with binning, shows significant improvements in both prediction performance and execution time compared to the state-of-the-art methods.

Keywords Gene family abundance · Disease prediction · Metagenomic · Feature selection
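The two preprocessing steps named in the abstract, Ridge-based feature selection followed by equal width binning, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the regularization strength `alpha`, the number of kept features `k`, the number of bins `n_bins`, and the choice to bin over the global value range of the matrix are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def ridge_coefficients(X, y, alpha=1.0):
    """Closed-form ridge regression weights on mean-centered data.

    Assumption: disease labels are encoded as 0/1 and regressed directly.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, Xc.T @ yc)


def select_top_k(X, y, k, alpha=1.0):
    """Return sorted indices of the k features with the largest |weight|."""
    weights = ridge_coefficients(X, y, alpha)
    return np.sort(np.argsort(np.abs(weights))[-k:])


def equal_width_bin(X, n_bins=10):
    """Replace each value by its bin index in [0, n_bins - 1].

    Bins have equal width and span the global min..max of X
    (an assumption; per-feature ranges are another option).
    """
    lo, hi = X.min(), X.max()
    if hi == lo:
        return np.zeros(X.shape, dtype=int)
    edges = np.linspace(lo, hi, n_bins + 1)
    # Only interior edges are passed so that the maximum lands in the last bin.
    return np.digitize(X, edges[1:-1])
```

In such a sketch, the abundance matrix would first be reduced with `select_top_k`, and the retained columns would then be passed through `equal_width_bin` before being handed to the classifier. For gene family data with millions of features, the closed-form solve above would be too expensive, so an iterative or dual-form ridge solver would be the more practical choice.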

T.-H. Nguyen (✉) · T.-T. Phan · C.-T. Dao · D.-V.-P. Ta · T.-N.-C. Nguyen · N.-M.-T. Phan · H.-N. Pham
College of Information Communication of Technology, Can Tho University, Can Tho, Vietnam
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021
H. Kim et al. (eds.), IT Convergence and Security, Lecture Notes in Electrical Engineering 712, https://doi.org/10.1007/978-981-15-9354-3_2


1 Introduction

Metagenomics (Environmental Genomics, Ecogenomics or Community Genomics) is the direct study of communities of microbial organisms in their natural environments by applying modern genomic techniques [1]. The application of metagenomic sequence information will facilitate the design of better culturing strategies to link genomic analysis with pure culture studies. Over the past 20 years, advances in information technology have supported metagenomic analysis, human genome research and genome analysis of pathogenic microorganisms, leading to antibiotic research worldwide. At the same time, with Next Generation Sequencing (NGS) [2] techniques, the human genome has been decoded, and rare mutations associated with Crohn's disease have been identified with a view to prevention. Currently, metagenomics is a promising new data source for supporting primary care and diagnosis in human health. As described in Ehrlich, this data source can assist in diagnosing diseases, for