Dealing with heterogeneity in the context of distributed feature selection for classification
José Luis Morillo-Salas1 · Verónica Bolón-Canedo1 · Amparo Alonso-Betanzos1
Received: 9 October 2019 / Revised: 22 October 2020 / Accepted: 1 November 2020 © Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract

Advances in information technologies have greatly contributed to the advent of larger datasets. These datasets often come from distributed sites, but even so, their large size usually means they cannot be handled in a centralized manner. A possible solution to this problem is to distribute the data over several processors and combine the different results. We propose a methodology to distribute feature selection processes based on selecting relevant and discarding irrelevant features. This preprocessing step is essential for current high-dimensional sets, since it allows the input dimension to be reduced. We pay particular attention to the problem of data imbalance, which occurs either because the original dataset is unbalanced or because the dataset becomes unbalanced after data partitioning. Most works approach unbalanced scenarios by oversampling, while our proposal tests both over- and undersampling strategies. Experimental results demonstrate that our distributed approach to classification obtains accuracy comparable to a centralized approach, while reducing computational time and efficiently dealing with data imbalance.

Keywords Feature selection · Distributed learning · Unbalanced data · Oversampling
This research has been financially supported in part by the Spanish Ministerio de Economía y Competitividad (research projects TIN2015-65069-C2-1-R and PID2019-109238GB-C22), by European Union FEDER funds and by the Consellería de Industria of the Xunta de Galicia (research project ED431C 2018/34). Financial support from the Xunta de Galicia (Centro singular de investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund—ERDF), is gratefully acknowledged (research project ED431G 2019/01).
Verónica Bolón-Canedo [email protected]
José Luis Morillo-Salas [email protected]
Amparo Alonso-Betanzos [email protected]
1 CITIC, Grupo LIDIA, Universidade da Coruña, Campus de Elviña, 15071 A Coruña, Spain
1 Introduction

Feature selection (FS) is a popular machine learning technique whereby the attributes that allow a problem to be clearly defined are selected, while irrelevant or redundant attributes are discarded [1]. Traditionally, an FS algorithm is applied in a centralized manner, i.e., a single selector is used to solve a given problem. However, in a big data scenario, data are often distributed, and a distributed learning approach allows multiple subsets of data to be processed in sequence or concurrently. While there are several ways to distribute an FS task, the two most common ways are as follows: (i) an identical FS algorithm is run on data stored together in one very large dataset and distributed int
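To make the distributed setting concrete, the following is a minimal sketch (not the paper's actual method) of horizontally distributed filter-based FS: each node applies the same filter to its own partition of the rows, keeping features whose absolute correlation with the class label exceeds a threshold, and the per-node selections are then merged by union. All function names (`select_on_partition`, `distributed_select`) and the specific filter and combination rule are illustrative assumptions.

```python
# Illustrative sketch of distributed filter-based feature selection.
# The filter (absolute Pearson correlation with the label), the threshold,
# and the union-based merge are assumptions, not the paper's methodology.

def abs_pearson(xs, ys):
    """Absolute Pearson correlation between a feature column and the labels."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / (vx ** 0.5 * vy ** 0.5))

def select_on_partition(rows, labels, threshold=0.3):
    """Run the filter locally: keep indices of features correlated with the label."""
    n_features = len(rows[0])
    return {
        j for j in range(n_features)
        if abs_pearson([r[j] for r in rows], labels) >= threshold
    }

def distributed_select(partitions, threshold=0.3):
    """Combine per-partition selections by union (one possible merge rule)."""
    selected = set()
    for rows, labels in partitions:
        selected |= select_on_partition(rows, labels, threshold)
    return sorted(selected)

# Toy data: feature 0 copies the label (relevant); feature 1 is
# a repeating pattern uncorrelated with the label (irrelevant).
labels = [i % 2 for i in range(12)]
rows = [[labels[i], (i % 3) - 1] for i in range(12)]
partitions = [(rows[:6], labels[:6]), (rows[6:], labels[6:])]
print(distributed_select(partitions))  # -> [0]
```

Union is the most permissive merge rule; a stricter alternative is intersection or majority voting across partitions, which trades recall of relevant features for a smaller final subset.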