Large margin classifiers to generate synthetic data for imbalanced datasets



Marcelo Ladeira Marques 1 · Saulo Moraes Villela 1 · Carlos Cristiano Hasenclever Borges 1

Corresponding author: Carlos Cristiano Hasenclever Borges, [email protected]
Marcelo Ladeira Marques, [email protected]
Saulo Moraes Villela, [email protected]

1 Department of Computer Science, Federal University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
In this paper, we propose an approach capable of improving the results obtained by classification algorithms when applied to imbalanced datasets. The method, called Incremental Synthetic Balancing Algorithm (ISBA), performs an iterative procedure based on large margin classifiers, aiming to generate synthetic samples in order to reduce the level of imbalance. In the process, we use the support vectors as references for the generation of new instances, allowing them to be positioned in regions with greater representativeness. Furthermore, the new samples can exceed the limits of the instances used for their generation, which enables extrapolation of the boundaries of the minority class and thus a more significant recognition of this class of interest. We present comparative experiments with other techniques, among them SMOTE, which provide strong evidence of the applicability of the proposed approach.

Keywords Imbalanced learning · Large margin classifiers · Oversampling · Synthetic sample generation
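The key idea summarized above, using minority-class support vectors as anchors for synthetic samples and allowing interpolation factors above one so that new points can cross the original minority boundary, can be sketched in code. The following is a minimal illustration only, assuming scikit-learn's SVC and NearestNeighbors; the function name, the neighbor selection and the extrapolation range are our own assumptions and do not reproduce the authors' exact ISBA procedure.

```python
# Illustrative sketch of support-vector-anchored oversampling.
# NOT the authors' ISBA algorithm: neighbor selection, extrapolation
# range and all names are assumptions made for this example only.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def generate_from_support_vectors(X, y, minority_label, n_new,
                                  extrapolation=1.5, random_state=0):
    """Create synthetic minority samples anchored on support vectors.

    Interpolation factors are drawn in [0, extrapolation]; values above 1
    place the new sample beyond the chosen minority neighbor, extrapolating
    the minority-class boundary.
    """
    rng = np.random.default_rng(random_state)
    clf = SVC(kernel="linear").fit(X, y)

    # Support vectors belonging to the minority class.
    sv = clf.support_vectors_[y[clf.support_] == minority_label]
    X_min = X[y == minority_label]

    # Nearest minority-class neighbors of each minority support vector.
    nn = NearestNeighbors(n_neighbors=min(5, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(sv)

    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(sv))              # reference support vector
        j = rng.choice(idx[i])                 # one of its minority neighbors
        lam = rng.uniform(0.0, extrapolation)  # lam > 1 extrapolates past it
        new_samples.append(sv[i] + lam * (X_min[j] - sv[i]))
    return np.vstack(new_samples)

# Example usage on a toy imbalanced dataset (illustrative only):
# X = np.vstack([np.random.randn(100, 2), np.random.randn(10, 2) + 2.0])
# y = np.array([0] * 100 + [1] * 10)
# X_new = generate_from_support_vectors(X, y, minority_label=1, n_new=90)
```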

1 Introduction

Every day, vast amounts of data are generated and stored in databases all over the world, driven by the development of new technologies across the different branches of industry, research and business. The data available for the learning process is directly responsible for the performance of the generalization hypothesis obtained by the adopted predictive model. This influence acts at several levels of the learning task, depending on the data volume, the data distribution over the attribute space and, mainly, the balance of class cardinalities.

According to Marsland [1], the learning process can be defined as improving the execution of a task by means of practice. Thus, machine learning is considered the rational use of the information present in a dataset to enhance the performance of an algorithm, and it is commonly categorized into four main groups: supervised, unsupervised, reinforcement and evolutionary learning. This work treats supervised learning in the context of the classification problem, which aims to separate samples by determining an induction hypothesis. In the last decades, several classification algorithms were proposed and evaluated, e.g., decision trees [2], approaches based on the Perceptron [3, 4], Radial Basis Function (RBF) networks [5] and Support Vector Machines (SVM) [6], among others. These algorithms, despite presenting reasonable predictive potential in a wide range of applications, in some cases obtain suboptimal results due to some properties and patterns of the related dataset sub