Large margin classifiers to generate synthetic data for imbalanced datasets



Marcelo Ladeira Marques 1 · Saulo Moraes Villela 1 · Carlos Cristiano Hasenclever Borges 1

Corresponding author: Carlos Cristiano Hasenclever Borges, [email protected]
Marcelo Ladeira Marques, [email protected]
Saulo Moraes Villela, [email protected]

1 Department of Computer Science, Federal University of Juiz de Fora, Juiz de Fora, Minas Gerais, Brazil

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
In this paper, we propose an approach capable of improving the results obtained by classification algorithms when applied to imbalanced datasets. The method, called Incremental Synthetic Balancing Algorithm (ISBA), performs an iterative procedure based on large margin classifiers, aiming to generate synthetic samples in order to reduce the level of imbalance. In the process, we use the support vectors as references for the generation of new instances, allowing them to be positioned in regions with greater representativeness. Furthermore, the new samples can exceed the limits of the instances used for their generation, which enables extrapolation of the boundaries of the minority class and thus a more significant recognition of this class of interest. We present comparative experiments with other techniques, among them SMOTE, which provide strong evidence of the applicability of the proposed approach.

Keywords Imbalanced learning · Large margin classifiers · Oversampling · Synthetic sample generation
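The key idea summarized above, using minority-class support vectors as anchors for synthetic samples and allowing interpolation factors above one so that new points can cross the original minority boundary, can be sketched in code. The following is a minimal illustration only, assuming scikit-learn's SVC and NearestNeighbors; the function name, the neighbor selection and the extrapolation range are our own assumptions and do not reproduce the authors' exact ISBA procedure.

```python
# Illustrative sketch of support-vector-anchored oversampling.
# NOT the authors' ISBA algorithm: neighbor selection, extrapolation
# range and all names are assumptions made for this example only.
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def generate_from_support_vectors(X, y, minority_label, n_new,
                                  extrapolation=1.5, random_state=0):
    """Create synthetic minority samples anchored on support vectors.

    Interpolation factors are drawn in [0, extrapolation]; values above 1
    place the new sample beyond the chosen minority neighbor, extrapolating
    the minority-class boundary.
    """
    rng = np.random.default_rng(random_state)
    clf = SVC(kernel="linear").fit(X, y)

    # Support vectors belonging to the minority class.
    sv = clf.support_vectors_[y[clf.support_] == minority_label]
    X_min = X[y == minority_label]

    # Nearest minority-class neighbors of each minority support vector.
    nn = NearestNeighbors(n_neighbors=min(5, len(X_min))).fit(X_min)
    _, idx = nn.kneighbors(sv)

    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(sv))              # reference support vector
        j = rng.choice(idx[i])                 # one of its minority neighbors
        lam = rng.uniform(0.0, extrapolation)  # lam > 1 extrapolates past it
        new_samples.append(sv[i] + lam * (X_min[j] - sv[i]))
    return np.vstack(new_samples)

# Example usage on a toy imbalanced dataset (illustrative only):
# X = np.vstack([np.random.randn(100, 2), np.random.randn(10, 2) + 2.0])
# y = np.array([0] * 100 + [1] * 10)
# X_new = generate_from_support_vectors(X, y, minority_label=1, n_new=90)
```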

1 Introduction

Every day, vast amounts of data are generated and stored in databases all over the world, driven by the development of new technologies across the different branches of industry, research and business. The data available for the learning process is directly responsible for the performance of the generalization hypothesis obtained by the adopted predictive model. This influence acts at several levels of the learning task, depending on the data volume, the data distribution over the attribute space and, mainly, the balance of class cardinalities.

According to Marsland [1], the learning process can be defined as improving the execution of a task by means of practice. Thus, machine learning is considered the rational use of the information present in a dataset to enhance the performance of an algorithm, and it is commonly categorized into four main groups: supervised, unsupervised, reinforcement and evolutionary learning. This work treats supervised learning in the context of the classification problem, which aims to separate samples by determining an induction hypothesis. In the last decades, several classification algorithms were proposed and evaluated, e.g., decision trees [2], approaches based on the Perceptron [3, 4], Radial Basis Function (RBF) networks [5] and Support Vector Machines (SVM) [6], among others. These algorithms, despite presenting reasonable predictive potential in a wide range of applications, in some cases obtain suboptimal results due to some properties and patterns of the related dataset sub