The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data



Justin M. Johnson1 · Taghi M. Khoshgoftaar1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Training predictive models with class-imbalanced data has proven to be a difficult task. This problem is well studied, but the era of big data is producing more extreme levels of imbalance that are increasingly difficult to model. We use three data sets of varying complexity to evaluate data sampling strategies for treating high class imbalance with deep neural networks and big data. Sampling rates are varied to create training distributions with positive class sizes ranging from 0.025% to 90%. The area under the receiver operating characteristic curve (AUC) is used to compare performance, and thresholding is used to maximize class performance. Random over-sampling (ROS) consistently outperforms random under-sampling (RUS) and baseline methods. The majority class proves susceptible to misrepresentation when using RUS, and results suggest that each data set is uniquely sensitive to imbalance and sample size. The hybrid ROS-RUS maximizes performance and efficiency, and it is our preferred method for treating high imbalance within big data problems.

Keywords Class imbalance · Big data · Data sampling · Artificial neural networks · Deep learning
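The paper's own sampling implementation is not reproduced in this excerpt. As a minimal NumPy sketch, the three strategies named in the abstract can be expressed as index resampling over a binary label vector; the function name `rebalance_indices` and the 50:50 target distribution are illustrative assumptions (the study itself varies positive class sizes from 0.025% to 90%).

```python
import numpy as np

def rebalance_indices(y, method="ros", rng=None):
    """Return row indices giving a balanced (50:50) class distribution.

    method="ros":     duplicate minority samples (with replacement) up to
                      the majority size -- no information is discarded.
    method="rus":     randomly drop majority samples down to the minority
                      size -- fast, but discards majority-class data.
    method="ros-rus": meet in the middle -- grow the minority and shrink
                      the majority to the mean of the two class sizes.
    """
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == 1)  # minority (positive) class
    neg = np.flatnonzero(y == 0)  # majority (negative) class
    if method == "ros":
        n_pos = n_neg = len(neg)
    elif method == "rus":
        n_pos = n_neg = len(pos)
    elif method == "ros-rus":
        n_pos = n_neg = (len(pos) + len(neg)) // 2
    else:
        raise ValueError(f"unknown method: {method}")
    idx = np.concatenate([
        rng.choice(pos, size=n_pos, replace=n_pos > len(pos)),
        rng.choice(neg, size=n_neg, replace=False),
    ])
    rng.shuffle(idx)
    return idx

# Toy 80:20 distribution, matching the introduction's ratio notation.
y = np.array([0] * 400 + [1] * 100)
idx = rebalance_indices(y, method="ros-rus", rng=0)
print(len(idx), y[idx].mean())  # prints: 500 0.5
```

Note that ROS-RUS yields the smallest balanced training set that still over-samples the minority class, which is why the abstract describes it as maximizing both performance and efficiency.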

 Justin M. Johnson
[email protected]

Taghi M. Khoshgoftaar
[email protected]

1 Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431, USA

1 Introduction

Class imbalance occurs when one class, the majority group, is significantly larger than the opposing minority class. This phenomenon arises in many critical industries, e.g. medical (Rao et al. 2006), financial (Wei et al. 2013), and environmental (Kubat et al. 1998). In these examples, the minority group is also the positive class, or the class of interest. When class imbalance exists within training data, learners will typically over-classify the majority group due to its increased prior probability. As a result, the instances belonging to the minority group are misclassified more often than those belonging to the majority group. Furthermore, some evaluation metrics, such as accuracy, may mislead an analyst with high scores that incorrectly indicate good performance. For example, given a binary data set with a positive class size of just 1%, a simple learner that always outputs the negative class will score 99% accuracy. Learning from these class-imbalanced data sets can be very difficult, and advanced learning methods are usually required to obtain meaningful results.

In this paper, we denote the level of class imbalance in a binary data set using the ratio of negative samples to positive samples, i.e. n_neg : n_pos. For example, the imbalance of a data set with 400 negative samples and 100 positive instances is denoted by 80:20. Equivalently, we sometimes refer to a binary data set using just the size of the positive class, e.g. a 20% positive class size. In Section 2, we adopt the notation of Buda et al. (2018) and denote le