Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic
Seunghyun Park · Hyunhee Park
Received: 13 September 2020 / Accepted: 12 October 2020
© Springer-Verlag GmbH Austria, part of Springer Nature 2020
Abstract
Network traffic data typically comprise a large amount of normal traffic and a small amount of attack traffic. This imbalance between the two types of data reduces prediction performance, for example through biased predictions for the minority data and misclassification of normal data as outliers. Representative sampling methods for addressing the imbalance problem include various minority data synthesis models based on oversampling. However, because oversampling involves repeatedly learning the same data, the classification model can overfit the training data. Undersampling methods, in turn, can cause information loss because they remove data. To improve on these oversampling and undersampling approaches, we propose an oversampling ensemble method based on the slow-start algorithm. The proposed combined oversampling and undersampling method based on the slow-start (COUSS) algorithm is modeled on the congestion control algorithm of the transmission control protocol (TCP). Accordingly, starting from a dataset to which minimal undersampling has been applied, the imbalanced dataset is oversampled until overfitting occurs. Simulation results obtained using the KDD99 dataset show that the proposed COUSS method improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% compared to the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling, and generative adversarial network oversampling algorithms, respectively. The COUSS method can therefore be regarded as a practical solution in data analysis applications.

Keywords Machine learning · Oversampling · Undersampling · Imbalanced data · TCP · KDD99

Mathematics Subject Classification 68T20 · 68P01 · 68M20 · 65Y04
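The abstract's core idea, growing the amount of oversampling in the manner of TCP slow start until overfitting appears, can be illustrated with a minimal sketch. This is not the authors' COUSS implementation: the function `slow_start_schedule`, its parameters (`initial`, `ssthresh`, `step`, `max_rounds`), the `evaluate` callback, and the toy scoring function are all hypothetical names introduced here. The sketch assumes that a drop in a validation score (such as F1) plays the role that packet loss plays in TCP.

```python
from typing import Callable, List, Tuple


def slow_start_schedule(
    evaluate: Callable[[int], float],
    initial: int = 1,
    ssthresh: int = 16,
    step: int = 4,
    max_rounds: int = 20,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Grow the oversampling amount TCP-style until the score stops improving.

    evaluate(n) is a hypothetical callback: it is assumed to train a model
    on the data with n synthetic minority samples added and to return a
    validation score. A score that fails to improve is treated as the
    overfitting signal (an assumption of this sketch).
    """
    best_amount, best_score = 0, evaluate(0)  # baseline without oversampling
    history = [(0, best_score)]
    amount = initial
    for _ in range(max_rounds):
        score = evaluate(amount)
        history.append((amount, score))
        if score <= best_score:
            break  # no improvement: stop and keep the previous best amount
        best_amount, best_score = amount, score
        if amount < ssthresh:
            amount *= 2  # slow-start phase: exponential growth
        else:
            amount += step  # congestion-avoidance-like phase: additive growth
    return best_amount, history


def toy_score(n: int) -> float:
    """Stand-in for a real train-and-validate step; peaks at n = 20."""
    return 1.0 - abs(n - 20) / 100


best, history = slow_start_schedule(toy_score)
print(best)                       # 20
print([n for n, _ in history])    # [0, 1, 2, 4, 8, 16, 20, 24]
```

In practice, `evaluate` would wrap an actual oversampler and classifier. The exponential-then-additive schedule simply limits how many training runs are needed before the overfitting point is found.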
Corresponding author: Hyunhee Park
1 Introduction
Classification is the problem of predicting the class of input data. To solve a classification problem, a machine-learning algorithm is trained on a given dataset. Ideally, the dataset has a uniform distribution over the classes to be predicted. However, most datasets contain different amounts of data for each class, and in severe cases the data can be concentrated in a single class. Because machine-learning algorithms assume that each class has a similar proportion, training on an unbalanced dataset is biased toward the classes that take up large proportions instead of being performed properly for all the data [1]. Thus, general machine-learning algorithms perform training under the assumption that the training data are composed of classes with similar proportio