Combined oversampling and undersampling method based on slow-start algorithm for imbalanced network traffic


Seunghyun Park¹ · Hyunhee Park²
Received: 13 September 2020 / Accepted: 12 October 2020
© Springer-Verlag GmbH Austria, part of Springer Nature 2020

Abstract
Network traffic data typically consist of a large amount of normal traffic and a small amount of attack traffic. This imbalance between the two types of data reduces prediction performance, for example through biased predictions for the minority data and misclassification of normal data as outliers. Representative sampling methods for the imbalance problem include various minority data synthesis models based on oversampling. However, because oversampling involves repeatedly learning the same data, the classification model can overfit the training data. Undersampling methods, in turn, can cause information loss because they remove data. To improve on these oversampling and undersampling approaches, we propose an oversampling ensemble method based on the slow-start algorithm. The proposed combined oversampling and undersampling method based on the slow-start (COUSS) algorithm follows the congestion control algorithm of the transmission control protocol (TCP): starting from a dataset to which minimal undersampling has been applied, the imbalanced dataset is oversampled until overfitting occurs. Simulation results obtained using the KDD99 dataset show that the proposed COUSS method improves the F1 score by 8.639%, 6.858%, 5.003%, and 4.074% compared to the synthetic minority oversampling technique (SMOTE), borderline-SMOTE, adaptive synthetic sampling, and generative adversarial network oversampling algorithms, respectively. The COUSS method can therefore serve as a practical solution in data analysis applications.

Keywords Machine learning · Oversampling · Undersampling · Imbalanced data · TCP · KDD99

Mathematics Subject Classification 68T20 · 68P01 · 68M20 · 65Y04
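The slow-start idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' COUSS implementation: the function name `slow_start_oversampling`, the `evaluate` callback, the doubling rule, and the use of the first score drop as the overfitting signal are all assumptions made for illustration, loosely modeled on how TCP slow start doubles its congestion window until loss is detected.

```python
from typing import Callable, List, Tuple


def slow_start_oversampling(
    evaluate: Callable[[int], float],
    initial_window: int = 1,
    max_window: int = 1024,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Grow the amount of oversampling the way TCP slow start grows its window.

    `evaluate(n)` is a hypothetical callback: it is assumed to train a
    classifier with n synthetic minority samples added to a minimally
    undersampled dataset and to return a validation F1 score.
    The window doubles while the score improves; the first non-improving
    round is treated as the overfitting signal (an assumption of this sketch).
    """
    best_window, best_score = 0, evaluate(0)  # baseline: no oversampling
    history = [(0, best_score)]
    window = initial_window
    while window <= max_window:
        score = evaluate(window)
        history.append((window, score))
        if score <= best_score:
            break  # score stopped improving: stop growing the window
        best_window, best_score = window, score
        window *= 2  # exponential growth, as in slow start
    return best_window, history


def toy_f1(n: int) -> float:
    """Made-up validation curve that peaks at 64 synthetic samples."""
    return 0.9 - abs(n - 64) / 1000


best, history = slow_start_oversampling(toy_f1)
print(best, [w for w, _ in history])
```

With the made-up `toy_f1` curve, the window grows through 1, 2, 4, and so on, and the loop stops at the first window whose score is no better than the best seen so far. In a real experiment, `evaluate` would wrap an oversampler and a classifier; those parts are deliberately left out here.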


Hyunhee Park [email protected]



1 Introduction

Classification is the problem of predicting the class of input data. To solve it, a machine-learning algorithm is trained on a given dataset. Ideally, the classes to be predicted are uniformly distributed in the dataset. However, most datasets contain different amounts of data for each class, and in severe cases the data are concentrated in a single class. Because machine-learning algorithms assume that the classes have similar proportions, training on an unbalanced dataset is biased toward the classes that make up large proportions instead of being performed properly for all the data [1]. Thus, general machine-learning algorithms perform training under the assumption that the training data are composed of classes with similar proportio
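The bias toward large classes described in this paragraph can be made concrete with a small sketch. The 990:10 class split and the always-predict-majority classifier below are invented for illustration and do not come from the paper; the helper `f1_score` is a plain-Python stand-in, not a library call.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for the `positive` class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical dataset: 990 normal samples (0) and 10 attack samples (1).
y_true = [0] * 990 + [1] * 10
# A classifier biased entirely toward the majority class.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_f1 = f1_score(y_true, y_pred, positive=1)
print(accuracy, minority_f1)
```

Under these assumptions the accuracy looks high even though no attack sample is detected, which is why the F1 score on the minority class is the more informative measure for imbalanced traffic data.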
