usfAD: a robust anomaly detector based on unsupervised stochastic forest
ORIGINAL ARTICLE
Sunil Aryal¹ · K.C. Santosh² · Richard Dazeley¹
Received: 11 April 2020 / Accepted: 16 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
In real-world applications, data can be represented using different units/scales, for example, weight in kilograms or pounds and fuel-efficiency in km/l or l/100 km. One unit can be a linear or non-linear scaling of another. The variation in metrics due to non-linear scaling makes Anomaly Detection (AD) challenging. Most existing AD algorithms rely on distance- or density-based functions, which makes them sensitive to how data is expressed; that is, they are representation dependent. To avoid this problem, we introduce a new anomaly detection method, which we call 'usfAD: Unsupervised Stochastic Forest-based Anomaly Detector'. Our empirical evaluation on synthetic and real-world cybersecurity (spam detection, malicious URL detection and intrusion detection) datasets shows that our approach is more robust to the variation in units/scales used to express data. It produces more consistent and better results than five state-of-the-art AD methods, namely: local outlier factor; one-class support vector machine; isolation forest; nearest neighbor in a random subsample of data; and a simple histogram-based probabilistic method.
Keywords Measurement scales and units · Anomaly detection · Outlier detection · Robust anomaly detection · Intrusion detection · Spam detection · Cyber security
* K.C. Santosh [email protected]
Sunil Aryal [email protected]
Richard Dazeley [email protected]
¹ School of Information Technology, Deakin University, 75 Pigdons Rd, Waurn Ponds, VIC 3216, Australia
² Department of Computer Science, University of South Dakota, 414 E Clark St, Vermillion, SD 57069, USA
1 Introduction
1.1 Background
Anomalies (sometimes also referred to as outliers) are data instances that differ significantly from most of the other data, raising the suspicion that they were generated by a mechanism different from the normal or expected one [23]. Anomaly Detection (AD) is the task of detecting anomalies in a given dataset automatically using computers and algorithms [16]. It has many applications, such as [1]:
• Intrusion detection: Detecting unauthorised access requests and malicious activities in computer networks.
• Fraud detection: Detecting fraudulent and suspicious credit card and other financial transactions in banking.
• Spam detection: Detecting malicious and phishing emails in electronic communications.
Most existing anomaly detection algorithms [3, 4, 15, 26] assume that anomalies have feature values that are significantly different from those of normal instances. In other words, anomalies are few and different, and they lie in low-density regions.
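To make the representation dependence of distance-based scoring concrete, the following minimal sketch uses the fuel-efficiency example from the abstract. It is not the usfAD method or any of the compared detectors: the nearest-neighbour distance score, the function names, and the six data values are assumptions chosen purely for illustration. The same cars are expressed in km/l and in l/100 km (a non-linear rescaling), and the instance ranked as most anomalous changes.

```python
def nn_distance_scores(values):
    """Score each value by the distance to its nearest neighbour.

    A larger score means the value is more isolated. This is a toy
    stand-in for a distance-based anomaly score, not the usfAD method.
    """
    scores = []
    for i, v in enumerate(values):
        others = (abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(min(others))
    return scores


def top_anomaly_index(values):
    """Return the index of the value with the largest anomaly score."""
    scores = nn_distance_scores(values)
    return max(range(len(values)), key=scores.__getitem__)


# Hypothetical fuel-efficiency readings for six cars, in km/l.
km_per_l = [5.0, 10.0, 12.0, 14.0, 16.0, 30.0]

# The same cars expressed in l/100 km: x -> 100 / x is non-linear.
l_per_100km = [100.0 / x for x in km_per_l]

idx_kml = top_anomaly_index(km_per_l)
idx_l100 = top_anomaly_index(l_per_100km)

print("Top anomaly using km/l:    car", idx_kml, "at", km_per_l[idx_kml], "km/l")
print("Top anomaly using l/100km: car", idx_l100, "at", km_per_l[idx_l100], "km/l")
```

In km/l the most efficient car (30 km/l) is the most isolated, whereas in l/100 km the least efficient car (5 km/l) is, even though the underlying data are identical. A linear rescaling such as kilograms to pounds would multiply every distance by the same constant and leave this ranking unchanged.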
2 Motivation
In real-world applications, features of data objects can be measured in different units or recorded in different scales [5, 6, 19, 3