Statistical hierarchical clustering algorithm for outlier detection in evolving data streams


Dalibor Krleža¹ · Boris Vrdoljak¹ · Mario Brčić¹

Received: 16 September 2019 / Revised: 2 July 2020 / Accepted: 11 August 2020
© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2020

Abstract

Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse data streams is infeasible because of processing power and memory constraints. To solve this, the complexity of traditional clustering algorithms had to be reduced, which led to the creation of sequential clustering algorithms. The usual approach is two-phase clustering, which uses an online phase to reduce data detail and complexity, and an offline phase to cluster the concepts created in the online phase. Detecting anomalies in a data stream is usually done in the online phase, as it requires unreduced data. In contrast, good macro-clustering is produced in the offline phase, which is why two-phase clustering algorithms have difficulty being equally good at anomaly detection and macro-clustering. In this paper, we propose a statistical hierarchical clustering algorithm equally suitable for both detecting anomalies and macro-clustering. The proposed algorithm is single-phased and uses statistical inference on the input data stream, resulting in statistical distributions that are constantly updated. This makes the classification adaptable: it allows outliers to be agglomerated into clusters, tracks population evolution, and can be used without knowing the expected number of clusters and outliers. The proposed algorithm was tested against typical clustering algorithms, including two-phase algorithms suitable for data stream analysis. A number of typical test cases were selected to show the universality and qualities of the proposed clustering algorithm.

Keywords Big data · Clustering · Anomaly detection · Fraud detection

Editor: Joao Gama.

This research has been supported by the European Regional Development Fund under the Grant KK.01.1.1.01.0009 (DATACROSS).

* Dalibor Krleža
Boris Vrdoljak
Mario Brčić

¹ Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb, Croatia


Machine Learning

1 Introduction

Today, we create, collect, and process more data than ever before. This data holds many patterns of interest, most of which occur regularly. Finding these typical patterns can help to identify outliers, i.e., anomalies that occur sparsely. The more data is generated, the more patterns and outliers we are able to find, which leads to the big data paradigm, i.e., endless data streams that need to be continuously analysed in search of typical data patterns and outliers. Data clustering algorithms are one of many solutions that can be used to perform analysis of
