An ensemble approach to outlier detection using some conventional clustering algorithms


Akash Saha 1 & Agneet Chatterjee 1 & Soulib Ghosh 1 & Neeraj Kumar 2,3 & Ram Sarkar 1

Received: 14 February 2020 / Revised: 8 August 2020 / Accepted: 13 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Outlier detection is an important requirement in data mining and machine learning. When data mining and machine learning algorithms are applied to datasets containing outliers, they can lead to erroneous conclusions about the data. Researchers have therefore been working on removing outliers from datasets so that meaningful information can be retrieved. In this paper, we take a cluster-based ensemble approach to outlier detection, the backbone of which is a set of conventional clustering algorithms. Keeping in mind the drawbacks of supervised and semi-supervised learning, we rely on unsupervised learning algorithms. For our cluster-based ensemble approach, we use three clustering algorithms, namely K-means, K-means++, and Fuzzy C-means. Our model combines the results of the individual clustering algorithms, assigning probabilities to each data point in order to decide its membership in a certain cluster. Because we want to keep the flexibility of combining hard and soft clustering algorithms, we propose a technique for assigning a membership value to a data point in the case of hard clustering algorithms. From the probabilities assigned by the ensemble model, we then identify the outliers in the dataset. After removing these data points from the dataset, we obtain better values of the cluster validity indices, reaffirming that the removal of outliers results in more stringent clusters of data. We use five different cluster validity indices to measure the goodness of the clusters formed, and we evaluate the proposed model on eight widely used datasets, three of which are large. We observe a significant improvement in the cluster validity indices after applying our outlier detection algorithm. The experimental results support the empirical soundness of the proposed method.

Keywords Outlier detection · K-means · Fuzzy C-means · K-means++ · Ensemble approach
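The abstract does not spell out how memberships are assigned or combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that each clustering algorithm has already produced a set of cluster centers, that a Fuzzy C-means style inverse-distance formula is an acceptable stand-in for the membership rule, and that a point is flagged when its strongest membership, averaged over the models, falls below a threshold. The names `soft_membership`, `ensemble_scores`, `flag_outliers`, the fuzzifier `m`, the threshold `0.6`, and the toy data are all hypothetical.

```python
import math

def soft_membership(point, centers, m=2.0):
    """Inverse-distance membership of `point` in each cluster (assumed rule).

    Uses the Fuzzy C-means membership formula with fuzzifier m, so it can
    also be applied to centers produced by a hard clustering algorithm.
    """
    dists = [math.dist(point, c) for c in centers]
    zeros = [d == 0.0 for d in dists]
    if any(zeros):
        # The point coincides with one or more centers: share full membership.
        n = sum(zeros)
        return [1.0 / n if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def ensemble_scores(points, center_sets, m=2.0):
    """Average, over all models, of each point's strongest membership."""
    scores = []
    for p in points:
        best = [max(soft_membership(p, centers, m)) for centers in center_sets]
        scores.append(sum(best) / len(best))
    return scores

def flag_outliers(points, center_sets, threshold=0.6):
    """Indices of points whose averaged strongest membership is below threshold."""
    scores = ensemble_scores(points, center_sets)
    return [i for i, s in enumerate(scores) if s < threshold]

# Toy data: two tight groups plus one point halfway between them.
points = [(0.2, 0.1), (-0.1, 0.3), (10.1, 9.8), (9.7, 10.2), (5.0, 5.0)]

# Hypothetical centers, as if returned by three different clustering runs.
center_sets = [
    [(0.0, 0.0), (10.0, 10.0)],
    [(0.1, 0.0), (10.0, 9.9)],
    [(0.0, 0.1), (9.9, 10.0)],
]

outliers = flag_outliers(points, center_sets)
print(outliers)
```

Taking the maximum membership per model before averaging avoids having to match cluster labels across models, since different runs may number the same cluster differently. A score of this kind also flags points that lie between clusters, so it is best read as an illustration of the combine-then-threshold idea rather than a complete detector.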

* Corresponding author: Neeraj Kumar

Multimedia Tools and Applications

1 Introduction

Outlier detection denotes the problem of finding data patterns that do not exhibit normal characteristics compared with other data patterns. Many terms are used to refer to these types of anomalous data patterns in different application domains: outliers, anomalies, discordant observations, exceptions, faults, defects, aberrations, noise, errors, damage, surprise, novelty, peculiarities, or contaminants. Such outliers occur due to malicious activity (credit card or telecom fraud data), instrumentation error (data taken from a defective component of a machine), change in the e
