An ensemble approach to outlier detection using some conventional clustering algorithms
- PDF / 892,619 Bytes
- 25 Pages / 439.37 x 666.142 pts Page_size
- 49 Downloads / 312 Views
An ensemble approach to outlier detection using some conventional clustering algorithms Akash Saha 1 & Agneet Chatterjee 1 & Soulib Ghosh 1 & Neeraj Kumar 2,3
& Ram Sarkar
1
Received: 14 February 2020 / Revised: 8 August 2020 / Accepted: 13 August 2020 # Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Outlier detection is an important requirement in data mining and machine learning. When data mining and machine learning algorithms are applied on the datasets with outliers, it leads to erroneous conclusion about the data. Therefore, researchers have been working in this field to remove outliers from dataset so that meaningful information from the datasets can be retrieved. In this paper, we take a cluster based ensemble approach for outlier detection, the backbone of which are some conventional clustering algorithms. Keeping in mind the drawbacks of supervised and semi supervised learning, we have relied on unsupervised learning algorithms. For our cluster based ensemble approach, we use three clustering algorithms, namely K-means, K-means++, and Fuzzy C-means. Our model intelligently combines results from individual clustering algorithms, assigning probabilities to each data point in order to decide its belongingness to a certain cluster. We have proposed a technique to assign a membership value to a data point in case of hard clustering algorithms, as we want to keep the flexibility of combining hard and soft clustering algorithms. From the probabilities assigned by the ensemble model, we then identify the outliers from the dataset. After removing these data points from the dataset, we obtain better values of cluster validity indices, thus reaffirming that removal of outliers has resulted in more stringent clusters of data. We have used five different cluster validity indices in our work to measure the goodness of the clusters formed, considering eight widely used datasets for evaluation of the proposed model amongst which three are large datasets. We have noticed a significant improvement in the cluster validity indices after applying our outlier detection algorithm. The experimental results prove that the proposed method is empirically sound. Keywords Outlier detection . K-means . Fuzzy C-means . K-means++ . Ensemble approach
* Neeraj Kumar [email protected] Extended author information available on the last page of the article
Multimedia Tools and Applications
1 Introduction Outlier detection denotes the problem of probing data patterns that do not possess normal characteristics compared to other data patterns. Many terminologies are used to refer these type of anomalous data patterns like – outliers, anomalies, discordant observations, exceptions, faults, defects, aberrations, noise, errors, damage, surprise, novelty, peculiarities or contaminants in different application domains. The occurrences of such outliers are due to malicious activity (credit card or telecom fraud data), instrumentation error (data taken from defective component of any machine), change in the e
Data Loading...