Fuzzy Clustering of High Dimensional Data with Noise and Outliers
Abstract Clustering high dimensional data is a challenging problem for fuzzy clustering algorithms because of the so-called concentration of distance phenomenon. Most fuzzy clustering algorithms fail on high dimensional data, producing cluster prototypes close to the center of gravity of the data set. The presence of noise and outliers in the data is an additional problem for clustering algorithms because they might affect the computation of cluster centers. In this paper, we analyze and compare different promising fuzzy clustering algorithms in order to examine their ability to correctly determine cluster centers on high dimensional data with noise and outliers. We analyze the performance of the clustering algorithms for different initializations of cluster centers: the original means of the clusters and random data points in the data space.

Keywords Fuzzy clustering · C-means models · High dimensional data · Noise · Possibilistic clustering
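The concentration of distance phenomenon named in the abstract can be illustrated with a minimal Python sketch. It is not part of the paper: the function name `relative_contrast`, the uniformly distributed random data, the sample size, and the seed are all assumptions chosen for illustration. The sketch measures the relative gap between the farthest and the nearest point from a query point for a given dimension.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points uniform random points in [0, 1]^dim.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the relative contrast for a few dimensions.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

For data of this kind the contrast shrinks as the dimension grows, so the nearest and the farthest point become almost equally distant. Under that assumption, distance-based membership degrees carry little information, which is consistent with the tendency of cluster prototypes to drift toward the center of gravity.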
L. Himmelspach (✉) · S. Conrad
Institute of Computer Science, Heinrich-Heine-Universität Düsseldorf, 40225 Düsseldorf, Germany
© Springer Nature Switzerland AG 2019. J. J. Merelo et al. (eds.), Computational Intelligence, Studies in Computational Intelligence 792, https://doi.org/10.1007/978-3-319-99283-9_11

1 Introduction Clustering algorithms are used in many fields, such as bioinformatics, image processing, and text mining. Data sets in these applications usually contain a large number of features. Therefore, there is a need for clustering algorithms that can handle high dimensional data. The hard k-means algorithm [1] is still the one most often used for clustering high dimensional data, although it is comparatively unstable and sensitive to initialization. It is also unable to distinguish data items belonging to clusters from noise and outliers. This is a further issue of the hard k-means algorithm because noise and outliers might influence the computation of cluster centers, leading to inaccurate clustering results.

In the case of low dimensional data, the fuzzy c-means algorithm (FCM) [2, 3], which assigns data items to clusters with membership degrees, might be a better choice because it is more stable and less sensitive to initialization [4]. The possibilistic fuzzy c-means algorithm (PFCM) [5] partitions data items in the presence of noise and outliers. However, when FCM is applied to high dimensional data, it tends to produce cluster centers close to the center of gravity of the entire data set [6, 7].

In this work, we analyze different fuzzy clustering methods that are suitable for clustering high dimensional data. The first approach is the attribute weighting fuzzy clustering algorithm [8], which uses a new attribute weighting function to determine the attributes that are important for each individual cluster. This method was recommended in [7] for fuzzy clustering of high dimensional data. The second approach is the multivariate fuzzy c-means (MFCM) [9] that c