Estimating the number of clusters via a corrected clustering instability
Jonas M. B. Haslbeck1 · Dirk U. Wulff2,3

Received: 4 September 2017 / Accepted: 12 March 2020
© The Author(s) 2020
Abstract
We improve instability-based methods for the selection of the number of clusters k in cluster analysis by developing a corrected clustering distance that removes the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R-package cstab.

Keywords Cluster analysis · k-means · Stability · Resampling
1 Introduction

A central problem in cluster analysis is selecting the number of clusters k. This problem is typically approached by assuming the existence of a true number of clusters k∗ that can be estimated via an objective function that defines the quality of a clustering. Different definitions have been proposed, and it is generally accepted that the usefulness of a definition depends on the clustering problem at hand (see e.g., Friedman et al. 2001; Hennig 2015).
Jonas M. B. Haslbeck
[email protected]
http://www.jonashaslbeck.com

Dirk U. Wulff
[email protected]
https://www.dirkwulff.org/
1 Psychological Methods Group, University of Amsterdam, Amsterdam, The Netherlands
2 Center for Cognitive and Decision Science, University of Basel, Basel, Switzerland
3 Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin, Germany
Most definitions characterize the quality of a clustering in terms of a distance metric that depends on the locations and cluster assignments of the clustered objects. Methods relying on such definitions select k by trading off the magnitude of the distance metric, or some transformation of it, against the magnitude of k. The most commonly used distance metric is the within-cluster dissimilarity W(k), i.e., the dissimilarity of within-cluster object pairs averaged across all clusters. When selecting k based on this metric, it is assumed that W(k) exhibits a kink at the true cluster number k = k∗. This is because adding clusters beyond k∗ decreases W(k) only by a relatively small amount, since new clusters are created from clusters that are already relatively homogeneous. All methods focusing on the distances between objects and clusters aim, in one way or another, to identify this kink. Two examples are the Gap statistic (Tibshirani et al. 2001) and the Jump statistic (Sugar and James 2003). Related metrics are the Silhouette statistic (Rousseeuw 1987), which is an index of cluster separation rather than variance, and a variant thereof, the Slope statistic (Fujita et al. 2014). In contrast, the approach investigated in this paper is based on clustering instability.
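To make the kink heuristic concrete, the following minimal R sketch computes the total within-cluster sum of squares, used here as a stand-in for W(k), over a sequence of k on simulated data with three clusters. The data and variable names are illustrative assumptions, and the sketch shows only the generic kink heuristic, not the corrected-instability method proposed in this paper.

# Illustrative sketch of the W(k) kink ("elbow") heuristic in base R
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))  # three separated clusters
kseq <- 2:10
# total within-cluster sum of squares for each k (proxy for W(k))
W <- sapply(kseq, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(kseq, W, type = "b", xlab = "k", ylab = "W(k)")
# W(k) drops steeply up to the true k (here 3) and flattens thereafter;
# the location of the kink suggests the estimated number of clusters.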