Estimating the number of clusters via a corrected clustering instability
Jonas M. B. Haslbeck1 · Dirk U. Wulff2,3

Received: 4 September 2017 / Accepted: 12 March 2020
© The Author(s) 2020
Abstract
We improve instability-based methods for the selection of the number of clusters k in cluster analysis by developing a corrected clustering distance that removes the unwanted influence of the distribution of cluster sizes on cluster instability. We show that our corrected instability measure outperforms current instability-based measures across the whole sequence of possible k, overcoming limitations of current instability-based methods for large k. We also compare, for the first time, model-based and model-free approaches to determining cluster instability and find their performance to be comparable. We make our method available in the R-package cstab.

Keywords Cluster analysis · k-means · Stability · Resampling
1 Introduction

A central problem in cluster analysis is selecting the number of clusters k. This problem is typically approached by assuming the existence of a true number of clusters k∗ that can be estimated via an objective function that defines the quality of a clustering. Different definitions have been proposed, and it is generally accepted that the usefulness of a definition depends on the clustering problem at hand (see e.g., Friedman et al. 2001; Hennig 2015).
Jonas M. B. Haslbeck
[email protected]
http://www.jonashaslbeck.com

Dirk U. Wulff
[email protected]
https://www.dirkwulff.org/
1 Psychological Methods Group, University of Amsterdam, Amsterdam, The Netherlands
2 Center for Cognitive and Decision Science, University of Basel, Basel, Switzerland
3 Center for Adaptive Rationality, Max Planck Institute for Human Development, Berlin, Germany
Most definitions characterize the quality of a clustering in terms of a distance metric that depends on the locations and cluster assignments of the clustered objects. Methods relying on such definitions select k by trading off the magnitude of the distance metric, or some transformation of it, against the magnitude of k. The most commonly used distance metric is the within-cluster dissimilarity W(k), i.e., the dissimilarity of within-cluster object pairs averaged across all clusters. When selecting k based on this metric, it is assumed that W(k) exhibits a kink at the true cluster number k = k∗. This is because adding clusters beyond k∗ decreases W(k) only by a relatively small amount, since new clusters are created from clusters that are already relatively homogeneous. All methods focusing on the distances between objects and clusters aim, in one way or another, to identify this kink. Two examples are the Gap statistic (Tibshirani et al. 2001) and the Jump statistic (Sugar and James 2003). Related metrics are the Silhouette statistic (Rousseeuw 1987), which is an index of cluster separation rather than variance, and a variant thereof, the Slope statistic (Fujita et al. 2014). In contrast, the approach investigated in this paper is based on clustering instability.
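To make the kink heuristic concrete, the following minimal R sketch computes the total within-cluster sum of squares, used here as a stand-in for W(k), over a sequence of k on simulated data with three clusters. The data and variable names are illustrative assumptions, and the sketch shows only the generic kink heuristic, not the corrected-instability method proposed in this paper.

# Illustrative sketch of the W(k) kink ("elbow") heuristic in base R
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           matrix(rnorm(100, mean = 8), ncol = 2))  # three separated clusters
kseq <- 2:10
# total within-cluster sum of squares for each k (proxy for W(k))
W <- sapply(kseq, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
plot(kseq, W, type = "b", xlab = "k", ylab = "W(k)")
# W(k) drops steeply up to the true k (here 3) and flattens thereafter;
# the location of the kink suggests the estimated number of clusters.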