k -Means, Ward and Probabilistic Distance-Based Clustering Methods with Contiguity Constraint

PDF / 2,156,316 Bytes
40 Pages / 439.37 x 666.142 pts Page_size
24 Downloads / 164 Views

k-Means, Ward and Probabilistic Distance-Based Clustering Methods with Contiguity Constraint Andrzej Młodak 1,2 # The Author(s) 2020

Abstract

We analyze some possibilities of using contiguity (neighbourhood) matrix as a constraint in the clustering made by the k-means and Ward methods as well as by an approach based on distances and probabilistic assignments aimed at obtaining a solution of the multi-facility location problem (MFLP). That is, some special two-stage algorithms being the kinds of clustering with relational constraint are proposed. They optimize division of set of objects into clusters respecting the requirement that neighbours have to belong to the same cluster. In the case of the probabilistic d-clustering, relevant modification of its target function is suggested and studied. Versatile simulation study and empirical analysis verify the practical efficiency of these methods. The quality of clustering is assessed on the basis of indices of homogeneity, heterogeneity and correctness of clusters as well as the silhouette index. Using these tools and similarity indices (Rand, Peirce and Sokal and Sneath), it was shown that the probabilistic d-clustering can produce better results than Ward’s algorithm. In comparison with the k-means approach, the probabilistic d-clustering—although gives rather similar results—is more robust to creation of trivial (of which empty) clusters and produces less diversified (in replications, in terms of correctness) results than k-means approach, i.e. is more predictable from the point of view of the clustering quality. Keywords Contiguity constraint . k-Means method . Ward method . Probabilistic d-clustering . Correctness of clustering . Silhouette index . Rand index . Peirce index . Sokal and Sneath index

1 Introduction The cluster analysis, aimed at efficient classifying given multidimensional objects (i.e. objects described by many statistical features reflecting a composite social or economic phenomenon,

* Andrzej Młodak [email protected]

1

Statistical Office in Poznań, Branch in Kalisz, ul. Piwonicka 7–9, 62–800 Kalisz, Poland

2

The President Stanisław Wojciechowski State University of Applied Sciences in Kalisz, Inter-Faculty Department of Mathematics and Statistics, ul. Nowy Świat 4, 62–800 Kalisz, Poland

Journal of Classification

such as labour market, environmental protection, economic situation) to internally homogeneous and mutually heterogeneous classes, offers many interesting algorithms. They can be of various types, such as translational (when object can be translated from one predefined cluster to another until each object is located in cluster which centre is closest to it), hierarchical (when at each step the closest clusters are joined into one) and probabilistic (if the assignment of objects to clusters is defined by especially optimize probabilities). The first of the aforementioned classes is represented by the k-means method (see e.g. Ball and Hall (1967) or Everitt et al. (2011)), where k is the arbitrarily established number of cluster

Data Loading...

k -Means, Ward and Probabilistic Distance-Based Clustering Methods with Contiguity Constraint

Recommend Documents

k-Means Clustering

A Novel MapReduce Based k-Means Clustering

Fuzzy K-means clustering with fast density peak clustering on multivariate kernel estimator with evolutionary multimodal

Applying K-means Clustering and Genetic Algorithm for Solving MTSP

A Quantum-Inspired Genetic K-Means Algorithm for Gene Clustering

Clustering Analysis of Extreme Temperature Based on K-means Algorithm

Probabilistic inductive constraint logic

K-Means Clustering for Features Arrangement in Metagenomic Data Visualization

K-means tree: an optimal clustering tree for unsupervised learning

Deep learning and k-means clustering in heterotic string vacua with line bundles

k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values

An Analysis of K-Means, Particle Swarm Optimization and Genetic Algorithm with Data Clustering Technique