Comparing clusterings using combination of the kappa statistic and entropy-based measure


Evženie Uglickich¹ · Ivan Nagy¹,² · Dominika Vlčková³

Received: 8 March 2019 / Accepted: 7 November 2019 / Published online: 16 November 2019 © Sapienza Università di Roma 2019

Abstract

This paper addresses the problem of comparing clusterings with the same number of clusters obtained by different clustering algorithms. It proposes a method for evaluating the agreement of clusterings based on a combination of Cohen's kappa statistic and the normalized mutual information. The main contributions of the proposed approach are: (i) reliable use in practice for a small fixed number of clusters, (ii) suitability for comparing clusterings with a higher number of clusters, in contrast with the original statistics, and (iii) independence of the size of the data set and the shape of the clusters. The proposed statistic is validated experimentally on both simulated and real data sets and compared with its theoretical counterparts.

Keywords Comparing clusterings · Clusters agreement · κmax statistic · Normalized mutual information

¹ Department of Signal Processing, The Czech Academy of Sciences, Institute of Information Theory and Automation, Pod vodárenskou věží 4, 18208 Prague, Czech Republic

² Faculty of Transportation Sciences, Czech Technical University, Na Florenci 25, 11000 Prague, Czech Republic

³ Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Břehová 7, 11519 Prague, Czech Republic

1 Introduction

This paper deals with evaluating the agreement of clusters produced by different methods of cluster analysis. Cluster analysis, also known as unsupervised learning [18], is a highly demanded branch of data mining. It provides a considerable number of algorithms for sorting data with similar attributes into groups called clusters; see, e.g., [11,16,55]. Clustering is required in many application fields of multivariate data analysis, including, but not limited to, bioinformatics [24,25], social fields [23,40], transportation sciences [46,48], fault detection [21,38], and big data [13].

Traditionally, clustering approaches are divided into hierarchical methods (divisive or agglomerative), e.g., [10,12], and partitioning methods, which are further divided (with possible overlap) into:

• centroid-based methods, such as the famous k-means, developed more than 50 years ago [43] but still used in many extensions surveyed in [16], k-medoids [11], etc.;
• density-based methods, e.g., [34];
• grid-based methods, e.g., [39];
• clustering of high-dimensional data and constraint-based methods [11];
• model-based clustering methods, such as conceptual clustering [11], neural network approach
