Comparing clusterings using combination of the kappa statistic and entropy-based measure
Evženie Uglickich1 · Ivan Nagy1,2 · Dominika Vlčková3
Received: 8 March 2019 / Accepted: 7 November 2019 / Published online: 16 November 2019 © Sapienza Università di Roma 2019
Abstract The paper focuses on the problem of comparing clusterings with the same number of clusters obtained by different clustering algorithms. It proposes a method for evaluating the agreement of clusterings based on a combination of Cohen's kappa statistic and the normalized mutual information. The main contributions of the proposed approach are: (i) reliable use in practice for a small fixed number of clusters, (ii) suitability for comparing clusterings with a higher number of clusters, in contrast to the original statistics, and (iii) independence of the size of the data set and the shape of the clusters. Results of the experimental validation of the proposed statistic on both simulations and real data sets, as well as a comparison with the theoretical counterparts, are presented.
Keywords Comparing clusterings · Clusters agreement · κmax statistic · Normalized mutual information
1 Department of Signal Processing, The Czech Academy of Sciences, Institute of Information Theory and Automation, Pod vodárenskou věží 4, 18208 Prague, Czech Republic
2 Faculty of Transportation Sciences, Czech Technical University, Na Florenci 25, 11000 Prague, Czech Republic
3 Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University, Břehová 7, 11519 Prague, Czech Republic
1 Introduction This paper deals with the task of evaluating the agreement of clusters produced by different methods of cluster analysis. Cluster analysis is a highly demanded branch of data mining, also known as unsupervised learning [18]. It provides a considerable number of algorithms for sorting data with similar attributes into groups called clusters, see, e.g., [11,16,55]. Clustering is required in many application fields of multivariate data analysis, including bioinformatics [24,25], social fields [23,40], transportation sciences [46,48], fault detection [21,38], big data [13], and many others. Traditionally, clustering approaches are divided into hierarchical methods (divisive or agglomerative), e.g., [10,12], and partitioning methods, which are further divided (with possible overlap) into
• centroid-based methods, such as the well-known k-means, developed more than 50 years ago [43] but still used in many extensions surveyed in [16], k-medoids [11], etc.;
• density-based methods, e.g., [34];
• grid-based methods, e.g., [39];
• clustering of high-dimensional data and constraint-based methods [11];
• model-based clustering methods, such as conceptual clustering [11], neural network approach