Clustering of modal-valued symbolic data
Nataša Kejžar1 · Simona Korenjak-Černe2 · Vladimir Batagelj3,4,5
Received: 12 August 2014 / Revised: 20 August 2020 / Accepted: 12 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of all measurement types during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set or to the leaders obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights that enables clustering representatives to retain the same structure as if clustering only first-order units and (b) the selection of relative dissimilarities that produce more interpretable, i.e., meaningful, optimal clustering representatives. The usefulness of the proposed methods with adaptations was assessed and substantiated by carefully constructed simulation settings and demonstrated on three different real-world data sets, which gain in interpretability from the use of weights (population pyramids and ESS data) or relative dissimilarity (US patents data).
Keywords Symbolic objects · Leaders method · Hierarchical clustering · Ward’s method · Clustering demographic structures · United States Patents data set · European social survey data set
Mathematics Subject Classification 62H30 · 91C20 · 62-07 · 68T10
Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11634-020-00425-4) contains supplementary material, which is available to authorized users.
Extended author information available on the last page of the article.
N. Kejžar et al.
1 Introduction
In traditional data analysis a unit is usually described with a list of numerical, ordinal or nominal values of selected variables. In symbolic data analysis (SDA) a unit of a data set can be represented, for each variable, with a more detailed description than only a single value. Such structured descriptions are usually called symbolic objects (SOs) (Bock and Diday 2000; Billard and Diday 2006). A special type of SO consists of descriptions with frequency or probability distributions. In this way, we can simultaneously consider both single-value variables and variables with richer descriptions.
Computerization of data gathering worldwide has resulted in enormous data sets. The predefined aggregation (preclustering) of raw data is becoming a common method to preserve as much informati
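To make the idea of a modal-valued description concrete, the following minimal Python sketch aggregates a few first-order units into one description made of frequency and relative-frequency distributions, one per variable. It is an illustration only, not the authors' implementation: the function name `modal_description`, the toy records and the category lists are all hypothetical.

```python
from collections import Counter


def modal_description(units, categories):
    """Aggregate first-order units into one modal-valued description.

    units: list of dicts mapping a variable name to the observed category.
    categories: dict mapping a variable name to its ordered category list.
    Returns (counts, probs): for each variable, the frequency vector and
    the relative-frequency vector, both aligned with `categories`.
    """
    counts, probs = {}, {}
    for var, cats in categories.items():
        # Tally observed categories, skipping units that lack this variable.
        tally = Counter(u[var] for u in units if var in u)
        freq = [tally.get(c, 0) for c in cats]
        total = sum(freq)
        counts[var] = freq
        probs[var] = [f / total for f in freq] if total else [0.0] * len(cats)
    return counts, probs


# Hypothetical toy data: four persons aggregated into one second-order unit.
persons = [
    {"sex": "F", "age_group": "0-19"},
    {"sex": "M", "age_group": "20-59"},
    {"sex": "F", "age_group": "20-59"},
    {"sex": "F", "age_group": "60+"},
]
cats = {"sex": ["F", "M"], "age_group": ["0-19", "20-59", "60+"]}

counts, probs = modal_description(persons, cats)
print(counts)
print(probs)
```

Keeping the raw counts next to the relative frequencies retains the size of the aggregated unit, which is the kind of information a weighted approach could draw on. A numerical variable can be handled the same way once its values are assigned to intervals.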