Clustering of modal-valued symbolic data


Nataša Kejžar1 · Simona Korenjak-Černe2 · Vladimir Batagelj3,4,5

Received: 12 August 2014 / Revised: 20 August 2020 / Accepted: 12 October 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of all measurement types during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set, or to leaders, obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights that enables clustering representatives to retain the same structure as if clustering only first order units and (b) the selection of relative dissimilarities that produce more interpretable, i.e., meaningful optimal clustering representatives. The usefulness of the proposed methods with adaptations was assessed and substantiated by carefully constructed simulation settings and demonstrated on three different real-world data sets gaining in interpretability from the use of weights (population pyramids and ESS data) or relative dissimilarity (US patents data).

Keywords Symbolic objects · Leaders method · Hierarchical clustering · Ward’s method · Clustering demographic structures · United States Patents data set · European social survey data set

Mathematics Subject Classification 62H30 · 91C20 · 62-07 · 68T10

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11634-020-00425-4) contains supplementary material, which is available to authorized users.

Extended author information available on the last page of the article


1 Introduction

In traditional data analysis a unit is usually described with a list of numerical, ordinal or nominal values of selected variables. In symbolic data analysis (SDA) a unit of a data set can be represented, for each variable, with a more detailed description than only a single value. Such structured descriptions are usually called symbolic objects (SOs) (Bock and Diday 2000; Billard and Diday 2006). A special type of SO consists of descriptions with frequency or probability distributions. In this way, we can simultaneously consider both single-value variables and variables with richer descriptions.

Computerization of data gathering worldwide has resulted in enormous data sets. The predefined aggregation (preclustering) of raw data is becoming a common method to preserve as much informati
