Measures of Dispersion and Cluster-Trees for Categorical Data

A clustering algorithm, in essence, is characterized by two features: (1) the way in which the heterogeneity within, resp. between, clusters is measured (the objective function); (2) the steps by which the splitting, resp. merging, proceeds. For categorical data …



Data sets in business administration show attributes of mixed type, i.e. numerical as well as categorical ones. The classical text-book advice for clustering data of this kind can be summarized as follows: a) Measure (dis-)similarities among the attribute vectors separately on the basis of either kind of attribute and unite the two resulting numbers in a (possibly weighted) sum. b) In order to deal with the categorical attributes, encode them in a suitable (binary) way and look for coincidences all over the resulting vectors; condense your findings with the help of one of the numerous existing matching coefficients (cf. Fahrmeir et al. (1996), p. 453). This advice, however, is bad policy for at least two reasons. Treating both parts of the attribute vectors separately amounts to assuming that the two groups of variables are independent, which can be claimed only in exceptional cases. And by looking for bit-wise coincidences, as in step b), one completely loses contact with the individual attributes, which is statistically undesirable as well. For these reasons it seems less harmful to categorize the numerical quantities and to deal with all variables simultaneously, while avoiding matching coefficients and the like.
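For concreteness, the two-part recipe criticized above can be written down in a few lines. The following is a minimal sketch, not taken from the paper: the function name, the fixed weight, and the Euclidean/simple-matching choices are illustrative assumptions.

    import numpy as np

    def textbook_dissimilarity(x_num, y_num, x_cat_bits, y_cat_bits, w=0.5):
        """Classical two-part recipe (hypothetical illustration): a distance on
        the numerical attributes plus a matching-based dissimilarity on the
        binary-encoded categorical attributes, united in a weighted sum."""
        # numerical part: Euclidean distance
        d_num = np.linalg.norm(np.asarray(x_num, dtype=float) -
                               np.asarray(y_num, dtype=float))
        # categorical part: bit-wise coincidences over the 0/1 dummy coding,
        # condensed into a simple matching coefficient (share of matching bits)
        bits_x, bits_y = np.asarray(x_cat_bits), np.asarray(y_cat_bits)
        d_cat = 1.0 - np.mean(bits_x == bits_y)
        # (possibly weighted) sum of both parts
        return w * d_num + (1.0 - w) * d_cat

    # two observations: numerical part and dummy-coded categorical part
    d = textbook_dissimilarity([1.2, 0.4], [0.9, 1.1], [1, 0, 0, 1], [1, 0, 1, 0])

The sketch also makes the two objections visible: the weight w treats the two blocks as if they were unrelated, and the mean over bits no longer knows which bits belong to which attribute.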

During the last decade roughly a dozen agglomerative or partitioning cluster algorithms for categorical data have been proposed, quite a few based on the concept of entropy. Examples include "COOLCAT" (Barbará et al. (2002)) or "LIMBO" (Andritsos et al. (2004)). These approaches, no doubt, have their merits. For various reasons, however, it would be advantageous to rely on a divisive, tree-structured technique that a) supports the choice of relevant factors, b) helps to identify the resulting clusters and renders the device communicable to practitioners. In other words, we favour an unsupervised analogue to CART or CHAID. That type of procedure, furthermore, facilitates the use of prior information at the attribute level, as will be seen in Section 3. Within that context, comparisons of attributes should no longer be based on similarity measures but on quantities that allow for a model equivalent and can, accordingly, be related to the underlying probability source. For that purpose we shall work out the concept of "dispersion" in Section 2 and discuss starting points for cluster algorithms in Section 3. The material in Section 2 may bewilder some readers, as it seems that "somebody should have written down something like that a long time ago". Despite some effort, however, no source in the literature could be spotted.

There is another important aspect that has to be addressed. Categorical data is typically organized in the form of tables or cubes. Obviously, the number of cells increases exponentially with the number of factors taken into consideration. This, in turn, results in many empty or sparsely populated cells and renders the analysis meaningless. In order to circumvent this difficulty, some form of "sequential sub-cube clustering" is needed (and will be reported elsewhere).
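To make the entropy-based objective mentioned above concrete, the following sketch computes a within-cluster heterogeneity of the kind that COOLCAT-style algorithms try to keep small. It is only an illustration of the generic criterion, not the dispersion measures developed in Section 2; the function names and the toy data are assumptions.

    import numpy as np
    from collections import Counter

    def attribute_entropy(values):
        """Shannon entropy of a single categorical attribute within a cluster."""
        counts = np.array(list(Counter(values).values()), dtype=float)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    def cluster_entropy(rows):
        """Sum of per-attribute entropies; entropy-based algorithms assign
        objects so that the (expected) value of this quantity stays small."""
        return sum(attribute_entropy(col) for col in zip(*rows))

    # two toy clusters over three categorical attributes
    homogeneous = [("a", "x", "low"), ("a", "x", "low"), ("a", "y", "low")]
    mixed       = [("a", "x", "low"), ("b", "y", "high"), ("c", "z", "mid")]
    print(cluster_entropy(homogeneous) < cluster_entropy(mixed))  # True

The toy data also hints at the sparsity problem raised above: three attributes with three levels each already span 3^3 = 27 cells, so even a moderate number of factors leaves most cells empty.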

2 Measures of dispersion