ConDist: A Context-Driven Categorical Distance Measure


1 Faculty of Electrical Engineering and Informatics, Coburg University of Applied Sciences and Arts, 96450 Coburg, Germany {markus.ring,florian.otto,dieter.landes}@hs-coburg.de
2 Data Mining and Information Retrieval Group, University of Würzburg, 97074 Würzburg, Germany {becker,niebler,hotho}@informatik.uni-wuerzburg.de

Abstract. A distance measure between objects is a key requirement for many data mining tasks like clustering, classification or outlier detection. However, for objects characterized by categorical attributes, defining meaningful distance measures is a challenging task since the values within such attributes have no inherent order, especially without additional domain knowledge. In this paper, we propose an unsupervised distance measure for objects with categorical attributes based on the idea that categorical attribute values are similar if they appear with similar value distributions on correlated context attributes. Thus, the distance measure is automatically derived from the given data set. We compare our new distance measure to existing categorical distance measures and evaluate on different data sets from the UCI machine-learning repository. The experiments show that our distance measure is recommendable, since it achieves similar or better results in a more robust way than previous approaches.

Keywords: Categorical data · Distance measure · Heterogeneous data · Unsupervised learning

1 Introduction

Distance calculation between objects is a key requirement for many data mining tasks like clustering, classification or outlier detection [13]. Objects are described by a set of attributes. For continuous attributes, the distance calculation is well understood and mostly the Minkowski distance is used [2]. For categorical attributes, defining meaningful distance measures is more challenging since the values within such attributes have no inherent order [4]. The absence of additional domain knowledge further complicates this task. However, several methods exist to address this issue. Some are based on simple approaches like checking for equality and inequality of categorical values, or creating a new binary attribute for each categorical value [2]. An obvious drawback of these two approaches is that they cannot reflect the degree of similarity or dissimilarity between two distinct categorical values. Yet, more sophisticated methods incorporate statistical information about the data [6–8]. In this paper, we take the latter approach. In contrast to previous work, we take into account the quality of information that can be extracted from the data, in the form of correlation between attributes. The resulting distance measure is called ConDist (Context based Categorical Distance Measure): We first derive a distance measure for each attribute separately. To this end, we take advantage of the fact that categorical attributes a

© Springer International Publishing Switzerland 2015. A. Appice et al. (Eds.): ECML PKDD 2015, Part I, LNAI 9284, pp. 251–266, 2015. DOI: 10.1007/978-3-319-23528-8_16
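The contrast drawn in the introduction, between a pure equality check and a distance that uses statistical information from the data, can be illustrated with a minimal Python sketch. It is not the ConDist definition, which is not given in this excerpt. It assumes a single context attribute, uses the total variation distance between conditional value distributions as a stand-in, and ignores any weighting by attribute correlation. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def overlap_distance(x, y):
    """Equality check: 0 if both categorical values match, else 1."""
    return 0.0 if x == y else 1.0


def conditional_distribution(rows, target, context):
    """Estimate P(context value | target value) from a list of dict rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[context]] += 1
    return {
        value: {c: n / sum(ctr.values()) for c, n in ctr.items()}
        for value, ctr in counts.items()
    }


def context_distance(dist, x, y):
    """Total variation distance between the context distributions of x and y.

    Assumption: this is an illustrative stand-in, not the paper's formula.
    """
    p, q = dist[x], dist[y]
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy data: "color" is the target attribute, "size" the context.
rows = [
    {"color": "red", "size": "large"},
    {"color": "red", "size": "large"},
    {"color": "orange", "size": "large"},
    {"color": "orange", "size": "small"},
    {"color": "blue", "size": "small"},
    {"color": "blue", "size": "small"},
]

dist = conditional_distribution(rows, target="color", context="size")

for a, b in [("red", "orange"), ("red", "blue"), ("orange", "blue")]:
    print(a, b, overlap_distance(a, b), context_distance(dist, a, b))
```

On this toy data the equality check assigns the same distance to every pair of distinct colors, whereas the distribution-based distance places "orange" between "red" and "blue", because its distribution over "size" overlaps with both.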
