A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures


1 Department of Computer, Control, and Management Engineering, La Sapienza University of Rome, Rome, Italy
2 Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
3 Department of Mathematics and Computer Science, University of Perugia, Perugia, Italy
{valentina.franzoni,milani}@dmi.unipg.it

Abstract. The explosion of the Internet and the massive diffusion of mobile devices have led to the creation of a worldwide collaborative system, used daily by millions of users through search engines and application interfaces. New paradigms make it possible to calculate the similarity of terms using only the statistical information returned by a query, or additional features; in addition, older algorithms and measures have been applied to new domains and scopes to efficiently find word clusters on the Web. The problem of evaluating such techniques and algorithms in new domains therefore emerges, and highlights a still open field of experimentation. In this paper, preliminary tests have been carried out on different semantic proximity measures (average confidence, NGD, PMI, χ², PMING Distance), and several of the clustering algorithms most used in the literature (e.g. k-means, Expectation-Maximization, spectral clustering) have been compared for evaluating such measures. The suitability of the considered measures and methods for calculating semantic proximity was verified against the state of the art, and problems were identified, by comparing the measurement results to a ground truth provided by models of contextualized knowledge, clustering, and human perception of semantic relations, whose data have already been studied in the literature.

Keywords: Data mining · Clustering · Semantic evaluation · Semantic similarity · Information retrieval
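As an illustration of how a proximity measure can be computed from query statistics alone, the following minimal sketch evaluates two of the measures named above, NGD and PMI, in their standard textbook forms. The function names, the argument names (`fx`, `fy`, `fxy`, `n`), and the hit counts in the usage lines are hypothetical and chosen for illustration only; they are not taken from the paper's experiments, and the paper's own implementation may differ.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from raw hit counts.

    fx, fy: number of results for each term alone (assumed > 0)
    fxy:    number of results for both terms together
    n:      assumed total number of indexed documents
    """
    if fxy == 0:
        # Terms never co-occur: treat the distance as unbounded.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def pmi(fx: float, fy: float, fxy: float, n: float) -> float:
    """Pointwise Mutual Information (base 2) from raw hit counts."""
    if fxy == 0:
        # Terms never co-occur: the logarithm diverges to minus infinity.
        return -math.inf
    # P(x, y) / (P(x) * P(y)) with each probability estimated as count / n.
    return math.log2(fxy * n / (fx * fy))


# Hypothetical counts, for illustration only.
example_ngd = ngd(fx=1000, fy=100, fxy=10, n=100_000)
example_pmi = pmi(fx=100, fy=100, fxy=1, n=10_000)
```

A lower NGD and a higher PMI both indicate stronger association, so the two measures are not directly comparable without a normalization or ranking step.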

1 Introduction

One of the main problems that emerges in the classic approach to semantics is the difficulty of acquiring and maintaining ontologies and semantic annotations [1,2,32]. On the other hand, the flow of Web documents is continuously fuelled by the collaborative contribution [4,5] of millions of users. The existing semantic models are expressive enough; however, their basic limitation lies in the inability to manage the evolution of ontological models and content annotations, which are not taken into account in the model itself. The lack of automation capabilities and evolutionary maintenance [6] is highly relevant, especially for generating context-based semantic annotations or for focusing on specific social networks or repositories. Search engines, which continually explore the Web, are a natural source of semantic information on which to base a modern approach to semantic annotation [3,29,36]. A promising idea is that it is possible to generalize the semantic similarity, under the assumption that semantically similar terms behave similarly [7], and define collab

© Springer International Publishing Switzerland 2016. O. Gervasi et al. (Eds.): ICCSA 2016, Part V, LNCS 9790, pp. 438–452, 2016. DOI: 10.1007/978-3-319-42092-9_34
