A Semantic Comparison of Clustering Algorithms for the Evaluation of Web-Based Similarity Measures



1 Department of Computer, Control, and Management Engineering, La Sapienza University of Rome, Rome, Italy
2 Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong
3 Department of Mathematics and Computer Science, University of Perugia, Perugia, Italy
{valentina.franzoni,milani}@dmi.unipg.it

Abstract. The Internet explosion and the massive diffusion of mobile devices have led to the creation of a worldwide collaborative system, used daily by millions of users through search engines and application interfaces. New paradigms make it possible to calculate the similarity of terms using only the statistical information returned by a query, or using additional features; older algorithms and measures have also been applied to new domains and scopes to efficiently find word clusters from the Web. The problem of evaluating such techniques and algorithms in new domains thus emerges, and highlights a still open field of experimentation. In this paper, preliminary tests were conducted on different semantic proximity measures (average confidence, NGD, PMI, χ², PMING Distance), and different clustering algorithms among the most used in the literature (e.g. k-means, Expectation-Maximization, spectral clustering) were compared for evaluating such measures. The suitability of the considered measures and methods for calculating semantic proximity was verified against the state of the art, and problems were identified by comparing the measurement results to a ground truth provided by models of contextualized knowledge, clustering, and human perception of semantic relations, whose data have already been studied in the literature.

Keywords: Data mining · Clustering · Semantic evaluation · Semantic similarity · Information retrieval

© Springer International Publishing Switzerland 2016. O. Gervasi et al. (Eds.): ICCSA 2016, Part V, LNCS 9790, pp. 438–452, 2016. DOI: 10.1007/978-3-319-42092-9_34

1 Introduction

One of the main problems that emerge in the classic approach to semantics is the difficulty of acquiring and maintaining ontologies and semantic annotations [1,2,32]. On the other hand, the flow of Web documents is continuously fuelled by the collaborative contribution [4,5] of millions of users. The existing semantic models are expressive enough; their basic limitation, however, lies in the inability to manage the evolution of ontological models and content annotations, which is not taken into account in the model itself. The lack of automation capabilities and evolutionary maintenance [6] is highly relevant, especially for the generation of context-based semantic annotations or when focusing on specific social networks or repositories. Search engines, continually exploring the Web, are a natural source of semantic information on which to base a modern approach to semantic annotation [3,29,36]. A promising idea is that it is possible to generalize the semantic similarity, under the assumption that semantically similar terms behave similarly [7], and define collab