An overview of distance and similarity functions for structured data

  • PDF / 2,516,402 Bytes
  • 43 Pages / 439.37 x 666.142 pts Page_size
  • 24 Downloads / 222 Views

DOWNLOAD

REPORT


An overview of distance and similarity functions for structured data Santiago Ontañón1,2

© Springer Nature B.V. 2020

Abstract The notions of distance and similarity play a key role in many machine learning approaches, and artificial intelligence in general, since they can serve as an organizing principle by which individuals classify objects, form concepts and make generalizations. While distance functions for propositional representations have been thoroughly studied, work on distance functions for structured representations, such as graphs, frames or logical clauses, has been carried out in different communities and is much less understood. Specifically, a significant amount of work that requires the use of a distance or similarity function for structured representations of data usually employs ad-hoc functions for specific applications. Therefore, the goal of this paper is to provide an overview of this work to identify connections between the work carried out in different areas and point out directions for future work. Keywords  Distance · Similarity · Structured data · Relational learning

1 Introduction The complementary notions of distance and similarity play a key role in many machine learning approaches, such as instance-based learning  (Aha et  al. 1991), kernel-based methods  (Vert et  al. 2004), case-based reasoning  (Aamodt and Plaza 1994), or clustering algorithms  (Ng et  al. 2002; Kaufman and Rousseeuw 1987). Distance and similarity functions are also relevant for artificial intelligence (AI) in general, since they can serve as an organizing principle by which individuals classify objects, form concepts and make generalizations  (Tversky 1977). Specifically this paper presents an overview of distance and similarity functions for structured representations of data, such as graphs or frames. While distance functions for propositional (i.e. feature-vector) representations have been thoroughly studied in the past, work on distance functions for structured representations has been carried out in different communities such as graph matching, inductive logic programming, case-based reasoning, relational learning or graph mining and is much less * Santiago Ontañón [email protected]; [email protected] 1

Google Research, Mountain View, CA 94043, USA

2

Drexel University, Philadelphia, PA 19104, USA



13

Vol.:(0123456789)

S. Ontañón

understood. Specifically, a significant amount of work that requires the use of a distance or similarity function for structured representations of data usually employs ad-hoc functions. Therefore, the goal of this paper is to provide an overview of this work in order to have a complete view of the field of distance functions for structure representations, and lay foundations for future work. Structured data representations are important, since, there are many real-world application domains for which data of interest is inherently structured and it is hard to represent it using a propositional representation. Consider, for example, a biomedical domain where we are in