Efficient processing of graph similarity queries with edit distance constraints

  • PDF / 949,036 Bytes
  • 26 Pages / 595.276 x 790.866 pts Page_size
  • 30 Downloads / 230 Views

DOWNLOAD

REPORT


REGULAR PAPER

Efficient processing of graph similarity queries with edit distance constraints Xiang Zhao · Chuan Xiao · Xuemin Lin · Wei Wang · Yoshiharu Ishikawa

Received: 13 June 2012 / Revised: 12 January 2013 / Accepted: 17 January 2013 © Springer-Verlag Berlin Heidelberg 2013

Abstract Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources such as erroneous data entries and find similarity matches. In this paper, we study graph similarity queries with edit distance constraints. Inspired by the q-gram idea for string similarity problems, our solution extracts paths from graphs as features for indexing. We establish a lower bound of common features to generate candidates. Efficient algorithms are proposed to handle three types of graph similarity queries by exploiting both matching and mismatching features as well as degree information to improve the filtering and verification on candidates. We demonstrate the proposed algorithms significantly outperform existing approaches with extensive experiments on real and synthetic datasets.

Electronic supplementary material The online version of this article (doi:10.1007/s00778-013-0306-1) contains supplementary material, which is available to authorized users. X. Zhao (B) · X. Lin · W. Wang The University of New South Wales, Sydney, Australia e-mail: [email protected] X. Lin e-mail: [email protected] W. Wang e-mail: [email protected] X. Zhao NICTA, Sydney, Australia C. Xiao · Y. Ishikawa Nagoya University, Nagoya, Japan e-mail: [email protected] Y. Ishikawa e-mail: [email protected]

Keywords

Graph similarity query · Edit distance · q-Gram

1 Introduction Graphs have a wide range of applications and have been utilized to model complex data in biological and chemical information systems, multimedia, social networks, etc. There has been considerable interest in many fundamental problems in analyzing graphs. Various algorithms are devised to solve the problems, including graph pattern mining [24,34,43], graph containment search and indexing [5,35,41], etc. Due to the existence of noise and inconsistency in data, a recent trend is to study similarity matches among graphs [22,23,26,27,31,32,36,38]. This body of work solves the problem of searching for graphs in a database that approximately contain or are contained by a query. Among the various graph similarity measures used in these studies, graph edit distance [3,21] has been widely accepted for representing distances between graphs. Compared with alternative distance or similarity measures, graph edit distance has three advantages: (1) It allows changes in both vertices and edges; (2) it reflects the topological information of graphs; and (3) it is a metric that can be applied to any type of graphs. Due to these elegant properties, graph edit distance has been used in the context of classification and clustering tasks in various application d