SETJoin: a novel top- k similarity join algorithm

  • PDF / 1,133,167 Bytes
  • 16 Pages / 595.276 x 790.866 pts Page_size
  • 93 Downloads / 188 Views

DOWNLOAD

REPORT


METHODOLOGIES AND APPLICATION

SETJoin: a novel top-k similarity join algorithm Hongya Wang1 · Lihong Yang1 · Yingyuan Xiao2

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract As an important operation in data cleaning, near duplicate Web pages detection and data mining, similarity joins have received much attention recently. Existing similarity joins fall into two broad categories—the similarity-threshold-based similarity join and top-k similarity join (TopkJoin). Compared with the traditional one, TopkJoin is more suitable for cases where the similarity threshold is unknown before hand. In this paper, we focus on the performance optimization problem of TopkJoin. Particularly, we observed that the state-of-the-art TopkJoin algorithm has three serious performance issues, i.e., the inappropriate application of hash table, inefficient use of suffix filtering and unnecessary evaluation of excessive unqualified candidates. To resolve these problems, we proposed a novel algorithm, SETJoin, by combining the existing eventdriven framework with three simple yet efficient optimization techniques, viz., (1) reducing the cost in hashing by rearranging the orders of the candidate filtering and hash table lookup operations; (2) maximizing the pruning capability of suffix filtering by judiciously choosing the (near) optimal recursion depth; and (3) terminating join operations earlier by setting a much tighter stop condition for iteration. The experimental results show that SETJoin achieves up to 1.26x–3.49x speedup over the state-of-the-art algorithm on several real datasets. Keywords Set similarity join · Query processing · Candidate filtering

1 Introduction Similarity joins have a wide range of applications in domains such as data cleaning (Hernández and Stolfo 1998), near duplicate Web page detection (Bayardo et al. 2007) and data mining (Baraglia et al. 2010). For example, recommendation algorithms often need to compute pair-wise similarity among users or items and then make a recommendation to users who share similar interests. In data cleaning tasks, similarity join can serve as a primitive operation to identify different (but similar) representations of the same entity. Similarity joins have attracted much attention in recent years (Arasu et al. 2006; Xiao et al. 2008; Lam et al. 2010; Communicated by V. Loia.

B

Hongya Wang [email protected] Yingyuan Xiao [email protected]

1

School of Computer Science and Technology, Donghua University, Shanghai, China

2

School of Computer Science and Technology, Tianjin University of Technology, Tianjin, China

Bayardo et al. 2007; Jiang et al. 2014; Chaudhuri et al. 2006; Li et al. 2015). Existing similarity joins fall into two broad categories—the similarity-threshold-based similarity join (SimJoin) and top-k similarity join (TopkJoin). SimJoin returns pairs of records whose similarities are no less than a user-specified similarity threshold, whereas TopkJoin computes the k most similar record pairs under the given similarity function. SimJoin assumes that