A self-verifying clustering approach to unsupervised matching of product titles


Leonidas Akritidis1,2 · Athanasios Fevgas2 · Panayiotis Bozanis1,2 · Christos Makris3

© Springer Nature B.V. 2020

Abstract

The continuous growth of the e-commerce industry has made the problem of product retrieval particularly important. As more enterprises move their activities to the Web, the volume and diversity of product-related information increase quickly. These factors make it difficult for users to identify and compare the features of their desired products. Recent studies have shown that standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employ external data sources to enrich the titles; these solutions are rather impractical, since the process of fetching external data is inefficient. In this paper we introduce UPM, an unsupervised algorithm for matching products by their titles that is independent of any external sources. UPM consists of three stages. During the first stage, the algorithm analyzes the titles and extracts combinations of words from them. In the second stage, these combinations are evaluated according to several criteria, and the most appropriate ones are selected to form the initial clusters. The third stage is a post-processing verification step that refines the initial clusters by correcting erroneous matches. This stage is designed to operate in combination with any clustering approach, especially when the data possess properties that prevent two data points from co-existing within the same cluster. The experimental evaluation of UPM on multiple datasets demonstrates its superiority over state-of-the-art clustering approaches and string similarity metrics in terms of both efficiency and effectiveness.

Keywords: Product matching · Entity matching · Entity resolution · Clustering · Unsupervised learning · Data mining
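The observation that similar titles often refer to different products, and vice versa, can be illustrated with a small sketch. It assumes token-level Jaccard similarity as a representative string similarity metric (not necessarily one of the metrics evaluated in the paper), and the three product titles are hypothetical examples rather than entries from the paper's datasets.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity of two titles (lowercased, whitespace-split)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty titles are treated as identical
    return len(ta & tb) / len(ta | tb)


# Hypothetical titles, for illustration only.
reference = "Apple iPhone 11 64GB Black"
other_product = "Apple iPhone 11 128GB Black"                # different storage variant
same_product = "iPhone 11 smartphone 64 GB in black by Apple"  # same item, reworded

different = jaccard(reference, other_product)  # 4 shared tokens out of 6
same = jaccard(reference, same_product)        # 4 shared tokens out of 10

print(f"different products: {different:.2f}")
print(f"same product:       {same:.2f}")
```

Under these assumptions the pair describing different products scores higher than the pair describing the same product, so a fixed similarity threshold would either merge the two storage variants or miss the reworded match.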

1 Introduction

The online comparison of products is a crucial process, since it is usually the first step in the life cycle of an electronic sale. Before a purchase is completed, the majority of users search for, collect, and aggregate the characteristics of both the desired product and similar products, if any. For this reason, the role of product comparison services has become increasingly important. These platforms retrieve data from various sources, including electronic stores, suppliers, and review sites, and merge the information that refers to identical products. They then present this information to their users, allowing them to compare a variety of parameters such as features and prices. They also facilitate the aggregation of user opinions and reviews. Since the product-related data originates from multiple sources, it presents a high degree of diversity. To implement their comp