Active Learning for Duplicate Record Identification in Deep Web


Abstract Active learning is important for duplicate record identification since manually identifying a suitable set of labeled examples is difficult. The imbalanced-data problem in duplicate record identification, wherein the number of non-match samples far exceeds the number of match samples, causes poor prediction performance for the match class. In this paper, we present a new active learning approach that takes certainty, uncertainty, and representativeness into account. Our method first trains two feature-subspace classifiers and uses the certainty classifier to generate a pool of likely matches, from which informative match samples are selected for manual annotation by leveraging an uncertainty and density measurement; meanwhile, non-match samples are labeled automatically to reduce human annotation effort. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.

Keywords Deep web · Duplicate record identification · Active learning
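The selection step described in the abstract can be illustrated with a minimal sketch. The snippet below assumes two classifiers trained on disjoint feature subspaces (`clf_a` as the certainty classifier, `clf_b` for uncertainty scoring), a matrix `X` of similarity features for unlabeled record pairs, and hypothetical parameters `match_thresh`, `nonmatch_thresh`, and `k`; it is an illustrative approximation of the approach, not the authors' exact algorithm.

```python
# Minimal sketch of certainty/uncertainty/density-based sample selection.
# Assumes clf_a and clf_b are pre-trained scikit-learn-style classifiers
# (with predict_proba) on disjoint feature subspaces; thresholds are
# hypothetical and would be tuned in practice.
import numpy as np

def select_for_annotation(clf_a, clf_b, X, k=10,
                          match_thresh=0.8, nonmatch_thresh=0.1):
    """Return indices to hand-label and indices auto-labeled as non-matches."""
    eps = 1e-12

    # The "certainty" classifier builds a pool of likely matches.
    p_a = clf_a.predict_proba(X)[:, 1]
    pool = np.where(p_a >= match_thresh)[0]

    # Uncertainty of the second classifier on the pooled samples
    # (binary entropy peaks when its predicted probability is near 0.5).
    p_b = clf_b.predict_proba(X[pool])[:, 1]
    uncertainty = -(p_b * np.log(p_b + eps) + (1 - p_b) * np.log(1 - p_b + eps))

    # Representativeness: average cosine similarity of each pooled sample
    # to the rest of the pool, so samples in dense regions score higher.
    Xp = X[pool]
    unit = Xp / (np.linalg.norm(Xp, axis=1, keepdims=True) + eps)
    sims = unit @ unit.T
    density = (sims.sum(axis=1) - 1.0) / max(len(pool) - 1, 1)

    # The most informative match candidates go to the human annotator.
    to_label = pool[np.argsort(-uncertainty * density)[:k]]

    # Confident non-matches are labeled automatically to save annotation effort.
    p_b_all = clf_b.predict_proba(X)[:, 1]
    auto_nonmatch = np.where((p_a <= nonmatch_thresh) &
                             (p_b_all <= nonmatch_thresh))[0]
    return to_label, auto_nonmatch
```

In an active learning loop, the hand-labeled pairs returned by such a routine would be added to the training set and both classifiers retrained before the next round of selection.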

1 Introduction

In recent years, the web has rapidly deepened with the prevalence of online databases [1]. On the Internet, many pages are generated dynamically by back-end databases, and such information may not be accessible through static URL links. These pages are assembled as responses to queries submitted through the "query interface" of an underlying database. Because current search engines cannot effectively "crawl" databases, such data are believed to be "invisible" and thus remain largely "hidden" from
users (and are thus often also referred to as the Deep web, the invisible web, or the hidden web). The Deep web is a concept relative to the Surface web, first proposed by Dr. Jill Ellsworth in 1994. Information in the Deep web is stored in databases and is characterized by abundant content, a single theme per source, high quality, good structure, and a rapid growth rate. Using overlap analysis between pairs of search engines, a July 2000 white paper [2] estimated 43,000–96,000 "Deep web sites" and gave an informal estimate of 7,500 terabytes of data, 500 times larger than the Surface web. The Deep web has clearly rendered large-scale integration both a real necessity and a real challenge. Identifying duplicate records from multiple web databases is one of the key steps in Deep web data integration. Within one domain (books, music, computers, etc.), a large proportion of entities are often duplicated across web databases, so it is necessary to identify them for further applications such as deduplication or price-comparison services. Active learning is a promising way to reduce the labeling cost of duplicate record identification. However, there is