Active Learning for Duplicate Record Identification in Deep Web


Abstract Active learning is important for duplicate record identification since manually identifying a suitable set of labeled examples is difficult. The imbalanced-data problem in duplicate record identification, wherein the number of non-match samples far exceeds the number of match samples, causes poor prediction performance for the match class. In this paper, we present a new active learning approach that takes certainty, uncertainty, and representativeness into account. Our method first trains two feature-subspace classifiers and uses the certainty classifier to generate a pool of likely matches, from which informative match samples are selected for manual annotation by leveraging an uncertainty and density measurement; meanwhile, non-match samples are labeled automatically to reduce human annotation effort. We include a detailed experimental evaluation on real-world data demonstrating the effectiveness of our algorithms.

Keywords Deep web · Duplicate record identification · Active learning
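The selection step described in the abstract can be illustrated with a minimal sketch. The snippet below assumes two classifiers trained on disjoint feature subspaces (`clf_a` as the certainty classifier, `clf_b` for uncertainty scoring), a matrix `X` of similarity features for unlabeled record pairs, and hypothetical parameters `match_thresh`, `nonmatch_thresh`, and `k`; it is an illustrative approximation of the approach, not the authors' exact algorithm.

```python
# Minimal sketch of certainty/uncertainty/density-based sample selection.
# Assumes clf_a and clf_b are pre-trained scikit-learn-style classifiers
# (with predict_proba) on disjoint feature subspaces; thresholds are
# hypothetical and would be tuned in practice.
import numpy as np

def select_for_annotation(clf_a, clf_b, X, k=10,
                          match_thresh=0.8, nonmatch_thresh=0.1):
    """Return indices to hand-label and indices auto-labeled as non-matches."""
    eps = 1e-12

    # The "certainty" classifier builds a pool of likely matches.
    p_a = clf_a.predict_proba(X)[:, 1]
    pool = np.where(p_a >= match_thresh)[0]

    # Uncertainty of the second classifier on the pooled samples
    # (binary entropy peaks when its predicted probability is near 0.5).
    p_b = clf_b.predict_proba(X[pool])[:, 1]
    uncertainty = -(p_b * np.log(p_b + eps) + (1 - p_b) * np.log(1 - p_b + eps))

    # Representativeness: average cosine similarity of each pooled sample
    # to the rest of the pool, so samples in dense regions score higher.
    Xp = X[pool]
    unit = Xp / (np.linalg.norm(Xp, axis=1, keepdims=True) + eps)
    sims = unit @ unit.T
    density = (sims.sum(axis=1) - 1.0) / max(len(pool) - 1, 1)

    # The most informative match candidates go to the human annotator.
    to_label = pool[np.argsort(-uncertainty * density)[:k]]

    # Confident non-matches are labeled automatically to save annotation effort.
    p_b_all = clf_b.predict_proba(X)[:, 1]
    auto_nonmatch = np.where((p_a <= nonmatch_thresh) &
                             (p_b_all <= nonmatch_thresh))[0]
    return to_label, auto_nonmatch
```

In an active learning loop, the hand-labeled pairs returned by such a routine would be added to the training set and both classifiers retrained before the next round of selection.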

1 Introduction

In recent years, the web has rapidly deepened with the prevalence of online databases [1]. On the Internet, many pages are generated dynamically by back-end databases, and such information may not be accessible through static URL links. These pages are assembled as responses to queries submitted through the "query interface" of an underlying database. Because current search engines cannot effectively "crawl" databases, such data are believed to be "invisible" and thus remain largely "hidden" from
users (and are thus often also referred to as the Deep web, the invisible web, or the hidden web). The Deep web is a concept relative to the Surface web, first proposed by Dr. Jill Ellsworth in 1994. Information in the Deep web is stored in databases and is characterized by abundant content, a single theme per source, high quality, good structure, and a rapid growth rate. Using overlap analysis between pairs of search engines, a July 2000 white paper [2] estimated 43,000–96,000 "Deep web sites" and gave an informal estimate of 7,500 terabytes of data, 500 times larger than the Surface web. The Deep web has clearly rendered large-scale integration both a real necessity and a real challenge. Identifying duplicate records from multiple web databases is one of the key steps in Deep web data integration. Within one domain (books, music, computers, etc.), a large proportion of entities are often duplicated across web databases, so it is necessary to identify them for further applications such as deduplication or price-comparison services. Active learning is a promising way to reduce the labeling cost of duplicate record identification. However, there is