Labeling Data Extracted from the Web


1 Universidade Federal do Amazonas, Manaus, AM, Brazil
{alti,john,msevalho}@dcc.ufam.edu.br
2 University of Calgary, Calgary, AB, Canada
[email protected]

Abstract. We consider finding descriptive labels for anonymous, structured datasets, such as those produced by state-of-the-art Web wrappers. We give a probabilistic model to estimate the affinity between attributes and labels, and describe a method that uses a Web search engine to populate the model. We discuss a method for finding good candidate labels for unlabeled datasets. Ours is the first unsupervised labeling method that does not rely on mining the HTML pages containing the data. Experimental results with data from 8 different domains show that our methods achieve high accuracy even with very few search engine accesses.

1 Introduction

The Web is a vast, albeit disorganized, source of valuable information. To extract such information into a format suitable for use by other applications, several Web wrappers have been proposed. However, these methods [1,3,16] recognize only the structure, but not the semantics, of the Web data: they produce anonymous datasets (i.e., datasets with meaningless labels in their schema). This is unfortunate, as data integration tools often rely on the existence of meaningful labels in the schema [11].

In the face of these limitations, other authors have proposed methods for labeling anonymous data extracted by Web wrappers [2,4,13]. In general, these methods work by mining terms with distinctive formatting within the original pages containing the data. While high accuracy is sometimes achieved (the authors of [2] report up to 90% accuracy), this approach has two drawbacks. First, typical Web pages often omit labels, which a human reader understands from the context. For instance, the book description in Figure 1 contains some labels (e.g., ISBN), while others are missing (e.g., title and publisher). Second, and more importantly, this approach restricts one to using only those labels chosen by the Web content providers, which may not be the most appropriate or most descriptive ones.

Fig. 1. Book description extracted from http://www.bookpool.com

We propose a novel and highly effective method for automatically labeling anonymous data based on a simple probabilistic model that takes into account the affinity between a set of values (i.e., an anonymous attribute) and potential attribute labels. The probabilities are estimated by counting the number of answers to speculative queries, obtained from a standard Web search engine. Intuitively, a speculative query formulates a hypothesis that a given term is a good label for an attribute in the anonymous dataset. The search engine is used as an oracle to determine how plausible such a hypothesis is. Unlike previous methods, our method is oblivious to where the candidate labels come from. Also, we give a fully automatic method for finding go

A.S. da Silva et al.
R. Meersman and Z. Tari et al. (Eds.): OTM 2007, Part I, LNCS 4803, pp. 1099–1116, 2007. © Springer-Verlag Berlin Heidelberg 2007
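The idea of speculative queries can be illustrated with a minimal Python sketch. It is not the authors' model: the excerpt does not give the exact probabilistic formulation or the query syntax, so the query template, the normalization, and all names here (`speculative_query`, `label_affinity`, `best_label`, `FAKE_HITS`) are assumptions made for illustration. A dictionary of made-up hit counts stands in for the Web search engine.

```python
from typing import Callable, Dict, List

# Made-up hit counts standing in for a real Web search engine (assumption).
FAKE_HITS: Dict[str, int] = {
    '"title" "Moby Dick"': 900,
    '"publisher" "Moby Dick"': 100,
    '"title" "Dracula"': 700,
    '"publisher" "Dracula"': 300,
}


def fake_hit_count(query: str) -> int:
    """Return the number of 'answers' for a query; unknown queries get 0."""
    return FAKE_HITS.get(query, 0)


def speculative_query(label: str, value: str) -> str:
    """Hypothetical query template: the label and the value as quoted phrases."""
    return f'"{label}" "{value}"'


def label_affinity(
    values: List[str],
    labels: List[str],
    hit_count: Callable[[str], int],
) -> Dict[str, float]:
    """Average, over the attribute's values, of each label's share of hits.

    This normalized hit-count ratio is an illustrative stand-in for the
    affinity estimate; the paper's actual estimator may differ.
    """
    totals = {label: 0.0 for label in labels}
    for value in values:
        counts = {
            label: hit_count(speculative_query(label, value)) for label in labels
        }
        total = sum(counts.values())
        if total == 0:
            continue  # no evidence for this value; it contributes nothing
        for label, count in counts.items():
            totals[label] += count / total
    n = len(values)
    return {label: (t / n if n else 0.0) for label, t in totals.items()}


def best_label(
    values: List[str],
    labels: List[str],
    hit_count: Callable[[str], int],
) -> str:
    """Pick the candidate label with the highest estimated affinity."""
    scores = label_affinity(values, labels, hit_count)
    return max(scores, key=lambda label: scores[label])
```

Calling `best_label(["Moby Dick", "Dracula"], ["title", "publisher"], fake_hit_count)` selects the label whose speculative queries attract the larger share of hits. In a real setting `fake_hit_count` would be replaced by a function that issues the query to a search engine and reads the reported result count.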
