Labeling Data Extracted from the Web


1 Universidade Federal do Amazonas, Manaus, AM, Brazil
{alti,john,msevalho}@dcc.ufam.edu.br
2 University of Calgary, Calgary, AB, Canada
[email protected]

Abstract. We consider finding descriptive labels for anonymous, structured datasets, such as those produced by state-of-the-art Web wrappers. We give a probabilistic model to estimate the affinity between attributes and labels, and describe a method that uses a Web search engine to populate the model. We discuss a method for finding good candidate labels for unlabeled datasets. Ours is the first unsupervised labeling method that does not rely on mining the HTML pages containing the data. Experimental results with data from 8 different domains show that our methods achieve high accuracy even with very few search engine accesses.

1 Introduction

The Web is a vast, albeit disorganized, source of valuable information. To extract such information into a format suitable for use by other applications, several Web wrappers have been proposed. However, these methods [1,3,16] recognize only the structure, not the semantics, of the Web data: they produce anonymous datasets (i.e., datasets with meaningless labels in their schema). This is unfortunate, as data integration tools often rely on the existence of meaningful labels in the schema [11].

To address this limitation, other authors have proposed methods for labeling anonymous data extracted by Web wrappers [2,4,13]. In general, these methods work by mining terms with distinctive formatting within the original pages containing the data. While high accuracy is sometimes achieved (the authors of [2] report up to 90% accuracy), this approach has two drawbacks. First, typical Web pages often omit labels that a human reader infers from the context. For instance, the book description in Figure 1 contains some labels (e.g., ISBN), while others are missing (e.g., title and publisher). Second, and more importantly, this approach restricts one to using only those labels chosen by the Web content providers, which may not be the most appropriate or most descriptive ones.

Fig. 1. Book description extracted from http://www.bookpool.com

A.S. da Silva et al. In: R. Meersman and Z. Tari et al. (Eds.): OTM 2007, Part I, LNCS 4803, pp. 1099–1116, 2007. © Springer-Verlag Berlin Heidelberg 2007

We propose a novel and highly effective method for automatically labeling anonymous data based on a simple probabilistic model that takes into account the affinity between a set of values (i.e., an anonymous attribute) and potential attribute labels. The probabilities are estimated by counting the number of answers to speculative queries, obtained from a standard Web search engine. Intuitively, a speculative query formulates a hypothesis that a given term is a good label for an attribute in the anonymous dataset. The search engine is used as an oracle to determine how plausible such a hypothesis is. Unlike previous methods, our method is oblivious to where the candidate labels come from. Also, we give a fully automatic method for finding go