Using Clustering and Edit Distance Techniques for Automatic Web Data Extraction
Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, which do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and found them to be very effective.
This research was partially supported by the Spanish Ministry of Education and Science under project TSI2005-07730.
Alberto Pan’s work was partially supported by the “Ramón y Cajal” programme of the Spanish Ministry of Education and Science.
B. Benatallah et al. (Eds.): WISE 2007, LNCS 4831, pp. 212–224, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
In today’s Web, there are many sites providing access to structured data contained in an underlying database. Typically, these sources, known as “semi-structured” web sources, provide some kind of HTML form that allows issuing queries against the database, and they return the query results embedded in HTML pages conforming to a certain fixed template. For instance, Fig. 1 shows a page containing a list of data records, representing the information about books in an Internet shop.
Allowing software programs to access these structured data is useful for a variety of purposes. For instance, it allows data integration applications to access web information in a manner similar to a database. It also allows information gathering applications to store the retrieved information while maintaining its structure, which in turn allows more sophisticated processing.
Several approaches have been reported in the literature for building and maintaining “wrappers” for semi-structured web sources ([2][9][11][12][13]; [7] provides a brief survey). Although wrappers have been successfully used for many web data extraction and automation tasks, this approach has the inherent limitation that the target data sources must be known in advance. This is not possible in all cases. Consider, for instance, the case of “focused crawling” applications [3], which automatically crawl the web looking for topic-specific information.
Several automatic methods for web data extraction have also been proposed in the literature [1][4][5][14], but they present several limitations. First, [1][5] require multiple pages generated using the same template as input. This can be inconvenient because a sufficie