Application of Logic Wrappers to Hierarchical Data Extraction from HTML



1 University of Craiova, Business Information Systems Department, A.I.Cuza 13, Craiova, RO-200585, Romania, [email protected]
2 University of Craiova, Software Engineering Department, Bvd.Decebal 107, Craiova, RO-200440, Romania, {badica costin,popescu elvira}@software.ucv.ro

Abstract. Logic wrappers combine the logic programming paradigm with efficient XML processing for data extraction from HTML. In this note we show how logic wrapper technology can be adapted to cope with hierarchical data extraction. For this purpose we introduce hierarchical logic wrappers and illustrate their application by means of an intuitive example.

1 Introduction

The Web is extensively used for disseminating information to humans and businesses. For this purpose, Web technologies are used to convert data from internal formats, usually specific to database management systems, into presentations suitable for attracting human users. However, interest has rapidly shifted toward making that information available for machine consumption, with the realization that Web data can be reused for various problem-solving purposes, including common tasks like searching and filtering as well as more complex tasks like analysis, decision making, reasoning and integration.

Two emergent technologies that have been put forward to enable automated processing of information available on the Web are semantic markup [14] and Web services [15]. Note, however, that most current practices in Web publishing are still based on the combination of traditional HTML, the lingua franca for Web publishing [10], with server-side dynamic content generation from databases. Moreover, many Web pages use HTML elements that were originally intended to structure content (e.g. the elements related to tables) for layout and presentation effects, even though this practice is not encouraged in theory. Therefore, techniques developed in areas like information extraction, machine learning and wrapper induction are expected to play a significant role in tackling the problem of Web data extraction. Research in this area has resulted in a large number of Web data extraction approaches that differ at least in the task domain, the degree of automation and the technique used [6].

This is an extended version of the paper: Amelia Bădică, Costin Bădică, Elvira Popescu: Using Logic Wrappers to Extract Hierarchical Data from HTML. In: Advances in Intelligent Web Mastering. Proc. AWIC'2007, Fontainebleau, France. Advances in Soft Computing 43, 25-40, Springer, 2007.

J. Neves, M. Santos, and J. Machado (Eds.): EPIA 2007, LNAI 4874, pp. 43–52, 2007.
© Springer-Verlag Berlin Heidelberg 2007


A. Bădică, C. Bădică, and E. Popescu

Our recent work in the area of Web data extraction has focused on combining logic programming with efficient XML processing [8]. The results were: i) the definition of logic wrappers, or L-wrappers, for data extraction from the Web [2]; ii) the development of a methodology for applying L-wrappers to real problems [2]; iii) the design of efficient algorithms for s
