Application of Logic Wrappers to Hierarchical Data Extraction from HTML
1 University of Craiova, Business Information Systems Department, A.I.Cuza 13, Craiova, RO-200585, Romania
2 University of Craiova, Software Engineering Department, Bvd.Decebal 107, Craiova, RO-200440, Romania
{badica costin,popescu elvira}@software.ucv.ro

Abstract. Logic wrappers combine the logic programming paradigm with efficient XML processing for data extraction from HTML. In this note we show how logic wrapper technology can be adapted to cope with hierarchical data extraction. For this purpose we introduce hierarchical logic wrappers and illustrate their application by means of an intuitive example.
1 Introduction

The Web is extensively used for information dissemination to humans and businesses. For this purpose, Web technologies are used to convert data from internal formats, usually specific to database management systems, into presentations suitable for attracting human users. However, interest has rapidly shifted toward making that information available for machine consumption, with the realization that Web data can be reused for various problem-solving purposes, including common tasks like searching and filtering as well as more complex tasks like analysis, decision making, reasoning and integration.

Two emergent technologies that have been put forward to enable automated processing of information available on the Web are semantic markup [14] and Web services [15]. Note, however, that most current practices in Web publishing are still based on the combination of traditional HTML, the lingua franca for Web publishing [10], with server-side dynamic content generation from databases. Moreover, many Web pages use HTML elements that were originally intended to structure content (e.g. the elements related to tables) for layout and presentation effects, even though this practice is discouraged in theory.

Therefore, techniques developed in areas like information extraction, machine learning and wrapper induction are expected to play a significant role in tackling the problem of Web data extraction. Research in this area has resulted in a large number of Web data extraction approaches that differ at least according to the task domain, the degree of automation and the technique used [6].

This is an extended version of the paper: Amelia Bădică, Costin Bădică, Elvira Popescu: Using Logic Wrappers to Extract Hierarchical Data from HTML. In: Advances in Intelligent Web Mastering. Proc. AWIC'2007, Fontainebleau, France. Advances in Soft Computing 43, 25-40, Springer, 2007.

J. Neves, M. Santos, and J. Machado (Eds.): EPIA 2007, LNAI 4874, pp. 43–52, 2007.
© Springer-Verlag Berlin Heidelberg 2007
A. Bădică, C. Bădică, and E. Popescu
Our recent work in the area of Web data extraction was focused on combining logic programming with efficient XML processing [8]. The results were: i) definition of logic wrappers or L-wrappers for data extraction from the Web ([2]); ii) the development of a methodology for the application of L-wrappers on real problems ([2]); iii) design of efficient algorithms for s