Site-Level Web Template Extraction Based on DOM Analysis


1 Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain {jalarte,dinsa,jsilva}@dsic.upv.es
2 IMDEA Software, Universidad Politécnica de Madrid, Campus Montegancedo UPM, 28223 Pozuelo de Alarcón, Madrid, Spain [email protected]

Abstract. Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. Templates are also useful for the final user, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information wastes resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of the data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic web template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using a hyperlink analysis. Our implementation and experiments demonstrate the usefulness of the technique.
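As a rough illustration of what "similarity analysis between DOM trees" can mean, the following Python sketch parses two pages of the same site and keeps only the elements that appear in both at the same position. It is a simplified stand-in under stated assumptions, not the algorithm proposed in this work: the names `Node`, `TreeBuilder`, `parse`, `common_tree`, and `tags` are hypothetical, nodes are compared only by tag and attributes, children are matched strictly by position, and the hyperlink analysis that selects the pages is omitted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """Simplified DOM element: tag name, attributes, child elements (text ignored)."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = tuple(sorted((k, v or "") for k, v in attrs))
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def common_tree(a, b):
    """Top-down intersection: keep a node only if both trees hold an equal
    node (same tag and attributes) at the same child position."""
    if a.tag != b.tag or a.attrs != b.attrs:
        return None
    out = Node(a.tag, a.attrs)
    for child_a, child_b in zip(a.children, b.children):
        shared = common_tree(child_a, child_b)
        if shared is not None:
            out.children.append(shared)
    return out


def tags(node):
    """Flatten a tree into a preorder list of tag names."""
    result = [node.tag]
    for child in node.children:
        result.extend(tags(child))
    return result


# Two made-up pages that share a menu but differ in their main content.
page_a = ('<html><body><div id="menu"><a href="/">Home</a></div>'
          '<div id="content"><p>First</p></div></body></html>')
page_b = ('<html><body><div id="menu"><a href="/">Home</a></div>'
          '<div id="content"><h1>Title</h1><p>Second</p></div></body></html>')

template = common_tree(parse(page_a), parse(page_b))
print(tags(template))
```

In this toy case the shared menu survives the intersection, while the differing children of the content container are dropped. Matching children by position is deliberately crude; a practical method would need a more tolerant notion of node equality and alignment.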

Keywords: Information retrieval · Content extraction · Template extraction

1 Introduction

A web template (hereafter, just template) is a prepared HTML page in which the formatting is already implemented and the visual components are ready to receive content. Templates are an essential component of today's websites, and they are important for web developers, users, and also for indexers and crawlers:

This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad (Secretaría de Estado de Investigación, Desarrollo e Innovación) under grant TIN2013-44742-C4-1-R and by the Generalitat Valenciana under grant PROMETEOII/2015/013. David Insa was partially supported by the Spanish Ministerio de Educación under FPU grant AP2010-4415.
© Springer International Publishing Switzerland 2016. M. Mazzara and A. Voronkov (Eds.): PSI 2015, LNCS 9609, pp. 36–49, 2016. DOI: 10.1007/978-3-319-41579-6_4


– Web developers use templates as a basis for composing new webpages that share a common look and feel. This also allows them to automate many tasks thanks to the reuse of components. In fact, many websites are maintained automatically by code generators that generate webpages using templates.
– Users can benefit from intuitive and uniform designs with a common vocabulary of colored and formatted visual elements.
– Crawlers and indexers usually judge the relevance of a webpage according to the frequency and distribution of terms and hyperlinks. Since templates contain a considerable number of common terms and hyperlinks that are replicated in a large number of webpages, relevance may turn out to be ina
