Deep Neural Networks for Web Page Information Extraction


Abstract. Web wrappers are systems for extracting structured information from web pages. Currently, wrappers need to be adapted to a particular website template before they can start the extraction process. In this work we present a new method, which uses convolutional neural networks to learn a wrapper that can extract information from previously unseen templates. Therefore, this wrapper does not need any site-specific initialization and is able to extract information from a single web page. We also propose a method for spatial text encoding, which allows us to encode visual and textual content of a web page into a single neural net. The first experiments with product information extraction showed very promising results and suggest that this approach can lead to a general site-independent web wrapper.

Keywords: Information extraction · Web wrappers · Convolutional neural networks

1 Introduction

The Internet is the biggest and the fastest growing source of data in today’s world. Many information systems that gather structured data need to acquire information from web pages. However, HTML files are designed to be processed by web browsers and do not contain information in a structured form.¹ Therefore, systems that can extract structured information from web pages receive special attention in the research community. Such tools are usually referred to as web wrappers. Although people can easily extract information from different web pages, the task of creating an automatic wrapper that can extract information from multiple websites is considered a very complex problem. This is mainly because the semantics of elements depends on many properties, such as textual content, visual appearance and relative positioning. Therefore, the research community is mainly focused on wrappers that must first be adapted to a particular website before they can extract information from its web pages [3,5,8,10,16]. However, such an approach brings many disadvantages, such as difficult scalability and maintenance. In this work, we show that a combination of visual and textual data in a single model can help us to create a general (multi-site) wrapper. The three main contributions of this work are: (1) We propose a method of encoding data from a web rendering engine into a deep neural net, i.e., a method for spatial encoding of text. (2) On the task of product information extraction, we show that the neural net can be trained to extract information in non-trivial cases. (3) We make our dataset, source code and final model public, in order to provide a benchmark for future work.²

¹ There are efforts to include structured data in HTML, such as the schema.org project, but it is still not widely used by web developers.

© IFIP International Federation for Information Processing 2016. Published by Springer International Publishing Switzerland 2016. All Rights Reserved. L. Iliadis and I. Maglogiannis (Eds.): AIAI 2016, IFIP AICT 475, pp. 154–163, 2016. DOI: 10.1007/978-3-319-44944-9_14
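Contribution (1), spatial encoding of text, is only named in this excerpt, not specified. The following is therefore a minimal sketch of one way such an encoding might look, not the authors' implementation. It assumes that a rendering engine reports each text node with a pixel bounding box, that words are mapped to a small number of buckets with a hash (a swapped-in, generic bag-of-words technique), and that the resulting vector is written into every cell of a coarse grid that the box covers. The function name, the input tuple layout, the cell size and the number of buckets are all hypothetical.

```python
import zlib
from typing import List, Tuple

# Hypothetical input format: (text, x, y, width, height) in page pixels.
TextBox = Tuple[str, int, int, int, int]


def encode_text_spatially(
    boxes: List[TextBox],
    page_w: int,
    page_h: int,
    cell: int = 8,       # assumed grid resolution: one cell per 8x8 pixels
    n_buckets: int = 4,  # assumed number of hash buckets (feature channels)
) -> List[List[List[float]]]:
    """Return grid[row][col][bucket] holding hashed word counts."""
    cols = (page_w + cell - 1) // cell
    rows = (page_h + cell - 1) // cell
    grid = [[[0.0] * n_buckets for _ in range(cols)] for _ in range(rows)]

    for text, x, y, w, h in boxes:
        # Hashed bag-of-words vector for this text node.
        vec = [0.0] * n_buckets
        for word in text.lower().split():
            vec[zlib.crc32(word.encode("utf-8")) % n_buckets] += 1.0

        # Range of grid cells overlapped by the bounding box, clamped to the page.
        c0, c1 = max(x // cell, 0), min((x + w - 1) // cell, cols - 1)
        r0, r1 = max(y // cell, 0), min((y + h - 1) // cell, rows - 1)

        # Write the text vector into every overlapped cell.
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                for k in range(n_buckets):
                    grid[r][c][k] += vec[k]
    return grid


# Two made-up text nodes on a 160x80 pixel page.
boxes = [("Price: $19.99", 16, 8, 80, 16), ("Acme Kettle", 16, 40, 120, 24)]
grid = encode_text_spatially(boxes, page_w=160, page_h=80)
```

The result has the same spatial layout as a downscaled screenshot, so in principle it could be stacked with image channels and passed to a convolutional network; whether the paper does exactly this is not stated in the text above.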
