Deep Neural Networks for Web Page Information Extraction


Abstract. Web wrappers are systems for extracting structured information from web pages. Currently, wrappers need to be adapted to a particular website template before they can start the extraction process. In this work we present a new method, which uses convolutional neural networks to learn a wrapper that can extract information from previously unseen templates. Therefore, this wrapper does not need any site-specific initialization and is able to extract information from a single web page. We also propose a method for spatial text encoding, which allows us to encode visual and textual content of a web page into a single neural net. The first experiments with product information extraction showed very promising results and suggest that this approach can lead to a general site-independent web wrapper.

Keywords: Information extraction · Web wrappers · Convolutional neural networks

1 Introduction

The Internet is the biggest and fastest growing source of data in today's world. Many information systems that gather structured data need to acquire information from web pages. However, HTML files are designed to be processed by web browsers and do not contain information in a structured form.¹ Therefore, systems that can extract structured information from web pages receive special attention in the research community. Such tools are usually referred to as web wrappers. Although people can easily extract information from different web pages, the task of creating an automatic wrapper that can extract information from multiple websites is considered a very complex problem. This is mainly because the semantics of an element depend on many properties, such as textual content, visual appearance and relative positioning. Therefore, the research community has mainly focused on wrappers that must first be adapted to a particular website before they can extract information from its web pages [3,5,8,10,16]. However, such an approach brings many disadvantages, such as difficult scalability and maintenance.

¹ There are efforts to include structured data in HTML, such as the schema.org project, but it is still not widely used by web developers.

© IFIP International Federation for Information Processing 2016. Published by Springer International Publishing Switzerland 2016. All Rights Reserved. L. Iliadis and I. Maglogiannis (Eds.): AIAI 2016, IFIP AICT 475, pp. 154–163, 2016. DOI: 10.1007/978-3-319-44944-9_14

In this work, we show that a combination of visual and textual data in a single model can help us to create a general (multi-site) wrapper. The three main contributions of this work are: (1) We propose a method of encoding data from a web rendering engine into a deep neural net, i.e. a method for spatial encoding of text. (2) On the task of product information extraction, we show that the neural net can be trained to extract information in non-trivial cases. (3) We make our dataset, source code and final model public, in order to provide a benchmark for future work.²
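Contribution (1) is only named at this point, not specified. As a rough illustration of what a spatial encoding of text could look like, the sketch below paints hashed word features into a coarse grid at the position of each rendered text box, so that a convolutional network could read text as additional input channels next to the page image. Everything in it is an assumption made for illustration and is not taken from the text above: the function name `encode_text_spatially`, the `(text, x, y, width, height)` input format, the grid resolution, and the hashing scheme (CRC32 modulo the number of channels).

```python
import math
import zlib


def encode_text_spatially(text_boxes, page_width, page_height,
                          grid_cols=8, grid_rows=8, n_features=16):
    """Hypothetical sketch: paint hashed word features into a spatial grid.

    text_boxes: iterable of (text, x, y, width, height) in page pixels,
    as a rendering engine might report them (assumed format).
    Returns a nested list indexed as [feature][row][col].
    """
    grid = [[[0.0] * grid_cols for _ in range(grid_rows)]
            for _ in range(n_features)]
    cell_w = page_width / grid_cols
    cell_h = page_height / grid_rows

    for text, x, y, w, h in text_boxes:
        # Hashing trick: every word switches on one feature channel.
        channels = {zlib.crc32(word.lower().encode("utf-8")) % n_features
                    for word in text.split()}

        # Range of grid cells overlapped by the box, clamped to the page.
        col_start = min(grid_cols - 1, max(0, int(x // cell_w)))
        row_start = min(grid_rows - 1, max(0, int(y // cell_h)))
        col_end = max(col_start,
                      min(grid_cols - 1, math.ceil((x + w) / cell_w) - 1))
        row_end = max(row_start,
                      min(grid_rows - 1, math.ceil((y + h) / cell_h) - 1))

        for channel in channels:
            for row in range(row_start, row_end + 1):
                for col in range(col_start, col_end + 1):
                    grid[channel][row][col] = 1.0
    return grid


# Example: one text node on an 800 x 800 px page (made-up coordinates).
example_grid = encode_text_spatially(
    [("Price: $19.99", 150, 250, 100, 40)], 800, 800)
```

The resulting `[feature][row][col]` layout matches the channel-first layout commonly used for convolutional inputs, so such a grid could in principle be stacked with image channels. Hash collisions between words are accepted as noise in this sketch.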