Identifying Web Tables: Supporting a Neglected Type of Content on the Web


University of Bonn, Bonn, Germany; ITMO University, Saint Petersburg, Russia

Abstract. The abundance of data on the Internet facilitates the improvement of extraction and processing tools. The trend towards open data publishing encourages the adoption of structured formats such as CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contains semantics. For this reason, we propose an approach to derive semantics from web tables, which are still the most popular publishing tool on the Web. The paper also discusses methods and services for unstructured data extraction and processing, as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish and visualize linked open data. The software enables the extraction of tables from various open data sources in HTML format and their automatic export to RDF, making the data linked. The paper also evaluates machine learning techniques in conjunction with string similarity functions applied to a table recognition task.

Keywords: Machine learning · Linked Data · Semantic Web

1 Introduction

The Web contains various types of content, e.g. text, pictures, video and audio, as well as tables. Tables are used everywhere on the Web to represent statistical data, sports results, music data and arbitrary lists of parameters. Recent research [2,3] conducted on the Common Crawl census¹ indicated that an average Web page contains at least nine tables. In this research about 12 billion tables were extracted from a billion HTML pages, which demonstrates the popularity of this type of data representation. Tables are a natural way for people to interact with structured data and can provide a comprehensive overview of large amounts of complex information. The prevailing part of structured information on the Web is stored in tables. Nevertheless, we argue that the table is still a neglected content type with regard to processing, extraction and annotation tools.

¹ http://commoncrawl.org/

© Springer International Publishing Switzerland 2015. P. Klinov and D. Mouromtsev (Eds.): KESW 2015, CCIS 518, pp. 48–62, 2015. DOI: 10.1007/978-3-319-24543-0_4

For example, even though there are billions of tables on the Web, search engines are still not able to index them in a way that facilitates data retrieval. The annotation and retrieval of pictures, video and audio data is by now well supported, whereas one of the most widespread content types is still not sufficiently supported. Assuming that an average table contains 50 facts, it is possible to extract more than 600 billion facts from the 12 billion sample tables found in the Common Crawl alone. This is already six times more than the whole Linked Open Data Cloud². Moreover, despite a shift towards semantic annotation (e.g. via RDFa), there will always be plain tables abundantly available on the Web. Wit
