Identifying Web Tables: Supporting a Neglected Type of Content on the Web
University of Bonn, Bonn, Germany
ITMO University, Saint Petersburg, Russia
Abstract. The abundance of data on the Internet facilitates the improvement of extraction and processing tools. The trend towards open data publishing encourages the adoption of structured formats such as CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contains semantics. For this reason, we propose an approach to derive semantics from web tables, which are still the most popular publishing tool on the Web. The paper also discusses methods and services for unstructured data extraction and processing, as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish and visualize linked open data. The software enables the extraction of tables from various open data sources in HTML format and their automatic export to RDF, making the data linked. The paper also evaluates machine learning techniques in conjunction with string similarity functions for a table recognition task.
Keywords: Machine learning · Linked Data · Semantic Web

1 Introduction
The Web contains various types of content, e.g. text, pictures, video and audio, as well as tables. Tables are used everywhere on the Web to represent statistical data, sports results, music data and arbitrary lists of parameters. Recent research [2,3] conducted on the Common Crawl census¹ indicated that an average Web page contains at least nine tables. In this research about 12 billion tables were extracted from a billion HTML pages, which demonstrates the popularity of this type of data representation. Tables are a natural way for people to interact with structured data and can provide a comprehensive overview of large amounts of complex information. Most structured information on the Web is stored in tables. Nevertheless, we argue that the table is still a neglected content type with regard to processing, extraction and annotation tools.

¹ Web: http://commoncrawl.org/
© Springer International Publishing Switzerland 2015
P. Klinov and D. Mouromtsev (Eds.): KESW 2015, CCIS 518, pp. 48–62, 2015. DOI: 10.1007/978-3-319-24543-0_4
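
To make the notion of "tables per page" concrete, the following minimal Python sketch counts the table elements in a single HTML document using only the standard library. It is an illustration under stated assumptions and not the procedure used in [2,3] or in this paper: the names TableCounter and count_tables are hypothetical, and the sketch counts every table tag, including tables that serve only as page layout and carry no data.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags and distinguishes top-level from nested ones.

    Hypothetical helper for illustration; it does not judge whether a
    table holds data or is used purely for layout.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of currently open <table> elements
        self.total = 0      # every <table> start tag seen so far
        self.top_level = 0  # tables that are not nested in another table

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.total += 1
            if self.depth == 0:
                self.top_level += 1
            self.depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags in malformed markup.
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_tables(html):
    """Return (total tables, top-level tables) found in an HTML string."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.total, parser.top_level
```

Separating top-level from nested tables is a design choice of this sketch: nested tables are frequently layout artifacts, so the two counts can differ considerably on real pages.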
For example, even though there are billions of tables on the Web, search engines are still not able to index them in a way that facilitates data retrieval. The annotation and retrieval of pictures, video and audio data is by now well supported, whereas one of the most widespread content types is still not sufficiently supported. Assuming that a table contains on average 50 facts, it is possible to extract more than 600 billion facts from the 12 billion sample tables found in the Common Crawl alone. This is already six times more than the whole Linked Open Data Cloud². Moreover, despite a shift towards semantic annotation (e.g. via RDFa), there will always be plain tables abundantly available on the Web. Wit