DeepTable: a permutation invariant neural network for table orientation classification

Maryam Habibi¹ · Johannes Starlinger¹ · Ulf Leser¹

Received: 13 September 2019 / Accepted: 17 August 2020 © The Author(s) 2020

Abstract Tables are a common way to present information in an intuitive and concise manner. They are used extensively in media such as scientific articles or web pages. Automatically analyzing the content of tables poses special challenges. One of the most basic tasks is determining the orientation of a table: in column tables, each column represents one entity, with its attribute values given in the different rows; in row tables, the roles of rows and columns are reversed; and matrix tables give information on pairs of entities. In this paper, we address the problem of classifying a given table into one of three layouts: horizontal (for row tables), vertical (for column tables), and matrix. We describe DeepTable, a novel method based on deep neural networks designed for learning from sets. In contrast to previous state-of-the-art methods, this basis makes DeepTable invariant to the permutation of rows or columns, which is a highly desirable property because in most tables the order of rows and columns does not carry specific information. We evaluate our method using a silver standard corpus of 5500 tables extracted from biomedical articles, where the layout was determined heuristically. DeepTable outperforms previous methods in both precision and recall on our corpus. In a second evaluation, we manually labeled a corpus of 300 tables and confirmed that DeepTable reaches superior performance in the table layout classification task. The code and resources introduced here are available at https://github.com/Marhabibi/DeepTable.

Keywords Information discovery · Tabular data · Table orientation classification · Deep learning · Machine learning
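The permutation-invariance property named in the abstract can be illustrated with a small sketch. This is not the DeepTable architecture: it is a toy set encoder that uses sum pooling, and the names `phi`, `pool`, `encode_lines`, and `encode_table`, the hand-written cell features, and the table values are all hypothetical choices made for illustration. It only shows why pooling over sets makes an encoding independent of row and column order while still distinguishing a table from its transpose.

```python
from typing import List, Sequence

Table = Sequence[Sequence[str]]


def phi(cell: str) -> List[int]:
    """Hypothetical per-cell features: [is_numeric, text_length]."""
    text = cell.strip()
    is_numeric = int(text.replace(".", "", 1).isdigit())
    return [is_numeric, len(text)]


def pool(vectors: Sequence[Sequence[int]]) -> List[int]:
    """Sum pooling: the result does not depend on the order of `vectors`."""
    return [sum(component) for component in zip(*vectors)]


def encode_lines(lines: Table) -> List[int]:
    """Encode a set of lines (rows or columns), each treated as a set of cells."""
    line_codes = []
    for line in lines:
        pooled = pool([phi(cell) for cell in line])
        # Non-linearity between the two pooling steps, so that the row view
        # and the column view of the same table can differ.
        line_codes.append([value * value for value in pooled])
    return pool(line_codes)


def transpose(table: Table) -> List[List[str]]:
    """Swap rows and columns."""
    return [list(column) for column in zip(*table)]


def encode_table(table: Table) -> List[int]:
    """Concatenate the row-view and column-view encodings."""
    return encode_lines(table) + encode_lines(transpose(table))


# Toy table with made-up values (assumption, not taken from the paper's corpus).
table = [
    ["Gene", "Length", "GC"],
    ["BRCA1", "81189", "0.42"],
    ["TP53", "19149", "0.48"],
]
rows_shuffled = [table[2], table[0], table[1]]
columns_shuffled = [[row[i] for i in (2, 0, 1)] for row in table]

print(encode_table(table) == encode_table(rows_shuffled))      # True
print(encode_table(table) == encode_table(columns_shuffled))   # True
print(encode_table(table) == encode_table(transpose(table)))   # False
```

Shuffling rows or columns leaves the encoding unchanged, whereas transposing the table swaps the row-view and column-view parts, which is the kind of signal an orientation classifier could build on. In a learned model, the fixed features and the squaring step would be replaced by trained network layers.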

Responsible editor: Johannes Fürnkranz.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s10618-020-00711-x) contains supplementary material, which is available to authorized users.

Ulf Leser [email protected] · Maryam Habibi [email protected]

¹ Humboldt Universität zu Berlin, Berlin, Germany

M. Habibi et al.

1 Introduction

An ever-growing amount of information is managed and exchanged in digitized form, especially in the form of text (web pages, scientific articles, reports, newspapers, books, etc.). Efficient and effective management of such large collections of text depends on the creation of computational tools for their automated analysis. A particularly important type of digital information that has so far received comparatively little attention in research is the table. Tables are a universal and intuitive means of structuring information in two dimensions with a high density of information. They are used extensively in scientific articles, business reports, product descriptions, web pages, etc. Nevertheless, tables were only recently “discovered” as important first-class objects in research, mostl