Learning cell embeddings for understanding table layouts
Majid Ghasemi-Gol¹ · Jay Pujara¹ · Pedro Szekely¹
Received: 13 February 2020 / Revised: 12 August 2020 / Accepted: 17 August 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract
There is a large amount of data on the web in tabular form, such as Excel sheets, CSV files, and web tables. Often, tabular data is meant for human consumption, using data layouts that are difficult for machines to interpret automatically. Previous work uses the stylistic features of tabular cells (such as font size, border type, and background color) to classify tabular cells by their role in the data layout of the document (top attribute, data, metadata, etc.). In this paper, we propose a deep neural network model that embeds semantic and contextual information about tabular cells in a low-dimensional cell embedding space. We pre-train this cell embedding model on a large corpus of tabular documents from various domains. We then propose a classification technique based on recurrent neural networks (RNNs) that combines our pre-trained cell embeddings with the stylistic features introduced in previous work, in order to improve the performance of cell type classification in complex documents. We evaluate the performance of our system on three datasets containing documents with various data layouts, in two settings: in-domain and cross-domain training. Our evaluation results show that our proposed cell vector representations, in combination with our RNN-based classification technique, significantly improve cell type classification performance.

Keywords Tabular data · Table layout · Cell embeddings · Representation learning · Cell classification · Semi-supervised learning
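To make the pipeline described in the abstract more concrete, the following is a minimal sketch of how a per-cell embedding vector might be concatenated with a stylistic feature vector and passed through a simple recurrent layer that emits a role distribution for each cell in a row. It is an illustration under stated assumptions, not the architecture used in this paper: the plain tanh recurrence, the dimensions (8, 3, 16), the random untrained weights, and all function names are hypothetical, and only the three role names come from the abstract.

```python
import math
import random

# Three of the cell roles named in the abstract; the full label set is not assumed here.
ROLES = ["top attribute", "data", "metadata"]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_params(d_in, d_hidden, n_roles, seed=0):
    """Hypothetical, untrained weights for a plain tanh recurrent layer."""
    rng = random.Random(seed)
    return {
        "W_xh": random_matrix(d_hidden, d_in, rng),
        "W_hh": random_matrix(d_hidden, d_hidden, rng),
        "W_hy": random_matrix(n_roles, d_hidden, rng),
    }


def classify_row(embeddings, style_feats, params):
    """Return one probability distribution over ROLES per cell in a row."""
    h = [0.0] * len(params["W_hh"])
    probs = []
    for emb, style in zip(embeddings, style_feats):
        # Combine the (assumed pre-trained) embedding with stylistic features.
        x = list(emb) + list(style)
        a = matvec(params["W_xh"], x)
        b = matvec(params["W_hh"], h)
        # The hidden state carries context from the cells seen so far in the row.
        h = [math.tanh(u + v) for u, v in zip(a, b)]
        probs.append(softmax(matvec(params["W_hy"], h)))
    return probs


# Toy inputs: 5 cells, 8-dim stand-in embeddings, 3 binary style flags per cell.
rng = random.Random(1)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
style_feats = [[float(rng.randint(0, 1)) for _ in range(3)] for _ in range(5)]

params = init_params(d_in=8 + 3, d_hidden=16, n_roles=len(ROLES))
probs = classify_row(embeddings, style_feats, params)
predicted = [ROLES[max(range(len(ROLES)), key=p.__getitem__)] for p in probs]
print(predicted)
```

Because the weights are random, the printed roles carry no meaning; the sketch only shows the data flow, in which the embedding and the stylistic features enter the classifier as a single concatenated input per cell.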
¹ Information Science Institute, University of Southern California, Marina Del Rey, CA 90292, USA

1 Introduction

A vast amount of useful data is available in structured tabular formats, such as spreadsheets, comma-separated value files, and web tables. Tabular data is represented in a structured form following established principles of data organization [11,34]. However, understanding such data can be cognitively challenging for humans, and automated techniques for table understanding still struggle to parse arbitrary datasets. Tabular data covers many different domains and subjects and is expressed in formats that include hierarchical relationships (e.g., Fig. 1c) and concatenation of disparate data (e.g., Fig. 1b). One useful step toward understanding tabular data is to identify elements of tabular data layout by understanding the role of each tabular cell in the data layout of the tabular document. There are different definitions and terminologies used for different roles in tabular data layouts in the literature [8,22,33]. We combine the terminologies and definitions introduced by Chen et al. [8] and Koci et al. [22], which suggest that there are six major cel