Automatic Table-of-Contents Generation for Efficient Information Access


ORIGINAL RESEARCH

Najah-Imane Bentabet1 · Rémi Juge1 · Ismaïl El Maarouf1 · Dialekti Valsamou-Stanislawski1 · Sira Ferradans1

Received: 1 February 2020 / Accepted: 11 August 2020

© Springer Nature Singapore Pte Ltd 2020

Abstract

Purpose This paper presents a novel neural approach, applicable to any searchable PDF document, that first detects the titles and then orders them hierarchically using sequence labelling to automatically generate the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation.

Methods Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of section titles, their order and their depth.

Results We offer an exhaustive analysis of the proposed model and evaluate it on French and English documents from the financial domain, which we release to increase the community’s interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments.

Conclusions The approach described in this paper can easily be adapted to other domains and documents, and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.

Keywords Table of contents generation · Title detection · Layout structure analysis · PDF · Deep learning · Financial information processing
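To make the final step concrete, the following is a minimal sketch of how a sequence of titles, each already labelled with a depth, could be nested into a TOC. It is not the authors' neural model: title detection and depth prediction are assumed to have happened upstream, and the names `TocEntry`, `build_toc` and `render`, as well as the sample titles, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TocEntry:
    """One TOC node: a title, its predicted depth, and its sub-entries."""
    title: str
    depth: int
    children: list[TocEntry] = field(default_factory=list)


def build_toc(labelled_titles):
    """Nest (title, depth) pairs, given in reading order, into a tree.

    Depths are assumed to start at 1 for top-level sections.
    """
    root = TocEntry(title="<root>", depth=0)
    stack = [root]  # path from the root to the most recent entry
    for title, depth in labelled_titles:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        # Pop until the top of the stack is strictly shallower,
        # so that it is the parent of the new entry.
        while stack[-1].depth >= depth:
            stack.pop()
        entry = TocEntry(title, depth)
        stack[-1].children.append(entry)
        stack.append(entry)
    return root.children


def render(entries, indent=0):
    """Return the TOC as indented lines, two spaces per nesting level."""
    lines = []
    for entry in entries:
        lines.append("  " * indent + entry.title)
        lines.extend(render(entry.children, indent + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical labelled titles, not taken from the paper's datasets.
    sample = [("Introduction", 1), ("Scope", 2), ("Methods", 1),
              ("Data", 2), ("Sources", 3), ("Results", 1)]
    print("\n".join(render(build_toc(sample))))
```

A stack is used because each title's parent is simply the most recent title with a smaller depth, so a single pass over the labelled sequence is enough.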

Najah-Imane Bentabet and Rémi Juge have contributed equally to this work.

This article is part of the topical collection “Document Analysis and Recognition” guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.

1 Fortia Financial Solutions, 17 Av George V, Paris, France

Introduction

As with many professional domains, finance conveys most of its policy, regulation, and corporate information through electronic documents that are first drafted in office suites and then converted to PDF before publication. Documents do not simply expose raw text: significant effort goes into organising their layout. Indeed, layout plays a key role in document understanding, by relating objects to one another (e.g. referencing illustrations in the text), increasing readability (e.g. using spaced paragraphs), assisting navigation (e.g. cross-referencing between sections), and organising content (e.g. summarising with section titles). Document layout is also frequently codified, and mandatory templates are typically created to harmonise publications of the same type and to ensure compliance with regulations. For instance, the French financial authority (AMF) provides a template for financial pr
