Automatic Table-of-Contents Generation for Efficient Information Access



ORIGINAL RESEARCH

Najah-Imane Bentabet¹ · Rémi Juge¹ · Ismaïl El Maarouf¹ · Dialekti Valsamou-Stanislawski¹ · Sira Ferradans¹

Received: 1 February 2020 / Accepted: 11 August 2020
© Springer Nature Singapore Pte Ltd 2020

Abstract

Purpose  This paper presents a novel neural approach, applicable to any searchable PDF document, that first detects titles and then orders them hierarchically using sequence labelling to automatically generate the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation.

Methods  Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of section titles, their order and their depth.

Results  We offer an exhaustive analysis of the proposed model and evaluate it on French and English documents from the financial domain, which we release to encourage the community's interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments.

Conclusions  The approach described in this paper can easily be adapted to other domains and documents, and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.

Keywords  Table of contents generation · Title detection · Layout structure analysis · PDF · Deep learning · Financial information processing
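The final step implied by the Methods summary, turning titles with an order and a depth into a nested TOC, can be illustrated with a small sketch. This is not the authors' implementation: it assumes title detection and depth labelling have already produced a list of (title, depth) pairs in reading order, and the names `TocNode`, `build_toc` and `render`, as well as the sample titles, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TocNode:
    """One TOC entry; `depth` is 1 for top-level titles (assumed convention)."""
    title: str
    depth: int
    children: List["TocNode"] = field(default_factory=list)


def build_toc(titles: List[Tuple[str, int]]) -> TocNode:
    """Nest (title, depth) pairs, given in reading order, into a tree."""
    root = TocNode(title="ROOT", depth=0)
    stack = [root]  # path from the root to the most recently added title
    for title, depth in titles:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        # Pop until the top of the stack is strictly shallower than the new title.
        while stack[-1].depth >= depth:
            stack.pop()
        node = TocNode(title, depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node: TocNode, indent: int = 0) -> List[str]:
    """Return the TOC as indented lines, two spaces per nesting level."""
    lines: List[str] = []
    for child in node.children:
        lines.append("  " * indent + child.title)
        lines.extend(render(child, indent + 1))
    return lines


# Hypothetical labelled titles, for illustration only.
sample = [
    ("Introduction", 1),
    ("Related Work", 1),
    ("Title Detection", 2),
    ("Method", 1),
]
print("\n".join(render(build_toc(sample))))
```

A stack keeps the path to the current title, so each new title attaches to the nearest preceding title with a smaller depth. If a predicted depth skips a level, the title still attaches to that nearest shallower ancestor rather than raising an error, which is one possible design choice among several.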

Najah-Imane Bentabet and Rémi Juge have contributed equally to this work.

This article is part of the topical collection "Document Analysis and Recognition" guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.

* Ismaïl El Maarouf

¹ Fortia Financial Solutions, 17 Av George V, Paris, France

Introduction

As with many professional domains, Finance conveys most of its policy, regulation, and corporate information through electronic documents first drafted in office suites and then converted to PDF before publication. Documents do not simply expose raw text: significant effort goes into organising their layout. Indeed, layout plays a key role in document understanding, by relating objects to one another (e.g. referencing illustrations in the text), increasing readability (e.g. using spaced paragraphs), assisting navigation (e.g. cross-referencing between sections), and organising content (e.g. summarising with section titles). Document layout is also frequently codified, and mandatory templates are typically created to harmonise publications of the same type and to ensure compliance with regulations. For instance, the French financial authority (AMF) provides a template for financial pr