Automatic Table-of-Contents Generation for Efficient Information Access
ORIGINAL RESEARCH
Najah‑Imane Bentabet¹ · Rémi Juge¹ · Ismaïl El Maarouf¹ · Dialekti Valsamou‑Stanislawski¹ · Sira Ferradans¹
Received: 1 February 2020 / Accepted: 11 August 2020
© Springer Nature Singapore Pte Ltd 2020
Abstract
Purpose This paper presents a novel neural approach, applicable to any searchable PDF document, that first detects titles and then orders them hierarchically using sequence labelling to automatically generate the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation.
Methods Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of section titles, their order and their depth.
Results We offer an exhaustive analysis of the proposed model and evaluate it on French and English documents from the financial domain, which we release to increase the community's interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments.
Conclusions The approach described in this paper can easily be adapted to other domains and documents, and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.
Keywords Table of contents generation · Title detection · Layout structure analysis · PDF · Deep learning · Financial information processing
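To make the second stage concrete, here is a minimal sketch of how a sequence of detected titles, each already labelled with a depth, could be nested into a TOC. It is an illustration only, not the authors' model: it assumes the title detector and the sequence labeller have already run, and the names `TocNode`, `build_toc` and `render` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TocNode:
    title: str
    depth: int
    children: list["TocNode"] = field(default_factory=list)


def build_toc(labelled_titles):
    """Nest (title, depth) pairs, given in reading order, into a tree.

    Depths are assumed to start at 1 for top-level sections.
    """
    root = TocNode(title="<root>", depth=0)
    stack = [root]  # path from the root to the most recently added title
    for title, depth in labelled_titles:
        if depth < 1:
            raise ValueError(f"depth must be >= 1, got {depth} for {title!r}")
        # Pop until the top of the stack is strictly shallower than the new title.
        while stack[-1].depth >= depth:
            stack.pop()
        node = TocNode(title=title, depth=depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, indent=0):
    """Return the TOC as indented lines, one per title."""
    lines = []
    for child in node.children:
        lines.append("  " * indent + child.title)
        lines.extend(render(child, indent + 1))
    return lines
```

A title whose predicted depth skips a level (for example depth 3 directly after depth 1) is simply attached to the nearest shallower title; a real system might prefer to flag such cases instead.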
Najah‑Imane Bentabet and Rémi Juge have contributed equally to this work.
This article is part of the topical collection "Document Analysis and Recognition" guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.
* Ismaïl El Maarouf
¹ Fortia Financial Solutions, 17 Av George V, Paris, France

Introduction
As with many professional domains, Finance conveys most of its policy, regulation, and corporate information through electronic documents that are first drafted in office suites and then converted to PDF before publication. Documents do not simply expose raw text; significant effort goes into organising their layout. Indeed, layout plays a key role in document understanding by relating objects to one another (e.g. referencing illustrations in the text), increasing readability (e.g. using spaced paragraphs), assisting navigation (e.g. cross-referencing between sections), and organising content (e.g. summarising with section titles). Document layout is also frequently codified, and mandatory templates are typically created to harmonise publications of the same type and to ensure compliance with regulations. For instance, the French financial authority (AMF) provides a template for financial pr