Entity Extraction and Correction Based on Token Structure Model Generation


1 Tunis el Manar University, FST, Tunis, Tunisia
2 Sfax University, ENIS, Sfax, Tunisia

Abstract. Logical and semantic structure analysis is a basic process for invoice understanding. Carrying out a robust layout analysis is very difficult because invoice templates are highly heterogeneous. In this paper, we propose a local structure for entity extraction and correction from scanned invoices. It attempts to extract entities in contiguous and noncontiguous structures by automatically finding the local structure of each entity, without structure model matching or user intervention. First, the entities are labeled in the OCRed invoice. Token structure models are then generated by combining the labeled entities with geometric and semantic relations. These models are used for entity extraction and mislabeling correction, ignoring superfluous tokens detected by the labeling step. The correction module for the contiguous structure differs from that for the noncontiguous structure. Results obtained on a dataset of real invoices are reported in the experimental section.

Keywords: Contextual search · Contiguous and noncontiguous structure · Mislabeling correction · Token structure models

1 Introduction

According to [1], automatic document processing covers three main categories: doctype classification, data capture (functional role labeling), and document sets. Doctype classification assigns a document image to a prestored template. Data capture is the extraction of relevant, human-understandable information from a document image. The document sets category relates documents and their contents according to business logic. In this paper, we focus on automatic data capture from invoices regardless of their high geometric variation. Figure 1 shows examples of entities in contiguous (Fig. 1(a)) and noncontiguous (Fig. 1(b)) structure. It illustrates how closeness, direction and graphical elements may differ, across invoices, in the way Reference Words (RWs), e.g., “FACTURE No”, “Date”, “Net à payer”, are associated with Key Fields (KFs), e.g., “006651”, “22/08/2015”, “228 276.300”, within an entity.

© Springer International Publishing AG 2016. A. Robles-Kelly et al. (Eds.): S+SSPR 2016, LNCS 10029, pp. 401–411, 2016. DOI: 10.1007/978-3-319-49055-7_36


N. Rahal et al.

Fig. 1. Sample entities showing the diversity of layout styles used in invoices. (a) Entity in contiguous structure. (b) Entity in noncontiguous structure.
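To make the notions of closeness and direction between an RW and its KF concrete, the following is a minimal sketch, not the method proposed in this paper. It assumes each OCR token carries a bounding box and a label, and it pairs every RW with the nearest KF found either to its right on the same text line or directly below it. The `Token` class, the label names "RW" and "KF", the half-height line tolerance and the coordinates in `demo_tokens` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """A labeled OCR token with its bounding box (hypothetical structure)."""
    text: str
    label: str  # "RW" (Reference Word) or "KF" (Key Field); assumed label names
    x: float    # left edge
    y: float    # top edge
    w: float    # width
    h: float    # height

    @property
    def cy(self) -> float:
        """Vertical center of the box."""
        return self.y + self.h / 2


def direction(rw: Token, kf: Token) -> Optional[str]:
    """Return 'right' if kf lies on rw's text line to its right,
    'below' if kf lies under rw with horizontal overlap, else None."""
    # Assumed tolerance: centers within half the taller box height.
    same_line = abs(rw.cy - kf.cy) <= 0.5 * max(rw.h, kf.h)
    if same_line and kf.x >= rw.x + rw.w:
        return "right"
    overlap = min(rw.x + rw.w, kf.x + kf.w) - max(rw.x, kf.x)
    if overlap > 0 and kf.y >= rw.y + rw.h:
        return "below"
    return None


def pair_entities(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Pair each RW with its closest KF in an admissible direction."""
    rws = [t for t in tokens if t.label == "RW"]
    kfs = [t for t in tokens if t.label == "KF"]
    pairs = []
    for rw in rws:
        best = None  # (gap, kf, direction)
        for kf in kfs:
            d = direction(rw, kf)
            if d is None:
                continue
            # Gap between the facing edges of the two boxes.
            gap = kf.x - (rw.x + rw.w) if d == "right" else kf.y - (rw.y + rw.h)
            if best is None or gap < best[0]:
                best = (gap, kf, d)
        if best is not None:
            pairs.append((rw.text, best[1].text, best[2]))
    return pairs


# Illustrative tokens; the texts come from the examples above,
# the coordinates are invented.
demo_tokens = [
    Token("FACTURE No", "RW", 10, 10, 80, 12),
    Token("006651", "KF", 100, 10, 50, 12),
    Token("Date", "RW", 10, 40, 30, 12),
    Token("22/08/2015", "KF", 8, 60, 70, 12),
]
print(pair_entities(demo_tokens))
```

In this toy layout the invoice number is read to the right of its RW while the date is read below it, which is the kind of variation in direction that Fig. 1 illustrates. A real system would also need to handle graphical separators and mislabeled tokens, which this sketch ignores.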

In this context, several works, such as [2], learn a local structure layout from training documents and reuse it to extract the fields in test documents. The weakness of such work is that it requires human intervention to label the semantic fields. The authors of [3,4] propose to correct mislabeling by adding the missing labels. However, they require highly regular structures as well as the automatic blocks and segments obtained by OCR (Optical Character Recognition). Also, the mislabeling correction is based on matching a
