Entity Extraction and Correction Based on Token Structure Model Generation
1 Tunis el Manar University, FST, Tunis, Tunisia [email protected]
2 Sfax University, ENIS, Sfax, Tunisia [email protected], [email protected]
Abstract. Logical and semantic structure analysis is a basic step in invoice understanding. Carrying out a robust layout analysis is very difficult because invoice templates are highly heterogeneous. In this paper, we propose a local structure for entity extraction and correction from scanned invoices. It attempts to extract entities in contiguous and noncontiguous structure by automatically finding the local structure of each entity, without structure model matching or user intervention. First, the entities are labeled in the OCRed invoice. Token structure models are then generated by combining the labeled entities with geometric and semantic relations. These models are used for entity extraction and mislabeling correction, ignoring superfluous tokens detected by the labeling step. The correction module for the contiguous structure differs from that for the noncontiguous structure. Results obtained on a dataset of real invoices are reported in the experimental section.
Keywords: Contextual search · Contiguous and noncontiguous structure · Mislabeling correction · Token structure models
1 Introduction
According to [1], automatic document processing covers three main categories: doctype classification, data capture (functional role labeling), and document sets. Doctype classification assigns a document image to a prestored template. Data capture is the extraction of relevant, human-understandable information from a document image. The document sets category relates documents and their contents according to business logic. In this paper, we focus on automatic data capture from invoices regardless of their high geometric variation. Figure 1 shows some examples of entities in contiguous (Fig. 1(a)) and noncontiguous (Fig. 1(b)) structure. It illustrates how closeness, direction and graphical elements may differ, from one invoice to another, in the way Reference Words (RWs), e.g., “FACTURE No”, “Date”, “Net à payer”, are associated with Key Fields (KFs), e.g., “006651”, “22/08/2015”, “228 276.300”, within an entity.
© Springer International Publishing AG 2016. A. Robles-Kelly et al. (Eds.): S+SSPR 2016, LNCS 10029, pp. 401–411, 2016. DOI: 10.1007/978-3-319-49055-7_36
N. Rahal et al.
Fig. 1. Samples of entities showing the diversity of layout styles used in invoices. (a) Entity in contiguous structure. (b) Entity in noncontiguous structure.
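To make the RW/KF association concrete, the following is a minimal sketch, not the method proposed in this paper. It labels OCR text segments as Reference Words or Key Fields with simple patterns and pairs each RW with the nearest KF found to its right or below it. The `Token` type, the two regular expressions, the tolerance values and the coordinates are all assumptions chosen only to cover the examples quoted above.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Token:
    """An OCR text segment with the top-left corner of its bounding box."""
    text: str
    x: float
    y: float


# Hypothetical patterns: they only cover the example strings quoted in the text.
RW_PATTERN = re.compile(r"^(facture\s*no|date|net\s*[aà]\s*payer)$", re.IGNORECASE)
KF_PATTERN = re.compile(r"^(\d{2}/\d{2}/\d{4}|\d[\d .]*\d)$")


def label(token):
    """Return 'RW', 'KF' or 'O' (other) for a token."""
    text = token.text.strip()
    if RW_PATTERN.match(text):
        return "RW"
    if KF_PATTERN.match(text):
        return "KF"
    return "O"


def pair_entities(tokens, line_tol=5.0, col_tol=40.0):
    """Pair each RW with its nearest KF located to the right or below.

    line_tol and col_tol are assumed tolerances, in the same units as the
    token coordinates, for 'same line' and 'same column'.
    """
    rws = [t for t in tokens if label(t) == "RW"]
    kfs = [t for t in tokens if label(t) == "KF"]
    pairs = []
    for rw in rws:
        best = None
        for kf in kfs:
            dx, dy = kf.x - rw.x, kf.y - rw.y
            if abs(dy) <= line_tol and dx > 0:
                direction = "right"
            elif dy > 0 and abs(dx) <= col_tol:
                direction = "below"
            else:
                continue  # KF is neither on the same line nor in the same column
            dist = math.hypot(dx, dy)
            if best is None or dist < best[0]:
                best = (dist, direction, kf)
        if best is not None:
            pairs.append((rw.text, best[2].text, best[1]))
    return pairs


# Illustrative layout with made-up coordinates.
tokens = [
    Token("FACTURE No", 10, 10), Token("006651", 120, 10),
    Token("Date", 200, 40), Token("22/08/2015", 202, 60),
    Token("Net à payer", 10, 200), Token("228 276.300", 300, 201),
]
for rw, kf, direction in pair_entities(tokens):
    print(f"{rw!r} -> {kf!r} ({direction})")
```

A fixed rule of this kind depends on hand-tuned tolerances and a closed list of directions, which is the sort of rigidity that becomes a problem once closeness, direction and graphical elements vary from one invoice to the next.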
In this context, many earlier works, like [2], learn a local layout structure from a training document and reuse it to extract the fields in the test document. The weakness of such work is that it requires human intervention to label semantic fields. The authors of [3,4] propose to correct mislabeling by adding the missing labels. However, their approach requires highly regular structures and relies on the blocks and segments produced automatically by OCR (Optical Character Recognition). Also, the mislabeling correction is based on matching a