OCR error correction using correction patterns and self-organizing migrating algorithm

THEORETICAL ADVANCES

Quoc-Dung Nguyen (1,4) · Duc-Anh Le (2,5) · Nguyet-Minh Phan (3) · Ivan Zelinka (4)

Received: 12 October 2019 / Accepted: 29 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques result in different kinds of errors in OCR outputs. Post-processing is an essential step in improving the output quality of OCR systems by detecting and cleaning these errors. In this paper, we present an automatic model for OCR post-processing that consists of both an error detection phase and an error correction phase. We propose a novel approach to OCR post-processing error correction that uses correction pattern edits and an evolutionary algorithm, a class of methods that has mainly been used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. With efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches when evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.

Keywords OCR · N-grams · Similarity · Context · Correction pattern · Evolutionary algorithm
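To make the idea of a table of correction pattern edits more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that training data is available as (OCR token, ground-truth token) pairs, aligns each pair with Python's standard `difflib.SequenceMatcher`, and counts the resulting substitutions, insertions and deletions. The function names `learn_correction_patterns` and `generate_candidates`, the toy training pairs, and the single-edit candidate generation are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_correction_patterns(pairs):
    """Count (ocr_fragment, truth_fragment) edits over aligned training pairs."""
    patterns = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # An empty string on one side marks an insertion or a deletion.
                patterns[(ocr[i1:i2], truth[j1:j2])] += 1
    return patterns


def generate_candidates(token, patterns):
    """Apply each learned pattern once, at every position where it matches."""
    candidates = set()
    for (wrong, right), _count in patterns.items():
        if not wrong:
            continue  # pure insertions are skipped in this simplified sketch
        start = token.find(wrong)
        while start != -1:
            candidates.add(token[:start] + right + token[start + len(wrong):])
            start = token.find(wrong, start + 1)
    return candidates


# Hypothetical toy training pairs: (OCR output, ground truth).
training_pairs = [("tbe", "the"), ("rnodern", "modern"), ("wh1ch", "which")]
table = learn_correction_patterns(training_pairs)
print(generate_candidates("rnay", table))  # {'may'}
```

In this sketch each candidate results from a single edit. Combining several edits per token and ranking the candidates with linguistic features is outside its scope.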

* Quoc-Dung Nguyen [email protected]
  Duc-Anh Le [email protected]
  Nguyet-Minh Phan [email protected]
  Ivan Zelinka [email protected]

1 Van Lang University, 45 Nguyen Khac Nhu, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
2 Center for Open Data in the Humanities, Tokyo 101-8430, Japan
3 University of Information Technology, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
4 Department of Computer Science, FEECS VŠB - Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic
5 NTT Hi-Tech Institute, Nguyen Tat Thanh University, 300A Nguyen Tat Thanh, District 4, Ho Chi Minh City, Vietnam

1 Introduction

Optical character recognition is the process of transforming typed, handwritten or printed text from scanned documents or images into digital text using various image processing and pattern recognition techniques [14, 21, 22, 48]. Given the enormous volume of paper-based historical archives, there is a crucial need to digitize paper-based books, articles and documents into electronic versions with the help of OCR systems. However, the OCR process often introduces misspellings and linguistic errors into OCR-generated texts due to misrecognized characters and falsely identified scanned texts, especially for degraded historical docum
