OCR error correction using correction patterns and self-organizing migrating algorithm
THEORETICAL ADVANCES
Quoc‑Dung Nguyen (1,4) · Duc‑Anh Le (2,5) · Nguyet‑Minh Phan (3) · Ivan Zelinka (4)
Received: 12 October 2019 / Accepted: 29 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract
Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques result in different kinds of errors in OCR outputs. Post-processing is an essential step in improving the output quality of OCR systems by detecting and correcting these errors. In this paper, we present an automatic model consisting of both error detection and error correction phases for OCR post-processing. We propose a novel approach to OCR post-processing error correction that uses correction pattern edits and an evolutionary algorithm, a class of methods that has mainly been used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which involves all types of edit operations and is learned directly from the training dataset. With efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches when evaluated on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition.
Keywords OCR · N-grams · Similarity · Context · Correction pattern · Evolutionary algorithm
* Quoc‑Dung Nguyen · Duc‑Anh Le · Nguyet‑Minh Phan · Ivan Zelinka
1 Van Lang University, 45 Nguyen Khac Nhu, Co Giang Ward, District 1, Ho Chi Minh City, Vietnam
2 Center for Open Data in the Humanities, Tokyo 101‑8430, Japan
3 University of Information Technology, Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City, Vietnam
4 Department of Computer Science, FEECS VŠB - Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava‑Poruba, Czech Republic
5 NTT Hi‑Tech Institute, Nguyen Tat Thanh University, 300A Nguyen Tat Thanh, District 4, Ho Chi Minh City, Vietnam
1 Introduction
Optical character recognition is the process of transforming typed, handwritten or printed text from scanned documents or images into digital text using various image processing and pattern recognition techniques [14, 21, 22, 48]. Given the enormous volume of paper-based historical archives, there is a crucial need to digitize paper-based books, articles and documents into electronic versions with the help of OCR systems. However, the OCR process often introduces misspellings and linguistic errors into OCR-generated texts because of misrecognized characters and falsely identified scanned text, especially for degraded historical docum