Optical character recognition with neural networks and post-correction with finite state methods


ORIGINAL PAPER

Senka Drobac¹ · Krister Lindén¹

Received: 12 December 2019 / Revised: 9 June 2020 / Accepted: 4 August 2020 © The Author(s) 2020

Abstract The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is rather low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed language model for the entire corpus and finding a voting setup that further improves the results.

Keywords OCR · Historical periodicals · Finnish · Swedish
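The character error rate (CER) figures quoted above are conventionally obtained by dividing the character-level edit distance between the OCR output and the ground truth by the length of the ground truth. The following is a minimal sketch of that conventional definition, not the evaluation script used in this study; the function names `edit_distance` and `cer` are illustrative, and we assume unit cost for insertions, deletions and substitutions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of 8% corresponds to roughly eight character edits per hundred ground-truth characters. Note that the value can exceed 1.0 when the OCR output contains many spurious insertions.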

¹ University of Helsinki, Helsinki, Finland

1 Introduction

The OCR of historical newspapers published in Finland 1771–1929 is of unsatisfactory quality. The entire corpus (https://digi.kansalliskirjasto.fi) has been recognized with ABBYY FineReader 11 and presents a character error rate between 8 and 13%. This error rate is rather high for meaningful and reliable scientific research on this data set, so there is a need to re-OCR the entire corpus. OCRing the corpus is difficult because it contains very diverse data written in a non-standard language. Newspapers in Finland from the eighteenth to the early twentieth century were printed in the two main languages of Finland (Finnish and Swedish) using two font families (Blackletter and Antiqua) with a large variety of fonts. Also, the data are not evenly distributed. In earlier data, there is more material printed in Swedish with Blackletter fonts, whereas the later data are mostly printed in Finnish with Antiqua fonts. However, there are periods when both languages and both font families were used
