NN-based analytic approach to symbol level recognition for degraded Bengali printed documents

Sådhanå (2020)45:263 https://doi.org/10.1007/s12046-020-01492-1

JAYATI MUKHERJEE1,*, SWAPAN K PARUI2 and UTPAL ROY1

1 Department of Computer and System Sciences, Visva Bharati, Santiniketan, India
2 Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
e-mail: [email protected]; [email protected]; [email protected]

MS received 25 October 2019; revised 21 July 2020; accepted 9 August 2020

Abstract. Analysis of degraded printed documents has been a research topic for the last several years. The contribution of this article lies in the segmentation of word images into symbols and the recognition of those symbols in degraded printed document images of Bengali, the 7th most popular language in the world. A novel approach to symbol level segmentation based on a Multilayer Perceptron (MLP) network is proposed. A database of segmenting and non-segmenting image columns is developed from the ISIDDI page level database, and segmentation is treated as a two-class classification problem. The MLP weights are learnt from this database using the back propagation algorithm. We have introduced certain new metrics, based on which the F-score of the proposed segmentation algorithm is determined. Our method utilizes information that is relevant for character segmentation and ignores other highly variable information contained in a printed text document, thus allowing for efficient transfer learning between datasets and alleviating the need for labelled training data. Other than Bengali, we have tested on English, Tamil and Devnagari scripts. For classification we have identified 336 symbols, and the corresponding training and test sets have been developed from the ISIDDI database. Two classifiers, one CNN based and the other LSTM based, have been developed for this 336-class problem. The classification accuracies obtained on the test set by the CNN classifier and the LSTM classifier are 86.05% and 88.11%, respectively. The proposed classifiers outperform the existing classifiers for the ISIDDI database.

Keywords. Degraded document processing; Neural network; Analytic approach; Transfer learning.
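The column-wise formulation described in the abstract, in which each image column of a word is classified as segmenting or non-segmenting, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the window-based feature (`column_windows`), the single hidden layer with untrained random weights, the 0.5 threshold, and the rule that collapses a run of segmenting columns into its middle column. It is not the authors' network or their evaluation procedure.

```python
import numpy as np


def column_windows(word_img, half_width=2):
    """Build one feature vector per image column (hypothetical feature choice):
    the flattened pixels of a window of 2*half_width + 1 columns centred on it."""
    _, width = word_img.shape
    # Zero-pad left and right so border columns also get a full window.
    padded = np.pad(word_img, ((0, 0), (half_width, half_width)), mode="constant")
    span = 2 * half_width + 1
    return np.stack([padded[:, c:c + span].ravel() for c in range(width)])


def mlp_forward(x, w1, b1, w2, b2):
    """One-hidden-layer MLP: returns P(column is segmenting) for each row of x."""
    hidden = np.tanh(x @ w1 + b1)
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))


def cut_positions(probs, threshold=0.5):
    """Collapse each run of consecutive segmenting columns into its middle column."""
    cuts, start = [], None
    for c, p in enumerate(probs):
        if p >= threshold and start is None:
            start = c                            # a run of segmenting columns begins
        elif p < threshold and start is not None:
            cuts.append((start + c - 1) // 2)    # run ended at column c - 1
            start = None
    if start is not None:                        # run reaches the right border
        cuts.append((start + len(probs) - 1) // 2)
    return cuts


# Toy usage with random, untrained weights (illustration only).
rng = np.random.default_rng(0)
word = rng.random((4, 6))                        # 4 rows x 6 columns "word image"
features = column_windows(word, half_width=1)    # shape (6, 12)
w1, b1 = rng.normal(size=(12, 8)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0
column_probs = mlp_forward(features, w1, b1, w2, b2)
example_cuts = cut_positions([0.1, 0.9, 0.8, 0.9, 0.2, 0.1, 0.7])
```

In a real system the weights would be learnt by back propagation from labelled segmenting and non-segmenting columns, as the abstract states; the sketch only shows how per-column decisions turn into cut positions.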

1. Introduction

Some old documents, particularly those from the 1960s and 1970s, are degrading day by day due to unavoidable causes. For the preservation of our cultural heritage, OCR (Optical Character Recognition) of such degraded old documents is an important research problem. Digitizing these documents will allow us to edit them electronically, run context-based searches on their content and store them for easy document management. Beyond this, applications in other fields such as machine translation, text-to-speech and text mining also require automatic recognition of degraded Bengali documents. Rigorous research on OCR has been going on for a few decades, not only for languages like Greek [1, 2], Latin [3], Chinese [4, 5], Arab