Zone-based keyword spotting in Bangla and Devanagari documents

Ayan Kumar Bhunia 1 & Partha Pratim Roy 2 & Aneeshan Sain 3 & Umapada Pal 4

Received: 30 May 2018 / Revised: 10 October 2019 / Accepted: 7 November 2019
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

In this paper, we present a word spotting system for text lines in offline Indic scripts such as Bangla (Bengali) and Devanagari. It was recently shown that zone-wise recognition yields better word recognition performance than conventional full-word recognition in Indic scripts such as Bangla, Devanagari, and Gurumukhi (Roy et al. in Pattern Recogn 60: 1057–1075, 26; Bhunia et al. in Pattern Recogn 79: 12–31, 6). Inspired by this idea, we adopt the zone segmentation approach and use middle-zone information to improve traditional word spotting performance. To avoid the problems of heuristic zone segmentation, we propose a new HMM-based approach to segment the upper and lower zone components from text line images. Candidate keywords are searched for in a line without segmenting characters or words. We also propose a feature that combines foreground and background information of text line images for keyword spotting with character filler models. Using both foreground and background information gives a significant improvement in performance over using either one alone. The Pyramid Histogram of Oriented Gradient (PHOG) feature is used in our word spotting framework. Experiments show that the proposed zone-segmentation-based system outperforms traditional word spotting approaches.

Keywords Word spotting · Handwritten text recognition · Knowledge extraction · Hidden Markov model
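As a rough illustration of the PHOG feature named in the abstract, the following minimal sketch computes a PHOG-style descriptor over windows slid along a text line image. It is not the authors' implementation. The function names `phog` and `sliding_phog`, the window width, the step, the number of pyramid levels, and the number of orientation bins are all hypothetical choices. The sketch weights every pixel by its gradient magnitude, whereas the usual PHOG formulation works on edge contours, and it does not attempt the foreground/background combination proposed in the paper.

```python
import numpy as np

def phog(window, levels=2, bins=8):
    """PHOG-style descriptor of a 2-D grayscale window.

    Assumption: gradient magnitude at every pixel is used as the vote
    weight (no edge detection step). `levels` and `bins` are illustrative.
    """
    img = window.astype(float)
    gy, gx = np.gradient(img)                    # derivatives along rows, columns
    mag = np.hypot(gx, gy)                       # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)      # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    h, w = img.shape
    feats = []
    for level in range(levels + 1):
        cells = 2 ** level                       # cells x cells grid at this level
        ys = np.linspace(0, h, cells + 1).astype(int)
        xs = np.linspace(0, w, cells + 1).astype(int)
        for i in range(cells):
            for j in range(cells):
                m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                b = bin_idx[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
                # Orientation histogram of the cell, weighted by magnitude.
                feats.append(np.bincount(b, weights=m, minlength=bins))

    v = np.concatenate(feats)
    total = v.sum()
    return v / total if total > 0 else v         # L1-normalise non-empty vectors

def sliding_phog(line_img, win_width=40, step=4, **kwargs):
    """One PHOG vector per window position (width and step are hypothetical)."""
    _, w = line_img.shape
    return np.array([phog(line_img[:, x:x + win_width], **kwargs)
                     for x in range(0, w - win_width + 1, step)])
```

With these defaults, a 32 x 100 line image produces 16 window positions, each described by 8 x (1 + 4 + 16) = 168 values. A sequence of this kind is the usual input to an HMM-based spotting model.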

* Ayan Kumar Bhunia [email protected]

1 Department of ECE, Institute of Engineering & Management, Kolkata, India

2 Department of CSE, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand, India

3 Department of EE, Institute of Engineering & Management, Kolkata, India

4 CVPR Unit, Indian Statistical Institute, Kolkata, India

Multimedia Tools and Applications

1 Introduction

Handwritten text recognition is one of the most challenging problems in pattern recognition. Because of the free-flowing nature of handwriting and its many writing variations, recognition performance remains unsatisfactory even with sophisticated pre-processing and OCR techniques. When processing such handwritten documents, word spotting [20] techniques are useful for finding possible instances of specific query words. Searching by word spotting does not require OCR of the entire document. Writing distortion causes little trouble in retrieving similar target words, because these approaches do not recognize either the characters of the query word or the query word itself. Instead, features are extracted from the whole word, and the methods look for similar features in the target images. One of the drawbacks of these methods is that these re