Dynamic Lexicon Generation for Natural Scene Images




CVIT IIIT, Hyderabad, India [email protected]
Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, Spain {lgomez,marcal,dimos}@cvc.uab.es

Abstract. Many scene text understanding methods approach the end-to-end recognition problem from a word-spotting perspective and benefit greatly from using small per-image lexicons. Such customized lexicons are normally assumed as given and their source is rarely discussed. In this paper we propose a method that generates contextualized lexicons for scene images using only visual information. For this, we exploit the correlation between visual and textual information in a dataset consisting of images and textual content associated with them. Using the topic modeling framework to discover a set of latent topics in such a dataset allows us to re-rank a fixed dictionary in a way that prioritizes the words that are more likely to appear in a given image. Moreover, we train a CNN that is able to reproduce those word rankings using only the raw image pixels as input. We demonstrate that the quality of the automatically obtained custom lexicons is superior to a generic frequency-based baseline.

Keywords: Scene text · Photo OCR · Scene understanding · Lexicon generation · Topic modeling · CNN

1 Introduction

Reading systems for text understanding in the wild have shown a remarkable increase in performance over the past five years [1,2]. However, the problem is still far from solved: the best reported methods achieve end-to-end recognition performance of 87% in focused text scenarios [3,4] and 53% in the more difficult problem of incidental text [5]. The best-performing end-to-end scene text understanding methods address the problem from a word-spotting perspective and benefit greatly from customized lexicons. The size and quality of these custom lexicons have been shown to have a strong effect on recognition performance [6]. The source of such per-image customized lexicons is rarely discussed. In most academic settings, such custom lexicons are artificially created and provided to the algorithm as a form of predefined word queries. In real-life scenarios, however, lexicons need to be dynamically constructed.

© Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part I, LNCS 9913, pp. 395–410, 2016. DOI: 10.1007/978-3-319-46604-0_29
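To make the idea of a dynamically constructed lexicon concrete, the dictionary re-ranking summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that a topic model has already produced word-given-topic probabilities and that some predictor (a CNN, in the setting of this paper) supplies a topic distribution for the image. The function name `rank_lexicon`, its parameters, the toy data, and the scoring rule P(w | image) = Σ_t P(w | t) · P(t | image) are all assumptions made for this example.

```python
from typing import List, Sequence


def rank_lexicon(
    vocabulary: Sequence[str],
    word_given_topic: Sequence[Sequence[float]],
    topic_given_image: Sequence[float],
    top_k: int,
) -> List[str]:
    """Return the top_k dictionary words under the assumed score
    P(word | image) = sum_t P(word | topic t) * P(topic t | image)."""
    if len(word_given_topic) != len(topic_given_image):
        raise ValueError("need one row of word probabilities per topic")

    scores = [0.0] * len(vocabulary)
    for topic_weight, word_probs in zip(topic_given_image, word_given_topic):
        if len(word_probs) != len(vocabulary):
            raise ValueError("each topic row must cover the whole vocabulary")
        # Accumulate this topic's contribution to every word's score.
        for i, prob in enumerate(word_probs):
            scores[i] += topic_weight * prob

    # Sort word indices by descending score and keep the first top_k.
    order = sorted(range(len(vocabulary)), key=lambda i: scores[i], reverse=True)
    return [vocabulary[i] for i in order[:top_k]]


# Hypothetical toy data: two topics over a four-word fixed dictionary.
vocabulary = ["menu", "pizza", "exit", "platform"]
word_given_topic = [
    [0.45, 0.45, 0.05, 0.05],  # a "restaurant"-like topic
    [0.05, 0.05, 0.40, 0.50],  # a "station"-like topic
]
# Assumed topic distribution predicted for one image (mostly "station").
topic_given_image = [0.1, 0.9]

lexicon = rank_lexicon(vocabulary, word_given_topic, topic_given_image, top_k=2)
print(lexicon)
```

In this toy setting the image is dominated by the second topic, so the words that topic favors move to the front of the per-image lexicon, while the fixed dictionary itself stays unchanged.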

Y. Patel et al.

In one of the few examples in the literature, Wang et al. [7] used Google’s “search nearby” functionality to build custom lexicons of businesses that might appear in Google Street View images. In the document analysis domain, different techniques have been used to adapt language models to the context of the document, such as language model adaptation [8] and full-book recognition techniques [9]. Such approaches are nevertheless only feasible on relatively large corpora, where word statistics can be effectively calculated, and are not applicable to scene images, where text is scarce. On the other han