Scene Text Recognition and Retrieval for Large Lexicons




CVIT, IIIT Hyderabad, Hyderabad, India [email protected]
Inria, LEAR team, Inria Grenoble Rhône-Alpes, Laboratoire Jean Kuntzmann, CNRS, Univ. Grenoble Alpes, Saint-Martin-d’Hères, France

Abstract. In this paper we propose a framework for recognition and retrieval tasks in the context of scene text images. In contrast to many of the recent works, we focus on the case where an image-specific list of words, known as the small lexicon setting, is unavailable. We present a conditional random field model defined on potential character locations and the interactions between them. Observing that the interaction potentials computed in the large lexicon setting are less effective than in the case of a small lexicon, we propose an iterative method, which alternates between finding the most likely solution and refining the interaction potentials. We evaluate our method on public datasets and show that it improves over baseline and state-of-the-art approaches. For example, we obtain nearly 15 % improvement in recognition accuracy and precision for our retrieval task over baseline methods on the IIIT-5K word dataset, with a large lexicon containing 0.5 million words.
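The alternation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a simple chain of character locations, pairwise costs taken from smoothed bigram counts in the lexicon, Viterbi decoding for the most likely solution, and a refinement step that keeps only the lexicon words closest in edit distance to the current solution. All function names, the `rounds` and `keep` parameters, and the toy costs are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def bigram_costs(lexicon, smoothing=1.0):
    """Pairwise cost = negative log of the smoothed bigram frequency (assumed prior)."""
    counts = Counter()
    for word in lexicon:
        counts.update(zip(word, word[1:]))
    total = sum(counts.values()) + smoothing * len(ALPHABET) ** 2
    return {(a, b): -math.log((counts[(a, b)] + smoothing) / total)
            for a in ALPHABET for b in ALPHABET}


def map_inference(unary, pairwise):
    """Viterbi decoding: minimise unary plus pairwise cost along a chain."""
    best = dict(unary[0])  # label -> cost of the best path ending in that label
    back = []              # one back-pointer table per later node
    for node in unary[1:]:
        new_best, pointers = {}, {}
        for label, cost in node.items():
            prev = min(best, key=lambda p: best[p] + pairwise[(p, label)])
            new_best[label] = best[prev] + pairwise[(prev, label)] + cost
            pointers[label] = prev
        best = new_best
        back.append(pointers)
    label = min(best, key=best.get)
    path = [label]
    for pointers in reversed(back):
        label = pointers[label]
        path.append(label)
    return "".join(reversed(path))


def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def recognize(unary, lexicon, rounds=3, keep=2):
    """Alternate between decoding and re-estimating priors on a reduced lexicon."""
    words = list(lexicon)
    solution = None
    for _ in range(rounds):
        solution = map_inference(unary, bigram_costs(words))
        # Refinement (assumed form): keep the words nearest to the current solution.
        words = sorted(words, key=lambda w: edit_distance(w, solution))[:keep]
    return solution


# Toy unary costs: the character classifier slightly prefers "R" over "I".
unary = [{"B": 0.1}, {"R": 0.5, "I": 0.7}, {"K": 0.1}, {"E": 0.1}]
print(recognize(unary, ["BIKE", "BAKE", "LIKE"]))
```

In this toy input the unary costs alone favour BRKE, and the bigram prior from the lexicon shifts the decoded word to BIKE. With a very large lexicon, nearly every bigram is observed and such a prior carries less information, which is why the sketch re-estimates it on a reduced word list.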

1 Introduction

Text can play an important role in understanding street view images. In light of this, many attempts have been made to recognize scene text [1–6]. Scene text recognition is a challenging problem, and its recent success is mostly limited to the small lexicon setting, where an image-specific lexicon containing the ground truth word is provided. Typically, these lexicons contain only 50 words [3]. This setting has many practical applications, but it does not scale well. As an example, consider the scenario of assisting visually impaired people in finding books by their titles in a library. Here the lexicon is populated with all the book titles. In this case, methods built for the small lexicon setting become less accurate, as lexicon sizes can range from a few thousand to a million words. For instance, when the lexicon size increases from 50 to 1000, the recognition accuracy drops by more than 10 % [6,7]. In other words, the general problem of scene text recognition, i.e., recognition with the help of a large lexicon (say, a million dictionary words), is far from being solved. In this paper, we investigate this problem.

One way to address the task of recognizing scene text is to pose the problem in a conditional random field (CRF) framework and obtain the maximum a posteriori (MAP) solution, as proposed in [3,4,7–10]. In these frameworks,

© Springer International Publishing Switzerland 2015. D. Cremers et al. (Eds.): ACCV 2014, Part I, LNCS 9003, pp. 494–508, 2015. DOI: 10.1007/978-3-319-16865-4_32

Word images (not reproduced here) and their top-5 diverse solutions (ranked):
Image 1: PITA, PASP, ENEP, PITT, AWAP
Image 2: AUM, NIM, COM, MUA, PLL
Image 3: MINSTER, MINSHER, GRINNER, MINISTR, MONSTER
Image 4: BRKE, BNKE, BIKE, BAKE, BOKE
Image 5: TOLS, TARS, THIS, TOHE, TALP

Fig. 1. Examples where the MAP solution is incorrect, as the pairwise priors become too generic when computed from