Downtown Osaka Scene Text Dataset



Abstract. This paper presents a new scene text dataset named the Downtown Osaka Scene Text Dataset (DOST dataset, for short). The dataset consists of sequential images captured with an omnidirectional camera in shopping streets in downtown Osaka. Unlike most existing datasets, which consist of intentionally captured scene images, the DOST dataset consists of uncontrolled scene images: the omnidirectional camera enabled us to capture videos (sequential images) of the whole scene surrounding the camera. Because the dataset preserves real scenes containing text exactly as they were, its texts are scene texts in the wild. The DOST dataset contains 32,147 manually ground-truthed sequential images. They contain 935,601 text regions, of which 797,919 are legible and 137,682 illegible. The legible regions contain 2,808,340 characters. The dataset is evaluated with two existing scene text detection methods and one powerful commercial end-to-end scene text recognition method to assess its difficulty and quality in comparison with existing datasets.

Keywords: Scene text in the wild · Uncontrolled scene text · Omnidirectional camera · Sequential image · Video · Japanese text

1 Introduction

Text plays important roles in our lives. Imagining life in a world without text, in which, for example, no book, newspaper, signboard, restaurant menu, text message on a smartphone, or program source code exists, or all of them exist in a completely different form, we can rediscover not only the necessity of text but also the importance of reading and interpreting it. Although only human beings are endowed with the ability to read and interpret text, researchers have long struggled to enable computers to read text as well. Focusing on camera-captured text and scene text, some pioneering works were presented in the 1990s [21]. Since then, increasing attention has been paid to recognizing scene text. Table 1 shows the remarkable recent progress of scene text recognition techniques. In the table, most of the reported accuracies of the latest methods exceed 90 % on major benchmark datasets. However, does this mean these methods are powerful enough to read the variety of texts in real environments? Many people would agree that the answer is no. Text images contained in these

© Springer International Publishing Switzerland 2016
G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part I, LNCS 9913, pp. 440–455, 2016.
DOI: 10.1007/978-3-319-46604-0_32
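The annotation statistics quoted in the abstract are internally consistent and can be sanity-checked with a short script (a minimal sketch; the variable names are mine, not the dataset's actual annotation schema):

```python
# Counts reported for the DOST dataset in the abstract.
IMAGES = 32_147             # manually ground-truthed sequential images
TOTAL_REGIONS = 935_601     # annotated text regions
LEGIBLE_REGIONS = 797_919
ILLEGIBLE_REGIONS = 137_682
CHARACTERS = 2_808_340      # characters in the legible regions

# The legible/illegible split should account for every annotated region.
assert LEGIBLE_REGIONS + ILLEGIBLE_REGIONS == TOTAL_REGIONS

# Derived annotation densities (illustrative figures, not stated in the paper).
regions_per_image = TOTAL_REGIONS / IMAGES
chars_per_legible_region = CHARACTERS / LEGIBLE_REGIONS
print(f"{regions_per_image:.1f} regions/image, "
      f"{chars_per_legible_region:.1f} chars/legible region")
```

With the numbers above this reports roughly 29 regions per image and about 3.5 characters per legible region, which reflects how densely text appears in the uncontrolled shopping-street scenes.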


Table 1. Recent improvement of recognition performance on scene text recognition tasks. Based on Table 1 of [1], this table summarizes recognition accuracies (in percent) of recent methods on representative benchmark datasets, in chronological order. "50," "1k" and "50k" denote lexicon sizes; "Full" and "None" denote recognition with all per-image lexicon words and without a lexicon, respectively.

Year   Method            IIIT5K [2]       SVT [3]   ICDAR03 [4]
       Lexicon           50      None     50        50
-      ABBYY [3]         24.3    -        35.0      -
2011   Wang et al. [3]   -       -        57.0      -
2012   Mishra et al.
2013
2014
2015
2016
