End-to-End Interpretation of the French Street Name Signs Dataset


Abstract. We introduce the French Street Name Signs (FSNS) Dataset consisting of more than a million images of street name signs cropped from Google Street View images of France. Each image contains several views of the same street name sign. Every image has normalized, title-case-folded ground-truth text as it would appear on a map. We believe that the FSNS dataset is large and complex enough to train a deep network of significant complexity to solve the street name extraction problem “end-to-end”, or to explore the design trade-offs between a single complex engineered network and multiple sub-networks designed and trained to solve sub-problems. We present such an “end-to-end” network/graph for TensorFlow and its results on the FSNS dataset.

Keywords: Deep networks · End-to-end networks · Image dataset · Multiview dataset

1 Introduction

R. Smith et al. In: G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part I, LNCS 9913, pp. 411–426, 2016. © Springer International Publishing Switzerland 2016. DOI: 10.1007/978-3-319-46604-0_30

The detection and recognition of text from outdoor images is of increasing research interest to the fields of computer vision, machine learning and optical character recognition. The combination of perspective distortion, uncontrolled source text quality, and lack of significant structure in the text layout adds extra challenge to the still incompletely solved problem of accurately recognizing text in all the world’s languages. Reflecting this interest, several datasets related to the problem have become available, including ICDAR 2003 Robust Reading [11], SVHN [13], and, more recently, COCO-Text [16], with details of these and others shown in Table 1. While these datasets each make a useful contribution to the field, the majority are very small compared to the size of a typical deep neural network. As the dataset size increases, it becomes increasingly difficult to maintain the accuracy of the ground truth, as the task of annotating must be delegated to an increasingly large pool of workers less involved with the project. In the COCO-Text [16] dataset, for instance, the authors themselves audited the accuracy of the ground truth and found that the annotators had found legible text regions with a recall of 84 % and transcribed the text content with an accuracy of 87.5 %. Even allowing an edit distance of 1, the text content accuracy was still only 92.5 %, with missing punctuation being the largest remaining category of error. Synthetic data has been shown [8] to be a good solution to this problem and can work well provided the synthetic data generator includes the formatting and distortions that will be present in the target problem. Some real-world data, however, can by its very nature be hard to predict, so real data remains the first choice in many cases where it is available. The difficulty therefore remains in generating a sufficiently accurately annotated, large enough dataset of real images to satisfy the needs of modern data-hungry deep network-based systems, which can learn