Residual Dual Scale Scene Text Spotting by Fusing Bottom-Up and Top-Down Processing

Wei Feng 1,2 · Fei Yin 1,2 · Xu-Yao Zhang 1,2 · Wenhao He 3 · Cheng-Lin Liu 1,2,4

Received: 22 February 2020 / Accepted: 24 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Existing methods for arbitrary shaped text spotting fall into two categories: bottom-up methods detect and recognize local areas of text and then group them into text lines or words; top-down methods detect text regions of interest and then apply polygon fitting and text recognition to the detected regions. In this paper, we analyze the advantages and disadvantages of these two types of methods and propose a novel text spotter that fuses bottom-up and top-down processing. To detect text of arbitrary shapes, we employ a bottom-up detector that describes text with a series of rotated squares and design a top-down detector that represents the region of interest with a minimum enclosing rotated rectangle. The text boundary is then determined by fusing the outputs of the two detectors. To connect arbitrary shaped text detection and recognition, we propose a differentiable operator named RoISlide, which extracts features for arbitrary text regions from the whole-image feature maps. On top of the features extracted by RoISlide, a CNN and CTC based text recognizer is introduced, making the framework free from character-level annotations. To improve robustness against scale variation, we further propose a residual dual scale spotting mechanism, in which two spotters work on different feature levels and the high-level spotter operates on the residuals of the low-level spotter. Our method achieves state-of-the-art performance on four English datasets and one Chinese dataset, covering both arbitrary shaped and oriented texts. We also provide extensive ablation experiments to analyze how the key components affect performance.

Keywords Scene text spotting · Arbitrary shapes · Bottom-up · Top-down · Residual dual scale
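The recognition branch described in the abstract combines a CNN with CTC so that training needs only word-level transcriptions rather than character boxes. As a rough illustration of that idea (not the authors' actual network), the sketch below shows a generic convolutional recognition head trained with PyTorch's CTC loss; the layer sizes, charset size, and feature shapes are assumptions made for the example, and the input is simply taken to be a rectified text feature map such as the one RoISlide would produce.

```python
# Illustrative sketch only: a generic CNN + CTC recognition head in PyTorch,
# not the paper's exact architecture. Channel counts, charset size, and the
# input feature shape are assumptions for this example.
import torch
import torch.nn as nn

class CTCRecognitionHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=37):  # 36 symbols + CTC blank (assumed)
        super().__init__()
        # Convolutions over the rectified text feature map (e.g. a RoISlide-style crop).
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, feat):                  # feat: (N, C, H, W)
        x = self.convs(feat)
        x = x.mean(dim=2)                     # collapse height -> (N, 256, W)
        x = x.permute(2, 0, 1)                # (W, N, 256): one step per horizontal position
        logits = self.classifier(x)           # (W, N, num_classes)
        return logits.log_softmax(dim=-1)     # log-probabilities for CTC

# Training with CTC requires only word-level transcriptions, no character annotations.
head = CTCRecognitionHead()
feat = torch.randn(2, 256, 8, 64)             # two cropped text features (assumed size)
log_probs = head(feat)                        # (64, 2, 37)
targets = torch.randint(1, 37, (2, 10))       # dummy label sequences
input_lengths = torch.full((2,), 64, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```

Collapsing the height dimension and treating each horizontal position as a time step is the standard way to feed a 2D feature map to CTC; the paper's framework differs in detail, but the annotation-level benefit is the same.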

Communicated by Jiaya Jia.

Corresponding author: Cheng-Lin Liu
[email protected]

Wei Feng
[email protected]

Fei Yin
[email protected]

Xu-Yao Zhang
[email protected]

Wenhao He
[email protected]

1 National Laboratory of Pattern Recognition, Institute of Automation of Chinese Academy of Sciences, Beijing 100190, People’s Republic of China

2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China

3 Tencent Map Big Data Lab, Beijing 100193, People’s Republic of China

1 Introduction
The goal of scene text spotting is to detect and recognize texts in natural scene images. As it plays an important role in document analysis and image understanding, it has attracted increasing attention in recent years. Besides the variability in text scales and orientations, scene text often appears in arbitrary shapes. However, most existing text spotting methods (Liu et al. 2018; He et al. 2018a; Li et al. 2017; Bušta et al. 2017) only focus on horizontal/oriented texts and thus ca