Detecting Text in Natural Image with Connectionist Text Proposal Network


1 Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
2 University of Oxford, Oxford, UK
3 The Chinese University of Hong Kong, Sha Tin, Hong Kong
{zhi.tian,wl.huang,tong.he,pan.he,yu.qiao}@siat.ac.cn

Abstract. We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of the image, making it powerful at detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods that require multi-step post-filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient, running at 0.14 s/image using the very deep VGG16 model [27]. An online demo is available at http://textdet.com/.

Keywords: Scene text detection · Convolutional network · Recurrent neural network · Anchor mechanism

1 Introduction

Reading text in natural images has recently attracted increasing attention in computer vision [1,8–11,14,15,28,32,35]. This is due to its numerous practical applications, such as image OCR, multi-language translation, and image retrieval. It includes two sub-tasks: text detection and recognition. This work focuses on the detection task [1,14,28,32], which is more challenging than the recognition task carried out on a well-cropped word image [9,15]. The large variance of text patterns and highly cluttered backgrounds pose the main challenges for accurate text localization.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VIII, LNCS 9912, pp. 56–72, 2016. DOI: 10.1007/978-3-319-46484-8_4

Fig. 1. (a) Architecture of the Connectionist Text Proposal Network (CTPN). We densely slide a 3×3 spatial window through the last convolutional maps (conv5) of the VGG16 model [27]. The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input to the 256D BLSTM (including two 128D LSTMs). The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors. (b) The CTPN outputs sequential fixed-width fine-scale text proposals. The color of each box indicates the text/non-text score.
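The data flow in Fig. 1(a) can be followed as a simple size calculation. The Python sketch below is illustrative only and is not the authors' implementation: the function name `ctpn_output_shapes` is hypothetical, and the defaults are assumptions not stated in this section, namely C = 512 channels and a total stride of 16 for VGG16 conv5, k = 10 anchors per location, a padded 3×3 window (so the map size is preserved), and an output split of 2k scores, 2k y-axis coordinates and k side-refinement offsets.

```python
def ctpn_output_shapes(img_h, img_w, C=512, k=10, stride=16):
    """Sizes at each stage of Fig. 1(a) for one image (illustrative sketch).

    Assumed, not given in the caption: C, k, stride, and the 2k/2k/k split.
    """
    # conv5 feature map: VGG16 applies four 2x2 poolings before conv5,
    # so each spatial side shrinks by a factor of 16 (floor division).
    feat_h, feat_w = img_h // stride, img_w // stride
    return {
        "conv5_map": (feat_h, feat_w, C),
        # Each 3x3 window over C channels is flattened into one vector.
        "window_feature_dim": 3 * 3 * C,
        # Windows in the same row form one sequence for the BLSTM.
        "num_sequences": feat_h,
        "sequence_length": feat_w,
        # Two 128D LSTMs (forward and backward) give a 256D output.
        "blstm_dim": 2 * 128,
        "fc_dim": 512,
        # Per-location outputs for k anchors (assumed channel split).
        "scores": (feat_h, feat_w, 2 * k),
        "y_coords": (feat_h, feat_w, 2 * k),
        "side_refinement": (feat_h, feat_w, k),
        # One fixed-width proposal per anchor per location.
        "num_proposals": feat_h * feat_w * k,
    }


shapes = ctpn_output_shapes(600, 900)
```

Under these assumptions, a 600×900 input yields a 37×56 conv5 map, so the BLSTM processes 37 row sequences of length 56, each element being a 4608-dimensional window feature.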