Robust Segmentation for Video Captions with Complex Backgrounds


Department of Computer Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China [email protected], [email protected], [email protected], [email protected]

Abstract. Caption text contains rich information that can be used for video indexing and summarization. In this paper, we propose an effective caption text segmentation approach to improve OCR accuracy. An AlexNet CNN is first trained with path signatures for text tracking. We then use an improved adaptive thresholding method to segment caption text in individual frames. Finally, multi-frame integration is performed with gamma correction and region growing. In contrast to conventional methods, which extract video text in individual frames independently, we exploit the temporal characteristics specific to video to perform segmentation. Moreover, the proposed method effectively removes complex backgrounds whose intensity is similar to that of the text. Experimental results on different videos, and comparisons with other methods, demonstrate the effectiveness of our approach.

Keywords: Caption text segmentation · Convolutional neural networks · Path signature · Multi-frame integration

1 Introduction

As digital multimedia technology develops rapidly, video has become one of the most popular media forms, delivered via TV broadcasting and the Internet. Text in video, especially caption text, is a significant high-level semantic feature that directly describes the video content. For instance, scores in sports programs let viewers grasp the game situation at a glance, headlines in news videos summarize the news content, and captions in films aid understanding of the storyline. Thus, caption text extraction and recognition play an important role in video indexing [1] and summarization. However, state-of-the-art Optical Character Recognition (OCR) does not work well on text in video, although it has achieved excellent performance on printed documents. In contrast to binary document images, video text presents far more difficulty due to complex backgrounds, a variety of text fonts, and the low contrast caused by lossy compression.

Before text recognition is performed, three phases are generally involved: text detection, text tracking, and text segmentation. Text detection locates the text region within a single video frame, text tracking maintains the consistency of the text between consecutive frames, and text segmentation separates text from background (i.e., binarization). After segmentation, we obtain a binary image, i.e., a mask, which is better suited to OCR engines.

© Springer Nature Singapore Pte Ltd. 2016
T. Tan et al. (Eds.): CCPR 2016, Part II, CCIS 663, pp. 89–100, 2016.
DOI: 10.1007/978-981-10-3005-5_8

Z.-H. Xing et al.

Although many methods have been proposed for video text segmentation in the past decades, most of them extract text only in individual frames independently. Even if
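The segmentation phase reduces to binarization: producing a mask that separates text pixels from background for an OCR engine. As a rough illustration only (not the paper's improved method), the sketch below binarizes a grayscale crop with a plain local-mean adaptive threshold; the `window` and `offset` parameters are illustrative choices, and captions are assumed brighter than their surroundings.

```python
import numpy as np

def adaptive_threshold(gray, window=15, offset=10):
    """Local-mean binarization: a pixel becomes 255 if it exceeds the
    mean of its window x window neighbourhood minus `offset`, else 0."""
    h, w = gray.shape
    pad = window // 2
    # Edge padding so border pixels get full-sized neighbourhoods.
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Integral image: each local window sum is computed in O(1).
    ii = np.pad(padded, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)
    y, x = np.mgrid[0:h, 0:w]
    sums = (ii[y + window, x + window] - ii[y, x + window]
            - ii[y + window, x] + ii[y, x])
    mean = sums / (window * window)
    return np.where(gray > mean - offset, 255, 0).astype(np.uint8)
```

Applied to a caption crop, bright text strokes survive as 255 in the mask while slowly varying backgrounds are suppressed; since flat regions trivially exceed their own mean minus `offset`, such a threshold is typically applied inside detected text regions rather than to whole frames.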