Video sketch: A middle-level representation for action recognition

Xing-Yuan Zhang1 · Ya-Ping Huang1 · Yang Mi2 · Yan-Ting Pei1 · Qi Zou1 · Song Wang2

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract  Different modalities extracted from videos, such as RGB and optical flow, may provide complementary cues for improving video action recognition. In this paper, we introduce a new modality named video sketch, which encodes human shape information, as a complementary modality for video action representation. We show that video action recognition can be enhanced by using the proposed video sketch. More specifically, we first generate the video sketch with class-distinctive action areas, and then employ a two-stream network to combine the shape information extracted from the image-based sketch and the point-based sketch, fusing the classification scores of the two streams to generate a shape representation for videos. Finally, we use this shape representation as a complement to the traditional appearance (RGB) and motion (optical flow) representations for the final video classification. We conduct extensive experiments on five human action recognition datasets: KTH, HMDB51, UCF101, Something-Something, and UTI. The experimental results show that the proposed method outperforms existing state-of-the-art action recognition methods.

Keywords  Action recognition · Video sketch · Attention model
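To make the late-fusion scheme described in the abstract more concrete, the following is a minimal, hypothetical sketch of how per-stream classification scores could be combined: the two sketch streams (image-based and point-based) are fused into a shape representation, which is then fused with appearance (RGB) and motion (optical flow) scores. The equal weighting, the softmax normalization, and all variable names are illustrative assumptions, not the authors' exact fusion rule.

```python
# Illustrative late fusion of per-stream class scores (not the paper's code).
import numpy as np

def softmax(x):
    # Numerically stable softmax over a per-class score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(prob_list, weights=None):
    """Weighted average of per-class probability vectors, one per stream."""
    if weights is None:
        weights = [1.0 / len(prob_list)] * len(prob_list)
    return sum(w * p for w, p in zip(weights, prob_list))

rng = np.random.default_rng(0)
num_classes = 10
# Placeholder per-class scores standing in for the four network outputs.
image_sketch, point_sketch, rgb, flow = (rng.normal(size=num_classes) for _ in range(4))

# Step 1: fuse the two sketch streams into a single shape representation.
shape = fuse([softmax(image_sketch), softmax(point_sketch)])
# Step 2: fuse shape with appearance (RGB) and motion (optical flow) scores.
final = fuse([shape, softmax(rgb), softmax(flow)])
print("predicted action class:", int(np.argmax(final)))
```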

1 Introduction

Action recognition [1, 2] is a popular research topic in the multimedia community and plays an important role in many multimedia applications, such as video surveillance [3, 4], human-robot interaction [5] and video summarization [6].

Ya-Ping Huang [email protected]
Xing-Yuan Zhang [email protected]
Yang Mi [email protected]
Yan-Ting Pei [email protected]
Qi Zou [email protected]
Song Wang [email protected]

1 Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China

2 Department of Computer Science and Engineering, University of South Carolina, Columbia, SC 29208, USA

Although action recognition has been studied for years, resulting in many advanced algorithms [7, 8], it is far from a solved problem due to various complexities, such as large feature variations within each action class, view-angle changes, lighting differences and occlusions. Prior research has revealed that the use of different modalities, such as RGB and optical flow, can benefit action recognition, even though they are all extracted from the original RGB images [9, 10]. Other modalities, e.g., depth maps or human skeletons, can further promote action recognition [11, 12], but their acquisition may require additional devices, such as MS Kinect sensors.

Human actions always involve body-posture variation over time, and those variations are well reflected by the spatio-temporal change of the shape of the human body. Video sketch, which represents body shape variations, is a set of lines and points whose shapes and positions can be tracked between frames.
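As a rough illustration of that definition (a set of lines and points whose positions can be tracked across frames), the following minimal data structure stores per-frame contour points and line segments and reads off the displacement of a tracked point between consecutive frames. The class and field names are hypothetical and do not reflect the paper's actual sketch format.

```python
# Hypothetical container for a video sketch: per-frame points and line
# segments (index pairs into the point list) describing body shape.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SketchFrame:
    points: List[Tuple[float, float]]  # (x, y) key points on the body contour
    lines: List[Tuple[int, int]]       # pairs of point indices forming strokes

@dataclass
class VideoSketch:
    frames: List[SketchFrame]          # one sketch per video frame

    def displacement(self, t: int, point_id: int) -> Tuple[float, float]:
        """Motion of a tracked point between frame t and frame t + 1."""
        x0, y0 = self.frames[t].points[point_id]
        x1, y1 = self.frames[t + 1].points[point_id]
        return (x1 - x0, y1 - y0)
```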