Multi-cue based 3D residual network for action recognition




ORIGINAL ARTICLE

Ming Zong1,2 · Ruili Wang1,2 · Zhe Chen3 · Maoli Wang4 · Xun Wang1,2 · Johan Potgieter5

Received: 19 March 2020 / Accepted: 19 August 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Convolutional neural networks (CNNs) are a natural structure for video modelling and have been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on the design of subsequent networks to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D), which integrates three individual cues (i.e. an appearance cue, a direct motion cue, and a salient motion cue) directly. Different from existing methods, the proposed M3D model performs 3D convolutions on multiple cues directly instead of on a single cue, and can thus obtain more discriminative and robust features by integrating the three cues as a whole. Further, we propose a novel residual multi-cue 3D convolution model (R-M3D) to improve the representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.

Keywords: Action recognition · Multi-cue · 3D convolution · Salient motion cue · Residual
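The central operation described in the abstract — a single 3D convolution applied to several cues stacked channel-wise, rather than to each cue separately — can be sketched in plain Python. This is a toy illustration with hypothetical shapes and a simple averaging kernel, not the authors' implementation:

```python
def conv3d_single(volume, kernel):
    """Valid 3D convolution of one single-cue volume (T x H x W)
    with a kernel (kT x kH x kW)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [[[sum(volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                  for dt in range(kT) for di in range(kH) for dj in range(kW))
              for j in range(W - kW + 1)]
             for i in range(H - kH + 1)]
            for t in range(T - kT + 1)]

def multi_cue_conv3d(cues, kernels):
    """One 3D convolution over the channel-wise stack of cues:
    each cue is convolved with its own kernel slice and the
    per-cue responses are summed into a single feature volume."""
    outs = [conv3d_single(c, k) for c, k in zip(cues, kernels)]
    T, H, W = len(outs[0]), len(outs[0][0]), len(outs[0][0][0])
    return [[[sum(o[t][i][j] for o in outs) for j in range(W)]
             for i in range(H)] for t in range(T)]

# Toy clip: 4 frames of 4x4 intensities (the appearance cue).
frames = [[[float(t + i + j) for j in range(4)] for i in range(4)]
          for t in range(4)]
# Direct motion cue: difference of consecutive frames, zero-padded
# at the end so both cues share the same temporal length.
motion = [[[frames[t + 1][i][j] - frames[t][i][j] for j in range(4)]
           for i in range(4)] for t in range(3)]
motion.append([[0.0] * 4 for _ in range(4)])

avg = [[[1.0 / 8] * 2 for _ in range(2)] for _ in range(2)]  # 2x2x2 mean filter
out = multi_cue_conv3d([frames, motion], [avg, avg])  # -> 3 x 3 x 3 volume
```

Because the cues are fused inside a single convolution, each output voxel already mixes appearance and motion information, instead of relying on a later network stage to combine separately convolved streams.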

1 Introduction

Human action recognition aims to automatically identify specified actions in a video [29, 30]. It has many applications, such as intelligent video surveillance, human–computer interaction, human behaviour analysis, and smart hospital care [32, 34, 37, 40, 41, 44]. Different from images, which contain only an appearance cue, videos contain not only the appearance cue extracted from still video frames but also a motion cue extracted from stacked video frames.

Corresponding author: Maoli Wang, [email protected]

1 School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China
2 School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
3 College of Computer and Information, Hohai University, Nanjing, China
4 School of Information Science and Engineering, Qufu Normal University, Rizhao 276800, China
5 School of Food and Advanced Technology, Massey University, Auckland, New Zealand

Therefore, the motion cue plays an important role in action recognition. Compared with traditional shallow hand-crafted models [53, 54], convolutional neural networks (CNNs) [47, 48, 59, 61] have shown a superior ability to capture appearance information in many vision-related tasks such as image classification [27, 58], object detection [13], and image segmentation [31]. To take advantage of CNNs, many 2D CNN-based methods [4, 9, 25, 45, 51, 57