Multi-cue based 3D residual network for action recognition




ORIGINAL ARTICLE

Ming Zong1,2 · Ruili Wang1,2 · Zhe Chen3 · Maoli Wang4 · Xun Wang1,2 · Johan Potgieter5

Received: 19 March 2020 / Accepted: 19 August 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Convolutional neural networks (CNNs) are a natural structure for video modelling and have been successfully applied to action recognition. Existing 3D CNN-based action recognition methods mainly perform 3D convolutions on individual cues (e.g. appearance and motion cues) and rely on the design of subsequent networks to fuse these cues. In this paper, we propose a novel multi-cue 3D convolutional neural network (M3D), which integrates three individual cues (i.e. an appearance cue, a direct motion cue, and a salient motion cue) directly. Different from existing methods, the proposed M3D model performs 3D convolutions on multiple cues directly instead of on a single cue, and can thus obtain more discriminative and robust features by integrating the three cues as a whole. Further, we propose a novel residual multi-cue 3D convolution model (R-M3D) to improve the representation ability and obtain more representative video features. Experimental results verify the effectiveness of the proposed M3D model, and the proposed R-M3D model (pre-trained on the Kinetics dataset) achieves competitive performance compared with state-of-the-art models on the UCF101 and HMDB51 datasets.

Keywords: Action recognition · Multi-cue · 3D convolution · Salient motion cue · Residual
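The central operation described in the abstract — a single 3D convolution applied to several cues stacked channel-wise, rather than to each cue separately — can be sketched in plain Python. This is a toy illustration with hypothetical shapes and a simple averaging kernel, not the authors' implementation:

```python
def conv3d_single(volume, kernel):
    """Valid 3D convolution of one single-cue volume (T x H x W)
    with a kernel (kT x kH x kW)."""
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [[[sum(volume[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                  for dt in range(kT) for di in range(kH) for dj in range(kW))
              for j in range(W - kW + 1)]
             for i in range(H - kH + 1)]
            for t in range(T - kT + 1)]

def multi_cue_conv3d(cues, kernels):
    """One 3D convolution over the channel-wise stack of cues:
    each cue is convolved with its own kernel slice and the
    per-cue responses are summed into a single feature volume."""
    outs = [conv3d_single(c, k) for c, k in zip(cues, kernels)]
    T, H, W = len(outs[0]), len(outs[0][0]), len(outs[0][0][0])
    return [[[sum(o[t][i][j] for o in outs) for j in range(W)]
             for i in range(H)] for t in range(T)]

# Toy clip: 4 frames of 4x4 intensities (the appearance cue).
frames = [[[float(t + i + j) for j in range(4)] for i in range(4)]
          for t in range(4)]
# Direct motion cue: difference of consecutive frames, zero-padded
# at the end so both cues share the same temporal length.
motion = [[[frames[t + 1][i][j] - frames[t][i][j] for j in range(4)]
           for i in range(4)] for t in range(3)]
motion.append([[0.0] * 4 for _ in range(4)])

avg = [[[1.0 / 8] * 2 for _ in range(2)] for _ in range(2)]  # 2x2x2 mean filter
out = multi_cue_conv3d([frames, motion], [avg, avg])  # -> 3 x 3 x 3 volume
```

Because the cues are fused inside a single convolution, each output voxel already mixes appearance and motion information, instead of relying on a later network stage to combine separately convolved streams.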

1 Introduction

Human action recognition aims to automatically identify specified actions in a video [29, 30]. It has many applications, such as intelligent video surveillance, human–computer interaction, human behaviour analysis, and smart hospital care [32, 34, 37, 40, 41, 44]. Different from images, which contain only an appearance cue, videos contain not only the appearance cue extracted from still video frames but also a motion cue extracted from stacked video frames.

Corresponding author: Maoli Wang, [email protected]

1 School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China
2 School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
3 College of Computer and Information, Hohai University, Nanjing, China
4 School of Information Science and Engineering, Qufu Normal University, Rizhao 276800, China
5 School of Food and Advanced Technology, Massey University, Auckland, New Zealand

Therefore, the motion cue plays an important role in action recognition. Compared with traditional shallow hand-crafted models [53, 54], convolutional neural networks (CNNs) [47, 48, 59, 61] have shown a superior ability to capture appearance information in many vision-related tasks such as image classification [27, 58], object detection [13], and image segmentation [31]. To take advantage of CNNs, many 2D CNN-based methods [4, 9, 25, 45, 51, 57