Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition
S.I. : DEEP LEARNING APPROACHES FOR REALTIME IMAGE SUPER RESOLUTION (DLRSR)
Zhenbing Liu1 · Zeya Li1 · Ruili Wang2,3 · Ming Zong2,3 · Wanting Ji2,3
Received: 6 November 2019 / Accepted: 17 June 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract

Human action recognition is the task of labeling video frames with action labels. It is a challenging research topic because video backgrounds are usually cluttered, which degrades the performance of traditional human action recognition methods. In this paper, we propose a novel spatiotemporal saliency-based multi-stream ResNets (STS) model, which combines three streams (i.e., a spatial stream, a temporal stream and a spatiotemporal saliency stream) for human action recognition. Further, we propose a novel spatiotemporal saliency-based multi-stream ResNets with attention-aware long short-term memory (STS-ALSTM) network. The proposed STS-ALSTM model combines deep convolutional neural network (CNN) feature extractors with three attention-aware LSTMs to capture the long-term temporal dependencies between consecutive video frames, optical flow frames or spatiotemporal saliency frames. Experimental results on the UCF-101 and HMDB-51 datasets demonstrate that the proposed STS method and STS-ALSTM model obtain competitive performance compared with state-of-the-art methods.

Keywords Spatiotemporal saliency · Multi-stream · Attention-aware LSTM · Action recognition
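To make the STS-ALSTM pipeline concrete, the following is a minimal PyTorch sketch of a single stream: per-frame CNN features are fed to an LSTM whose hidden states are pooled by soft temporal attention before classification. The feature dimension, hidden size and attention form are illustrative assumptions, not the authors' exact architecture.

```python
# Sketch of one stream: CNN features -> LSTM -> soft temporal attention.
# Layer sizes and the attention form are assumptions for illustration only.
import torch
import torch.nn as nn

class AttentionAwareLSTM(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)       # scores each time step
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        h, _ = self.lstm(feats)                    # (B, T, hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)     # (B, T, 1) weights over time
        context = (w * h).sum(dim=1)               # attention-weighted pooling
        return self.classifier(context)            # (B, num_classes)

# Per-frame features would come from a ResNet applied to RGB, optical-flow
# or spatiotemporal-saliency frames; one such module is used per stream.
feats = torch.randn(4, 16, 2048)                   # 4 clips, 16 frames each
print(AttentionAwareLSTM()(feats).shape)           # torch.Size([4, 101])
```

In the three-stream setting, the class scores of the spatial, temporal and spatiotemporal saliency streams would then be fused, e.g., by weighted averaging.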
Ming Zong (corresponding author)
[email protected]

1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
2 School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
3 School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China

1 Introduction

Human action recognition is a process of labeling video frames with action labels [10, 29, 41, 52]. It has a wide range of real-life applications such as intelligent surveillance, virtual reality (VR), video retrieval, intelligent human–computer interaction and shopping behavior analysis. Conventional handcrafted feature-based human action recognition methods cannot fully extract efficient and robust features from videos, especially when the videos contain complex cluttered backgrounds with target occlusion, illumination variation and camera movement. To address this challenge, deep convolutional neural network (CNN)-based human action recognition methods have been developed; they can be grouped into three categories: (i) two-stream convolutional neural network-based methods [10, 41, 50], (ii) 3D convolutional neural network-based methods [8, 16, 47] and (iii) recurrent neural network-based methods.

Typically, a two-stream convolutional neural network consists of two streams: a spatial stream and a temporal stream. The spatial stream captures the appearance information of still RGB frames, while the temporal stream captures the motion information from stacked optical flow frames.
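As a concrete point of reference, below is a minimal sketch of such a two-stream design with late score fusion; the ResNet-18 backbones, the input sizes and the equal-weight averaging are illustrative assumptions rather than a specific published configuration.

```python
# Two-stream late fusion: one CNN scores RGB appearance, another scores
# stacked optical flow, and their class scores are averaged.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_stream(in_channels, num_classes=101):
    net = resnet18(weights=None)
    # Adapt the first conv to the stream's input: 3 channels for an RGB
    # frame, 2L for L stacked horizontal/vertical flow fields.
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                          padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

spatial = make_stream(3)                       # appearance from one RGB frame
temporal = make_stream(20)                     # motion from 10 stacked flows

rgb = torch.randn(4, 3, 224, 224)
flow = torch.randn(4, 20, 224, 224)
scores = (spatial(rgb) + temporal(flow)) / 2   # late fusion by averaging
print(scores.shape)                            # torch.Size([4, 101])
```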