Spatiotemporal saliency-based multi-stream networks with attention-aware LSTM for action recognition
- PDF / 1,268,036 Bytes
- 10 Pages / 595.276 x 790.866 pts Page_size
- 7 Downloads / 219 Views
(0123456789().,-volV)(0123456789(). ,- volV)
S.I. : DEEP LEARNING APPROACHES FOR REALTIME IMAGE SUPER RESOLUTION (DLRSR)
Spatiotemporal saliency-based multi-stream networks with attentionaware LSTM for action recognition Zhenbing Liu1 • Zeya Li1 • Ruili Wang2,3 • Ming Zong2,3
•
Wanting Ji2,3
Received: 6 November 2019 / Accepted: 17 June 2020 Ó Springer-Verlag London Ltd., part of Springer Nature 2020
Abstract Human action recognition is a process of labeling video frames with action labels. It is a challenging research topic since the background of videos is usually chaotic, which will reduce the performance of traditional human action recognition methods. In this paper, we propose a novel spatiotemporal saliency-based multi-stream ResNets (STS), which combines three streams (i.e., a spatial stream, a temporal stream and a spatiotemporal saliency stream) for human action recognition. Further, we propose a novel spatiotemporal saliency-based multi-stream ResNets with attention-aware long short-term memory (STS-ALSTM) network. The proposed STS-ALSTM model combines deep convolutional neural network (CNN) feature extractors with three attention-aware LSTMs to capture the temporal long-term dependency relationships between consecutive video frames, optical flow frames or spatiotemporal saliency frames. Experimental results on UCF-101 and HMDB-51 datasets demonstrate that our proposed STS method and STS-ALSTM model obtain competitive performance compared with the state-of-the-art methods. Keywords Spatiotemporal saliency Multi-stream Attention-aware LSTM Action recognition
1 Introduction Human action recognition is a process of labeling video frames with action labels [10, 29, 41, 52]. It has a wide range of applications in real life such as intelligent surveillance, virtual reality (VR), video retrieval, intelligent human–computer interaction and shopping behavior analysis. Conventional handcrafted feature-based human action recognition methods cannot fully extract efficient and robust features from videos, especially when there are complex clutter backgrounds in the videos such as target & Ming Zong [email protected] 1
School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
2
School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
3
School of Computer and Information Engineering, Zhejiang Gongshang University, Hangzhou, China
occlusion, illumination variation and camera movement. To address this challenge, deep convolutional neural network (CNN)-based human action recognition methods have been developed, which can be categorized into three categories: (i) two-stream convolutional neural networkbased methods [10, 41, 50], (ii) 3D convolutional neural network-based methods [8, 16, 47] and (iii) recurrent neural network-based methods. Typically, a two-stream convolutional neural network consists of two streams: a spatial stream and a temporal stream. The spatial stream is used to capture the appearance information
Data Loading...