Spatiotemporal attention enhanced features fusion network for action recognition
ORIGINAL ARTICLE

Danfeng Zhuang¹ · Min Jiang¹ · Jun Kong¹ · Tianshan Liu²

Received: 19 February 2020 / Accepted: 18 September 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract

In recent years, action recognition has become a popular and challenging task in computer vision. Two-stream networks, with an appearance stream and a motion stream, can make predictions jointly and achieve excellent action classification results. However, many of these networks fuse the features or scores in a simple manner, so the characteristics of the different streams are not exploited effectively. Likewise, some networks do not fully utilize or process the spatial context and temporal information. In this paper, a novel three-stream network, the spatiotemporal attention enhanced features fusion network, is proposed for action recognition. Firstly, a features fusion stream, which includes multi-level features fusion blocks, is designed to train the two streams jointly and to complement the two-stream network. Secondly, we model the channel features obtained from spatial context to enhance the ability to extract useful spatial semantic features at different levels. Thirdly, a temporal attention module that models temporal information makes the extracted temporal features more representative. Extensive experiments on the UCF101 and HMDB51 datasets verify the effectiveness of the proposed network for action recognition.

Keywords Action recognition · Three-stream · Spatiotemporal attention · Features fusion
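The channel-level and temporal attention ideas summarized in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the array shapes, the global-average-pooled channel descriptor, and the mean-based frame salience score are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feats):
    """Reweight channels of a spatial feature map.
    feats: (C, H, W) feature map from one stream."""
    desc = feats.mean(axis=(1, 2))        # (C,) descriptor from spatial context
    weights = softmax(desc)               # attention distribution over channels
    return feats * weights[:, None, None] # channel-enhanced features, (C, H, W)

def temporal_attention(frame_feats):
    """Pool per-frame features so informative time steps dominate.
    frame_feats: (T, D) feature vectors for T sampled frames."""
    scores = frame_feats.mean(axis=1)     # (T,) per-frame salience (assumed)
    weights = softmax(scores)             # attention distribution over time
    return (weights[:, None] * frame_feats).sum(axis=0)  # (D,) video feature

rng = np.random.default_rng(0)
spatial = channel_attention(rng.normal(size=(8, 7, 7)))
video = temporal_attention(rng.normal(size=(16, 128)))
print(spatial.shape, video.shape)
```

In a full network the attention weights would be produced by small learned sub-networks rather than raw pooled statistics; the sketch only shows where the reweighting enters the feature pipeline.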
1 Introduction

The goal of action recognition is to automatically analyze the actions executed by targets. In earlier studies, hand-crafted features were widely applied to action recognition, and some hand-crafted feature-based methods can extract useful features from videos and achieve excellent performance. These methods aim to capture more spatiotemporal local features: some construct local feature descriptors and extract motion information around interest points, yielding local feature vectors such as cuboids [1], histogram of gradient and histogram of flow (HOG/HOF) [2], and extended SURF (ESURF) [3] descriptors. Meanwhile, spatiotemporal trajectory-based action recognition methods such as [4] are an extension of local feature points in time and space. By tracking the key points of moving objects, [4] constructs more powerful local features. This dense-trajectory-based method has achieved good results on many public action recognition datasets. With the development of deep learning, many deep learning networks [5-7] are utilized to extract features effectively

* Min Jiang
  [email protected]

1 Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi 214122, China
2 Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong 999077, China