XwiseNet: action recognition with Xwise separable convolutions
Hefei Ling1 · Yao Chen1 · Jiazhong Chen1 · Lei Wu1 · Yuxuan Shi1 · Jing Deng2
Received: 8 October 2019 / Revised: 14 May 2020 / Accepted: 27 May 2020 / © Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
With the emergence of a large number of video resources, video action recognition is attracting much attention. Recently, recognizing the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, little attention has been paid to reducing the model size and computation cost of 3D CNNs. In this paper, we first propose a novel 3D convolution called the Xwise Separable Convolution, and then construct an original 3D CNN called XwiseNet. Our work aims to make 3D CNNs lightweight without reducing their recognition accuracy. Our key idea is to fully decouple the 3D convolution along the channel, spatial, and temporal dimensions. Experiments verify that XwiseNet outperforms 3D-ResNet-50 on the Mini-Kinetics benchmark with only 6% of the training parameters and 12% of the computation cost.

Keywords Action recognition · Deep learning · Three-dimensional convolutional neural networks · Lightweight
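The parameter savings claimed in the abstract can be illustrated with a rough back-of-the-envelope count. The sketch below assumes the fully decoupled convolution is built from a 1×1×1 pointwise layer (channel mixing), a depthwise 1×k×k spatial layer, and a depthwise k×1×1 temporal layer; these layer names and shapes are illustrative assumptions, not the paper's exact layer definitions.

```python
def conv3d_params(c_in, c_out, k):
    # standard 3D convolution: every output channel sees every input
    # channel through a k x k x k kernel
    return c_in * c_out * k ** 3

def decoupled_params(c, k):
    # hypothetical fully decoupled variant (illustrative, not the
    # paper's exact layer definitions): channel, spatial, and temporal
    # dimensions are each handled by a separate cheap layer
    pointwise = c * c        # 1x1x1 convolution mixing channels
    spatial = c * k * k      # depthwise 1xkxk spatial filtering
    temporal = c * k         # depthwise kx1x1 temporal filtering
    return pointwise + spatial + temporal

c, k = 256, 3
standard = conv3d_params(c, c, k)   # 256 * 256 * 27 = 1,769,472
decoupled = decoupled_params(c, k)  # 65,536 + 2,304 + 768 = 68,608
print(standard, decoupled)
print(round(decoupled / standard, 3))  # ratio ≈ 0.039
```

Under these assumptions, the decoupled block keeps roughly 4% of the standard block's parameters at 256 channels, which is the same order of magnitude as the 6% figure reported for the full network.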
Yao Chen
[email protected]

1 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Multimedia Tools and Applications

1 Introduction

Action recognition is attracting more and more attention in the field of computer vision. Owing to the spatiotemporal nature of videos, spatiotemporal convolutions, which we call three-dimensional (3D) convolutions, perform better than purely spatial convolutions, which are called two-dimensional (2D) convolutions. C3D was the first model to use 3D convolutions for action recognition [26]; such a model is called a 3D convolutional neural network (CNN). Many later variants have dramatically improved the accuracy of action recognition. 3D CNNs are outstanding at action recognition, but they come with huge computational overheads. They have many parameters, which means they need more computing resources and training data to optimize. For example, a 2D convolution of size 3 has 9 parameters (we assume that the number of input channels is 1), while a 3D convolution of the same size has 27. When a convolution of size 3 is applied to an input of size S, the FLOPs (floating-point operations) of the 2D convolution are S² × 3², while those of the 3D convolution are S³ × 3³. Clearly, for the same input and kernel sizes, 3D CNNs require far more parameters and computation than 2D CNNs. The important consequence is that training 3D CNNs demands large amounts of computing resources and samples, so lightweight design for 3D CNNs is an inevitable trend. Although decomposed 3D convolutions [22, 2
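The 2D-versus-3D cost comparison above is simple enough to check directly. The short script below counts multiply-accumulates for a single-channel size-3 kernel slid over an input of size S (stride 1, one multiply per kernel tap), matching the S² × 3² and S³ × 3³ expressions; the concrete input size S = 56 is an illustrative choice, not taken from the paper.

```python
def conv2d_flops(s, k):
    # multiplies for a k x k filter evaluated at every position
    # of an s x s input (stride 1, 'same' padding)
    return s ** 2 * k ** 2

def conv3d_flops(s, k):
    # the same filter extended to 3D: s x s x s input, k x k x k kernel
    return s ** 3 * k ** 3

s, k = 56, 3
print(conv2d_flops(s, k))   # 56^2 * 9  = 28,224
print(conv3d_flops(s, k))   # 56^3 * 27 = 4,741,632
# the 3D convolution costs s * k = 168 times more here
print(conv3d_flops(s, k) // conv2d_flops(s, k))
```

The ratio grows as S × k, so the gap widens with both input resolution and kernel size, which is why lightweight design matters much more for 3D CNNs than for 2D ones.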