XwiseNet: action recognition with Xwise separable convolutions
Hefei Ling1 · Yao Chen1 · Jiazhong Chen1 · Lei Wu1 · Yuxuan Shi1 · Jing Deng2
Received: 8 October 2019 / Revised: 14 May 2020 / Accepted: 27 May 2020 / © Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
With the emergence of a large number of video resources, video action recognition is attracting much attention. Recently, recognizing the outstanding performance of three-dimensional (3D) convolutional neural networks (CNNs), many works have begun to apply them to action recognition and have obtained satisfactory results. However, little attention has been paid to reducing the model size and computation cost of 3D CNNs. In this paper, we first propose a novel 3D convolution called the Xwise Separable Convolution, and then construct an original 3D CNN called XwiseNet. Our work aims to make 3D CNNs lightweight without reducing their recognition accuracy. Our key idea is to fully decouple the 3D convolution along the channel, spatial, and temporal dimensions. Experiments verify that XwiseNet outperforms 3D-ResNet-50 on the Mini-Kinetics benchmark with only 6% of the training parameters and 12% of the computation cost.

Keywords Action recognition · Deep learning · Three-dimensional convolutional neural networks · Lightweight
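The parameter savings claimed in the abstract can be illustrated with a rough back-of-the-envelope count. The sketch below assumes the fully decoupled convolution is built from a 1×1×1 pointwise layer (channel mixing), a depthwise 1×k×k spatial layer, and a depthwise k×1×1 temporal layer; these layer names and shapes are illustrative assumptions, not the paper's exact layer definitions.

```python
def conv3d_params(c_in, c_out, k):
    # standard 3D convolution: every output channel sees every input
    # channel through a k x k x k kernel
    return c_in * c_out * k ** 3

def decoupled_params(c, k):
    # hypothetical fully decoupled variant (illustrative, not the
    # paper's exact layer definitions): channel, spatial, and temporal
    # dimensions are each handled by a separate cheap layer
    pointwise = c * c        # 1x1x1 convolution mixing channels
    spatial = c * k * k      # depthwise 1xkxk spatial filtering
    temporal = c * k         # depthwise kx1x1 temporal filtering
    return pointwise + spatial + temporal

c, k = 256, 3
standard = conv3d_params(c, c, k)   # 256 * 256 * 27 = 1,769,472
decoupled = decoupled_params(c, k)  # 65,536 + 2,304 + 768 = 68,608
print(standard, decoupled)
print(round(decoupled / standard, 3))  # ratio ≈ 0.039
```

Under these assumptions, the decoupled block keeps roughly 4% of the standard block's parameters at 256 channels, which is the same order of magnitude as the 6% figure reported for the full network.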
Yao Chen
[email protected]

1 School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou, China

Multimedia Tools and Applications

1 Introduction

Action recognition is attracting more and more attention in the field of computer vision. Owing to the spatiotemporal nature of videos, spatiotemporal convolutions, which we call three-dimensional (3D) convolutions, perform better than purely spatial convolutions, which are called two-dimensional (2D) convolutions. C3D was the first model to use 3D convolutions for action recognition [26]; such a model is called a 3D convolutional neural network (CNN). Many later variants have dramatically improved the accuracy of action recognition. 3D CNNs are outstanding at action recognition, but they come with huge computational overheads. They have many parameters, which means they need more computing resources and training data to optimize. For example, a 2D convolution of size 3 has 9 parameters (we assume that the number of input channels is 1), while a 3D convolution of the same size has 27. When a convolution of size 3 is applied to an input of size S, the FLOPs (floating-point operations) of the 2D convolution are S² × 3², while those of the 3D convolution are S³ × 3³. Clearly, for the same input and kernel sizes, 3D CNNs require far more parameters and computation than 2D CNNs. The important consequence is that training 3D CNNs demands large amounts of computing resources and samples, so lightweight design for 3D CNNs is an inevitable trend. Although decomposed 3D convolutions [22, 2
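The 2D-versus-3D cost comparison above is simple enough to check directly. The short script below counts multiply-accumulates for a single-channel size-3 kernel slid over an input of size S (stride 1, one multiply per kernel tap), matching the S² × 3² and S³ × 3³ expressions; the concrete input size S = 56 is an illustrative choice, not taken from the paper.

```python
def conv2d_flops(s, k):
    # multiplies for a k x k filter evaluated at every position
    # of an s x s input (stride 1, 'same' padding)
    return s ** 2 * k ** 2

def conv3d_flops(s, k):
    # the same filter extended to 3D: s x s x s input, k x k x k kernel
    return s ** 3 * k ** 3

s, k = 56, 3
print(conv2d_flops(s, k))   # 56^2 * 9  = 28,224
print(conv3d_flops(s, k))   # 56^3 * 27 = 4,741,632
# the 3D convolution costs s * k = 168 times more here
print(conv3d_flops(s, k) // conv2d_flops(s, k))
```

The ratio grows as S × k, so the gap widens with both input resolution and kernel size, which is why lightweight design matters much more for 3D CNNs than for 2D ones.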