Consistent constraint-based video-level learning for action recognition

Shi et al. EURASIP Journal on Image and Video Processing https://doi.org/10.1186/s13640-020-00519-1


RESEARCH

Open Access

Qinghongya Shi1,2,3, Hong-Bo Zhang1,2,3*, Hao-Tian Ren1,2,3, Ji-Xiang Du1,2,3 and Qing Lei1,2,3

*Correspondence: [email protected]
1 Department of Computer Science and Technology, Huaqiao University, Xiamen, Fujian, China
2 Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University, Xiamen, Fujian, China
Full list of author information is available at the end of the article

Abstract
This paper proposes a new neural network learning method to improve the performance of action recognition in video. Most human action recognition methods use a clip-level training strategy, which divides the video into multiple clips and trains the feature learning network by minimizing the loss function of clip classification. The video category is then predicted by voting over the clips from the same video. To obtain more effective action features, a new video-level feature learning method is proposed to train a 3D CNN and boost action recognition performance. Unlike clip-level training, which uses individual clips as input, the video-level learning network uses the entire video as input. A consistent constraint loss is defined to minimize the distance between clips of the same video in the voting space. In addition, a video-level loss function is defined to compute the video classification error. The experimental results show that the proposed video-level training is a more effective action feature learning approach than clip-level training. The proposed method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets without using models pre-trained on other large-scale datasets. Our code and final model are available at https://github.com/hqu-cst-mmc/VLL.

Keywords: Consistent constraint, Video-level learning, 3D CNN, Action recognition, Loss function
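To make the two loss terms named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Several details are assumptions for illustration only: the voting space is taken to be the per-clip softmax output, the video prediction is the average of the clip votes, the consistent constraint is the mean squared Euclidean distance of each clip vote from that average, and the helper names (`softmax`, `video_vote`, `consistency_loss`, `video_level_loss`, `total_loss`) and the `weight` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Turn one clip's raw class scores into a probability vector (its vote)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def video_vote(clip_probs):
    """Average the clip votes to obtain one video-level prediction (assumed voting rule)."""
    n = len(clip_probs)
    num_classes = len(clip_probs[0])
    return [sum(p[c] for p in clip_probs) / n for c in range(num_classes)]


def consistency_loss(clip_probs):
    """Assumed form: mean squared Euclidean distance of each clip vote from the video mean."""
    mean = video_vote(clip_probs)
    dists = [sum((pc - mc) ** 2 for pc, mc in zip(p, mean)) for p in clip_probs]
    return sum(dists) / len(clip_probs)


def video_level_loss(clip_probs, label):
    """Cross-entropy of the averaged video vote against the video label."""
    return -math.log(video_vote(clip_probs)[label])


def total_loss(clip_scores, label, weight=1.0):
    """Video-level classification error plus a weighted consistency term (weight is assumed)."""
    probs = [softmax(s) for s in clip_scores]
    return video_level_loss(probs, label) + weight * consistency_loss(probs)


# Three clips of one video, three classes, true class 0.
agreeing = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
disagreeing = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
loss_agree = total_loss(agreeing, label=0)
loss_disagree = total_loss(disagreeing, label=0)
```

In this sketch, clips that vote alike incur almost no consistency penalty, while clips that disagree are penalized both through the consistency term and through a weaker averaged vote for the true class. In a real training setup the clip scores would come from the 3D CNN and the combined loss would be backpropagated through all clips of the video together.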

1 Introduction
Action recognition has gradually become a research hotspot in computer vision and pattern recognition, and it is widely applied in intelligent video surveillance, virtual reality, motion analysis, and video retrieval. Improving the accuracy of human action recognition has been studied by many researchers, and many methods for recognizing actions in video have been proposed in recent years. The key to these methods is to learn effective action features from the input data. Several different neural networks are employed in these methods, such as 3D convolutional neural networks (ConvNets) [1, 2], multi-stream 2D ConvNets [3–5], and recurrent neural networks [6, 7]. The difference between a video feature and an image feature is whether it contains temporal information. To deal with the varying temporal length of videos and to reduce computational complexity, the input video is divided into a set of clips. Each clip has the same number of frames, and the video label is assigned to each clip. In the training sta