Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition



Abstract. This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly with how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information, thus demonstrating that depth2action is indeed complementary to existing approaches.
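The abstract names two components, STDN and MDMM, whose exact formulations are given later in the paper and are not reproduced in this excerpt. The following Python sketch only illustrates the general idea under stated assumptions: clip-level min-max normalization so that frame-wise jitter in the estimated depth is not mistaken for motion, and a motion map that accumulates thresholded absolute inter-frame depth differences. The function names and the threshold value are illustrative, not the paper's.

import numpy as np

def normalize_depth_sequence(depth_seq):
    # Rescale an estimated depth clip (T, H, W) to [0, 255] using statistics
    # computed over the whole clip rather than per frame, so frame-wise
    # estimation jitter does not masquerade as motion (assumed stand-in
    # for the paper's STDN).
    d_min, d_max = float(depth_seq.min()), float(depth_seq.max())
    return (depth_seq - d_min) / max(d_max - d_min, 1e-6) * 255.0

def depth_motion_map(depth_seq, threshold=5.0):  # threshold is illustrative
    # Accumulate absolute inter-frame depth differences above a small
    # threshold into a single (H, W) map summarizing depth motion
    # (assumed stand-in for the paper's MDMM).
    diffs = np.abs(np.diff(depth_seq, axis=0))   # (T-1, H, W)
    diffs[diffs < threshold] = 0.0               # suppress estimation noise
    return diffs.sum(axis=0)

# Usage on a synthetic clip of 16 estimated depth frames:
clip = np.random.rand(16, 112, 112).astype(np.float32)
mdmm = depth_motion_map(normalize_depth_sequence(clip))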

Keywords: Action recognition · Embedded depth

1 Introduction

Human action recognition in video is a fundamental problem in computer vision due to its increasing importance for a range of applications such as analyzing human activity, video search and recommendation, and complex event understanding. Much progress has been made over the past several years by employing hand-crafted local features such as improved dense trajectories (IDT) [39] or video representations learned directly from the data using deep convolutional neural networks (ConvNets). However, starting with the seminal two-stream ConvNets method [31], approaches have been limited to exploiting static visual information through frame-wise analysis and/or translational motion through optical flow or 3D ConvNets. Further increases in performance on benchmark datasets have come mostly from the higher capacity of deeper networks [23,43,44,46] or from recurrent neural networks that model long-term temporal dynamics [2,24,47].


Fig. 1. (a) “CricketBowling” and (b) “CricketShot”. Depth information about the bowler and the batters is key to telling these two classes apart. Our proposed depth2action approach exploits the depth information that is embedded in the videos to perform large-scale action recognition. This figure is best viewed in color (Color figure online)

Intuitively, depth can be an important cue for recognizing complex human actions. Depth information can help differentiate between action classes that are otherwise very similar, especially with respect to appearance and translational motion in the red-green-blue (RGB) domain. For instance, the "CricketShot" and "CricketBowling" classes in the UCF101 dataset are often confused by the state-of-the-art