Human Interaction Prediction Using Deep Temporal Features
1 School of Computer Science and Software Engineering, The University of Western Australia, Crawley, Australia
[email protected], {mohammed.bennamoun,senjian.an,farid.boussaid}@uwa.edu.au
2 School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Crawley, Australia
3 School of Engineering and Information Technology, Murdoch University, Murdoch, Australia
[email protected]
Abstract. Interaction prediction has a wide range of applications such as robot control and the prevention of dangerous events. In this paper, we introduce a new method to capture deep temporal information in videos for human interaction prediction. We propose to use flow coding images to represent the low-level motion information in videos and to extract deep temporal features using a deep convolutional neural network architecture. We tested our method on the UT-Interaction dataset and the challenging TV human interaction dataset, and demonstrated the advantages of the proposed deep temporal features based on flow coding images. The proposed method, though using only temporal information, outperforms the state-of-the-art methods for human interaction prediction.

Keywords: Interaction prediction · CNN · Temporal convolution

1 Introduction
Interaction prediction, or early event recognition, aims to infer an interaction at its early stage [1]. It can help in preventing harmful events (e.g., fighting) in a surveillance scenario. It is also essential to human-robot interaction (e.g., when a human lifts his/her hand or opens his/her arms, the robot can respond accordingly). Unlike interaction recognition, interaction prediction requires the inference of the action before it occurs. This requires the prediction of any potential future action, using the frames captured prior to the action. We can see from Fig. 1 that it is difficult to infer the action class from a single frame. The temporal information and the combination of several frames, on the other hand, provide more information about the future action class.
Fig. 1. Human interaction prediction. The goal is to predict the interaction class before it happens, which is difficult to achieve from a single frame.
In this paper, we focus on the temporal information of video sequences and introduce a new deep temporal feature for human interaction prediction. Existing interaction prediction methods mainly use spatial features (e.g., bag-of-words) [1], or combine spatial and temporal features (e.g., histograms of oriented optical flow) [2] to represent video frames. These hand-crafted features are, however, not powerful enough to capture the salient motion information for interaction prediction because they lose the global structure of the data [3]. Recent works on large-scale recognition tasks [4,5] show that deep learned representations perform better than such hand-crafted features.
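To make the idea of flow coding images concrete, the sketch below shows one common way to encode dense optical flow as a colour image and pass it through a pretrained CNN to obtain deep temporal features for the observed (partial) video. The paper's exact colour-coding scheme and network architecture are not given on this page, so the HSV-based encoding, the Farneback flow estimator, and the ImageNet-pretrained ResNet-18 backbone used here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: flow coding images + deep temporal features.
# Assumptions (not from the paper): Farneback optical flow, HSV colour
# coding of flow, and an ImageNet-pretrained ResNet-18 feature extractor.
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

def flow_coding_image(prev_bgr, next_bgr):
    """Encode dense optical flow between two frames as a colour image:
    flow direction -> hue, flow magnitude -> value."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(prev_bgr)
    hsv[..., 0] = ang * 180 / np.pi / 2                               # hue: direction
    hsv[..., 1] = 255                                                 # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)   # value: magnitude
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Pretrained CNN with the classifier removed, used as a fixed feature
# extractor for each flow coding image.
backbone = models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_temporal_features(frames):
    """frames: list of BGR frames observed before the interaction occurs.
    Returns one deep feature vector per consecutive frame pair."""
    feats = []
    with torch.no_grad():
        for prev, nxt in zip(frames[:-1], frames[1:]):
            coded = flow_coding_image(prev, nxt)
            rgb = cv2.cvtColor(coded, cv2.COLOR_BGR2RGB)
            feats.append(backbone(preprocess(rgb).unsqueeze(0)).squeeze(0))
    return torch.stack(feats)   # shape: (num_frame_pairs, 512) for ResNet-18
```

A prediction model, for instance the temporal convolution mentioned in the keywords, would then operate on this sequence of per-frame-pair feature vectors; that classification stage is not sketched here.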