Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
1 The Robotics Institute, Carnegie Mellon University, Pittsburgh, USA {imisra,hebert}@cs.cmu.edu
2 Facebook AI Research, Menlo Park, USA [email protected]
Abstract. In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive with, or better than, approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
Keywords: Unsupervised learning · Videos · Sequence verification · Action recognition · Pose estimation · Convolutional neural networks
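The verification task described in the abstract can be illustrated with a minimal sketch of how training examples might be built from a single video. This is an illustration under stated assumptions, not the authors' sampling procedure: the helper name `make_order_verification_pair`, the default tuple length of 3, the uniform sampling of frame indices, and the choice to exclude the exactly reversed tuple from the negatives are all hypothetical.

```python
import random


def make_order_verification_pair(num_frames, tuple_len=3, rng=None):
    """Return (ordered, shuffled) tuples of frame indices for one video.

    Hypothetical helper: uniform sampling and the tuple length are
    assumptions made for this sketch, not details taken from the paper.
    """
    if tuple_len < 3 or num_frames < tuple_len:
        raise ValueError("need tuple_len >= 3 and num_frames >= tuple_len")
    rng = rng or random.Random()

    # Positive example: distinct frame indices in increasing temporal order.
    ordered = sorted(rng.sample(range(num_frames), tuple_len))

    # Negative example: the same frames in an order that is neither the
    # forward nor the exactly reversed sequence (an assumption of this sketch).
    shuffled = ordered[:]
    while shuffled == ordered or shuffled == ordered[::-1]:
        rng.shuffle(shuffled)

    return tuple(ordered), tuple(shuffled)


# Usage: label 1 marks a correctly ordered tuple, label 0 a shuffled one.
positive, negative = make_order_verification_pair(30, rng=random.Random(0))
examples = [(positive, 1), (negative, 0)]
```

A classifier trained on such labeled tuples needs no semantic annotation, since the labels come from the frame order itself.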
1 Introduction
Sequential data provides an abundant source of information in the form of auditory and visual percepts. Learning from the observation of sequential data is a natural and implicit process for humans [1–3]. It informs both low level cognitive tasks and high level abilities like decision making and problem solving [4]. For instance, answering the question “Where would the moving ball go?” requires the development of basic cognitive abilities like prediction from sequential data like video [5].
Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46448-0_32) contains supplementary material, which is available to authorized users.
© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part I, LNCS 9905, pp. 527–544, 2016. DOI: 10.1007/978-3-319-46448-0_32
I. Misra et al.
In this paper, we explore the power of spatiotemporal signals, i.e., videos, in the context of computer vision. To study the information available in a video signal in isolation, we ask the question: How does an agent learn from the spatiotemporal structure present in video without using supervised semantic labels? Are the representations learned using the unsupervised spatiotemporal information present in videos meaningful? And finally, are these representations complementary to those learned from strongly supervised image data? In this paper, we explore such questions by using a sequential learning approach. Sequential learning is used in a variety of areas such as speech recognition, robotic path planning, adap