Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition



1 Computer Vision Group, Xerox Research Center Europe, Meylan, France
{cesar.desouza,adrien.gaidon}@xrce.xerox.com
2 Centre de Visió per Computador, Universitat Autònoma de Barcelona, Bellaterra, Spain
[email protected]
3 German Aerospace Center, Wessling, Germany
[email protected]

Abstract. Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty to acquire and learn on large quantities of video data. Deep learning, although a breakthrough for image classification and showing promise for videos, has still not clearly superseded action recognition methods using hand-crafted features, even when training on massive datasets. In this paper, we introduce hybrid video classification architectures based on carefully designed unsupervised representations of hand-crafted spatiotemporal features classified by supervised deep networks. As we show in our experiments on five popular benchmarks for action recognition, our hybrid model combines the best of both worlds: it is data efficient (trained on 150 to 10000 short clips) and yet improves significantly on the state of the art, including recent deep models trained on millions of manually labelled images and videos.

1 Introduction

Classifying human actions in real-world videos is an open research problem with many applications in multimedia, surveillance, and robotics [1]. Its complexity arises from the variability of imaging conditions, motion, appearance, context, and interactions with persons, objects, or the environment over different spatiotemporal extents. Current state-of-the-art algorithms for action recognition are based on statistical models learned from manually labeled videos. They belong to two main categories: models relying on features hand-crafted for action recognition (e.g., [2–10]), or more recent end-to-end deep architectures (e.g., [11–22]).

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46478-7_43) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 697–716, 2016. DOI: 10.1007/978-3-319-46478-7_43


C.R. de Souza et al.

These approaches have complementary strengths and weaknesses. Models based on hand-crafted features are data efficient, as they can easily incorporate structured prior knowledge (e.g., the importance of motion boundaries along dense trajectories [2]), but their lack of flexibility may impede their robustness or modeling capacity. Deep models make fewer assumptions and are learned end-to-end from data (e.g., using 3D-ConvNets [23]), but they rely on hand-crafted architectures and the acquisition of large manually labeled video datasets (e.g., Sports1M [12]), a costly and error-prone process that poses optimization, engineering, and infrastructure challenges. Although deep learning for videos has recently made significant improvements (e.g., [13,14,23]), models using hand-crafted features are the