Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition
1 Computer Vision Group, Xerox Research Center Europe, Meylan, France
{cesar.desouza,adrien.gaidon}@xrce.xerox.com
2 Centre de Visió per Computador, Universitat Autònoma de Barcelona, Bellaterra, Spain
[email protected]
3 German Aerospace Center, Wessling, Germany
[email protected]
Abstract. Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty to acquire and learn on large quantities of video data. Deep learning, although a breakthrough for image classification and showing promise for videos, has still not clearly superseded action recognition methods using hand-crafted features, even when training on massive datasets. In this paper, we introduce hybrid video classification architectures based on carefully designed unsupervised representations of hand-crafted spatiotemporal features classified by supervised deep networks. As we show in our experiments on five popular benchmarks for action recognition, our hybrid model combines the best of both worlds: it is data efficient (trained on 150 to 10000 short clips) and yet improves significantly on the state of the art, including recent deep models trained on millions of manually labelled images and videos.
1 Introduction
Classifying human actions in real-world videos is an open research problem with many applications in multimedia, surveillance, and robotics [1]. Its complexity arises from the variability of imaging conditions, motion, appearance, context, and interactions with persons, objects, or the environment over different spatio-temporal extents. Current state-of-the-art algorithms for action recognition are based on statistical models learned from manually labeled videos. They belong to two main categories: models relying on features hand-crafted for action recognition (e.g., [2–10]), or more recent end-to-end deep architectures (e.g., [11–22]).

Electronic supplementary material: the online version of this chapter (doi:10.1007/978-3-319-46478-7_43) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 697–716, 2016. DOI: 10.1007/978-3-319-46478-7_43
These approaches have complementary strengths and weaknesses. Models based on hand-crafted features are data efficient, as they can easily incorporate structured prior knowledge (e.g., the importance of motion boundaries along dense trajectories [2]), but their lack of flexibility may impede their robustness or modeling capacity. Deep models make fewer assumptions and are learned end-to-end from data (e.g., using 3D-ConvNets [23]), but they rely on hand-crafted architectures and the acquisition of large manually labeled video datasets (e.g., Sports-1M [12]), a costly and error-prone process that poses optimization, engineering, and infrastructure challenges. Although deep learning for videos has recently made significant improvements (e.g., [13,14,23]), models using hand-crafted features remain competitive with the state of the art on standard action recognition benchmarks.
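The hybrid idea underlying this combination can be illustrated with a toy pipeline: an unsupervised codebook (here plain k-means over synthetic descriptors, a simplified stand-in for the carefully designed encodings of hand-crafted spatio-temporal features used in the paper) turns each clip's bag of per-trajectory descriptors into a fixed-length representation, which a small supervised classifier then labels. Everything below is a self-contained sketch on synthetic data, not the authors' implementation; all function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Unsupervised codebook over local descriptors (Lloyd's algorithm)."""
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)  # (N, k) squared distances
        assign = d.argmin(1)
        for j in range(k):
            pts = X[assign == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers

def encode(descs, centers):
    """Clip-level representation: normalized histogram of codeword assignments."""
    d = ((descs[:, None, :] - centers[None]) ** 2).sum(-1)
    hist = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return hist / hist.sum()

def make_clip(label, n=80, dim=8):
    """Synthetic 'clip': a bag of trajectory-like descriptors for one of two classes."""
    mean = np.zeros(dim)
    mean[label] = 2.0
    return mean + rng.normal(size=(n, dim))

labels = rng.integers(0, 2, 60)
clips = [make_clip(y) for y in labels]

# Unsupervised stage: learn the codebook from all descriptors, encode each clip.
centers = kmeans(np.vstack(clips), k=16)
X = np.stack([encode(c, centers) for c in clips])

# Supervised stage: a minimal softmax classifier on the unsupervised codes
# (a stand-in for the deep networks used on top of the encodings in practice).
W = np.zeros((16, 2))
b = np.zeros(2)
Y = np.eye(2)[labels]
for _ in range(300):
    logits = X @ W + b
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    g = (p - Y) / len(X)          # gradient of cross-entropy w.r.t. logits
    W -= 1.0 * X.T @ g
    b -= 1.0 * g.sum(0)

acc = ((X @ W + b).argmax(1) == labels).mean()
print(f"train accuracy: {acc:.2f}")
```

The two stages are decoupled on purpose: the codebook needs no labels, so only the small final classifier consumes supervision, which is one way to stay data efficient when labeled clips are scarce.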