Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video




Abstract. Current approaches for activity recognition often ignore constraints on computational resources: (1) they rely on extensive feature computation to obtain rich descriptors on all frames, and (2) they assume batch-mode access to the entire test video at once. We propose a new active approach to activity recognition that prioritizes “what to compute when” in order to make timely predictions. The main idea is to learn a policy that dynamically schedules the sequence of features to compute on selected frames of a given test video. In contrast to traditional static feature selection, our approach continually re-prioritizes computation based on the accumulated history of observations and accounts for the transience of those observations in ongoing video. We develop variants to handle both the batch and streaming settings. On two challenging datasets, our method provides significantly better accuracy than alternative techniques for a wide range of computational budgets.
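To make the abstract's idea concrete, the following is a minimal illustrative sketch of an active recognition loop under a computational budget: a policy repeatedly chooses which (frame, feature) pair to compute next, conditioned on the accumulated observation history, until the budget is spent. The feature names, costs, policy, and classifier here are all hypothetical placeholders, not the paper's learned policy or actual descriptors.

```python
import random

# Hypothetical per-feature computation costs (illustrative values only).
FEATURE_COSTS = {"motion": 1.0, "object": 3.0, "cnn": 5.0}

def extract(frame, feat_name):
    # Stand-in for a real feature extractor (e.g., dense trajectories, CNN).
    return hash((frame, feat_name)) % 100

def random_policy(frames, history, remaining):
    # Placeholder policy: pick any affordable (frame, feature) action not yet
    # computed. The paper instead *learns* this scheduling policy.
    done = {(i, name) for i, name, _ in history}
    options = [(i, name) for i in range(len(frames))
               for name, cost in FEATURE_COSTS.items()
               if (i, name) not in done and cost <= remaining]
    return random.choice(options) if options else None

def recognize_with_budget(frames, policy, classify, budget):
    """One episode of dynamic feature prioritization: ask the policy what to
    compute next given the observation history, stopping when the budget is
    exhausted or the policy declines to act."""
    history, spent = [], 0.0
    while True:
        action = policy(frames, history, budget - spent)
        if action is None:
            break
        i, name = action
        history.append((i, name, extract(frames[i], name)))
        spent += FEATURE_COSTS[name]
    return classify(history), spent

# Trivial stand-in classifier: predicts from the observation count.
label, spent = recognize_with_budget(
    frames=list(range(10)),
    policy=random_policy,
    classify=lambda h: "activity_A" if len(h) > 2 else "unknown",
    budget=8.0)
```

The key contrast with static feature selection is that `policy` sees the growing `history`, so it can re-prioritize computation as evidence accumulates; in the streaming variant it would additionally be restricted to frames still present in a short buffer.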

1

Introduction

Activity recognition in video is a core vision challenge. It has applications in surveillance, autonomous driving, human-robot interaction, and automatic tagging for large-scale video retrieval. In any such setting, a system that can both categorize and temporally localize activities would be of great value. Activity recognition has attracted a steady stream of interesting research [1]. Recent methods are largely learning-based, and tackle realistic everyday activities (e.g., making tea, riding a bike). Due to the complexity of the problem, as well as the density of raw data comprising even short videos, useful video representations are often computationally intensive—whether dense trajectories, interest points, object detectors, or convolutional neural network (CNN) features run on each frame [2–8]. In fact, the expectation is that the more features one extracts from the video, the better for accuracy. For a practitioner wanting reliable activity recognition, then, the message is to “leave no stone unturned”, ideally extracting complementary descriptors from all video frames.

Electronic supplementary material: The online version of this chapter (doi:10.1007/978-3-319-46478-7_48) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 783–800, 2016.
DOI: 10.1007/978-3-319-46478-7_48


Y.-C. Su and K. Grauman

However, the “no stone unturned” strategy is problematic. Not only does it assume virtually unbounded computational resources, it also assumes that an entire video is available at once for batch processing. In reality, a recognition system will have some computational budget. Further, it may need to perform in a streaming manner, with access to only a short buffer of recent frames. Together, these considerations suggest some form of feature triage is needed. Yet prioritizing features for activity recognition in video is challenging, for two key reasons. First, the most informative features may depend critically o