Temporal Convolutional Networks: A Unified Approach to Action Segmentation
Abstract. The dominant paradigm for video-based action segmentation is composed of two steps: first, compute low-level features for each frame using Dense Trajectories or a Convolutional Neural Network to encode local spatiotemporal information, and second, input these features into a classifier such as a Recurrent Neural Network (RNN) that captures high-level temporal relationships. While often effective, this decoupling requires specifying two separate models, each with their own complexities, and prevents capturing more nuanced long-range spatiotemporal relationships. We propose a unified approach, as demonstrated by our Temporal Convolutional Network (TCN), that hierarchically captures relationships at low-, intermediate-, and high-level time-scales. Our model achieves superior or competitive performance using video or sensor data on three public action segmentation datasets and can be trained in a fraction of the time it takes to train an RNN.
1 Introduction
Action segmentation is crucial for numerous applications ranging from collaborative robotics to modeling activities of daily living. Given a video, the goal is to simultaneously segment every action in time and classify each constituent segment. While recent work has shown strong improvements on this task, models tend to decouple low-level feature representations from high-level temporal models. Within video analysis, these low-level features may be computed by pooling handcrafted features (e.g. Improved Dense Trajectories (IDT) [21]) or concatenating learned features (e.g. Spatiotemporal Convolutional Neural Networks (ST-CNN) [8,12]) over a short period of time. High-level temporal classifiers capture a local history of these low-level features. In a Conditional Random Field (CRF), the action prediction at one time step is often a function of the prediction at the previous time step, and in a Recurrent Neural Network (RNN), the predictions are a function of a set of latent states at each time step, where the latent states are connected across time. This two-step paradigm has been around for decades (e.g., [6]) and typically goes unquestioned. However, we posit that valuable information is lost between steps. In this work, we introduce a unified approach to action segmentation that uses a single set of computational mechanisms – 1D convolutions, pooling, and channel-wise normalization – to hierarchically capture low-, intermediate-, and
high-level temporal information. For each layer, 1D convolutions capture how features at lower levels change over time, pooling enables efficient computation of long-range temporal patterns, and normalization improves robustness towards varying environmental conditions. In contrast with RNN-based models, which compute a set of latent activations that are updated sequentially per-frame, we compute a set of latent activations that are updated hierarchically, layer by layer.
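As a concrete illustration of these three mechanisms, the sketch below shows one encoder layer combining a 1D temporal convolution, channel-wise normalization, and temporal max pooling. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the `TCNEncoderLayer` class, the kernel size, the channel widths, and the max-based normalization constant are illustrative choices.

```python
# Minimal sketch of one encoder layer of a Temporal Convolutional Network.
# Layer sizes and the normalization scheme are illustrative assumptions,
# not the exact configuration from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNEncoderLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=25):
        super().__init__()
        # 1D convolution over time: captures how lower-level features
        # change across neighboring frames.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, channels, time)
        h = F.relu(self.conv(x))
        # Channel-wise normalization: rescale each frame's activation
        # vector by its largest response, for robustness to varying
        # conditions (epsilon avoids division by zero).
        h = h / (h.max(dim=1, keepdim=True).values + 1e-5)
        # Max pooling over time halves the sequence length, so deeper
        # layers see progressively longer-range temporal patterns.
        return F.max_pool1d(h, kernel_size=2)

# Example: three stacked layers map per-frame features to activations
# at one-eighth of the original temporal resolution.
feats = torch.randn(1, 128, 256)             # (batch, feature dim, frames)
layers = nn.Sequential(TCNEncoderLayer(128, 96),
                       TCNEncoderLayer(96, 96),
                       TCNEncoderLayer(96, 64))
out = layers(feats)                          # shape: (1, 64, 32)
```

Because each layer halves the temporal resolution, stacking a few such layers lets each activation summarize an increasingly wide window of frames, which is how the hierarchy reaches long time-scales without recurrent connections.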