Temporal Convolutional Networks: A Unified Approach to Action Segmentation
Abstract. The dominant paradigm for video-based action segmentation is composed of two steps: first, compute low-level features for each frame using Dense Trajectories or a Convolutional Neural Network to encode local spatiotemporal information, and second, input these features into a classifier such as a Recurrent Neural Network (RNN) that captures high-level temporal relationships. While often effective, this decoupling requires specifying two separate models, each with their own complexities, and prevents capturing more nuanced long-range spatiotemporal relationships. We propose a unified approach, as demonstrated by our Temporal Convolutional Network (TCN), that hierarchically captures relationships at low-, intermediate-, and high-level time-scales. Our model achieves superior or competitive performance using video or sensor data on three public action segmentation datasets and can be trained in a fraction of the time it takes to train an RNN.
1 Introduction
Action segmentation is crucial for numerous applications ranging from collaborative robotics to modeling activities of daily living. Given a video, the goal is to simultaneously segment every action in time and classify each constituent segment. While recent work has shown strong improvements on this task, models tend to decouple low-level feature representations from high-level temporal models. Within video analysis, these low-level features may be computed by pooling handcrafted features (e.g. Improved Dense Trajectories (IDT) [21]) or concatenating learned features (e.g. Spatiotemporal Convolutional Neural Networks (ST-CNN) [8,12]) over a short period of time. High-level temporal classifiers capture a local history of these low-level features. In a Conditional Random Field (CRF), the action prediction at one time step is often a function of the prediction at the previous time step, and in a Recurrent Neural Network (RNN), the predictions are a function of a set of latent states at each time step, where the latent states are connected across time. This two-step paradigm has been around for decades (e.g., [6]) and typically goes unquestioned. However, we posit that valuable information is lost between steps. In this work, we introduce a unified approach to action segmentation that uses a single set of computational mechanisms – 1D convolutions, pooling, and channel-wise normalization – to hierarchically capture low-, intermediate-, and
high-level temporal information. For each layer, 1D convolutions capture how features at lower levels change over time, pooling enables efficient computation of long-range temporal patterns, and normalization improves robustness towards varying environmental conditions. In contrast with RNN-based models, which compute a set of latent activations that are updated sequentially per-frame, we compute a set of latent activations that are updated hierarchically, layer by layer.
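As a concrete illustration of these three mechanisms, the sketch below shows one encoder layer combining a 1D temporal convolution, channel-wise normalization, and temporal max pooling. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the `TCNEncoderLayer` class, the kernel size, the channel widths, and the max-based normalization constant are illustrative choices.

```python
# Minimal sketch of one encoder layer of a Temporal Convolutional Network.
# Layer sizes and the normalization scheme are illustrative assumptions,
# not the exact configuration from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNEncoderLayer(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=25):
        super().__init__()
        # 1D convolution over time: captures how lower-level features
        # change across neighboring frames.
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        # x: (batch, channels, time)
        h = F.relu(self.conv(x))
        # Channel-wise normalization: rescale each frame's activation
        # vector by its largest response, for robustness to varying
        # conditions (epsilon avoids division by zero).
        h = h / (h.max(dim=1, keepdim=True).values + 1e-5)
        # Max pooling over time halves the sequence length, so deeper
        # layers see progressively longer-range temporal patterns.
        return F.max_pool1d(h, kernel_size=2)

# Example: three stacked layers map per-frame features to activations
# at one-eighth of the original temporal resolution.
feats = torch.randn(1, 128, 256)             # (batch, feature dim, frames)
layers = nn.Sequential(TCNEncoderLayer(128, 96),
                       TCNEncoderLayer(96, 96),
                       TCNEncoderLayer(96, 64))
out = layers(feats)                          # shape: (1, 64, 32)
```

Because each layer halves the temporal resolution, stacking a few such layers lets each activation summarize an increasingly wide window of frames, which is how the hierarchy reaches long time-scales without recurrent connections.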