Progressive Multi-granularity Analysis for Video Prediction
Jingwei Xu1 · Bingbing Ni1 · Xiaokang Yang1,2
Received: 6 June 2019 / Accepted: 26 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Communicated by Ivan Laptev.
Bingbing Ni (corresponding author), [email protected]; Jingwei Xu, [email protected]
1 Shanghai Jiao Tong University, Shanghai 200240, China
2 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Abstract
Video prediction is challenging because real-world motion dynamics are usually multi-modally distributed. Existing stochastic methods commonly draw random noise input from a simple prior distribution, which is insufficient to model highly complex motion dynamics. This work proposes a progressive multiple-granularity analysis framework to tackle this difficulty. First, to achieve coarse alignment, the input sequence is matched to prototype motion dynamics in the training set, based on self-supervised auto-encoder learning with motion/appearance disentanglement. Second, motion dynamics are transferred from the matched prototype sequence to the input sequence via an adaptively learned kernel, and the predicted frames are further refined by a motion-aware prediction model. Extensive qualitative and quantitative experiments on three widely used video prediction datasets demonstrate that (1) the proposed framework decomposes a hard task into a series of more approachable sub-tasks, for which better solutions are easier to find, and (2) the proposed method performs favorably against state-of-the-art prediction methods.
Keywords Video prediction · Multiple granularity analysis
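To make the first, coarse-alignment stage concrete, the following is a minimal sketch (not the authors' implementation) of matching an input sequence to a training prototype by nearest-neighbor search over motion embeddings; the 128-dimensional embedding size, the cosine-similarity criterion, and the function names are illustrative assumptions.

import numpy as np

# Sketch of coarse alignment: retrieve the training prototype whose
# motion embedding is most similar to that of the input sequence.
# Embedding size and similarity measure are illustrative assumptions.
def match_prototype(query_emb, prototype_embs):
    q = query_emb / np.linalg.norm(query_emb)
    p = prototype_embs / np.linalg.norm(prototype_embs, axis=1, keepdims=True)
    return int(np.argmax(p @ q))  # index of the best-matching prototype

rng = np.random.default_rng(0)
prototypes = rng.standard_normal((1000, 128))  # 1000 stored motion embeddings
query = rng.standard_normal(128)               # motion embedding of the input
print(match_prototype(query, prototypes))

In the paper's pipeline, the embeddings would come from the motion branch of the disentangled auto-encoder, so that matching is driven by dynamics rather than appearance.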
1 Introduction
As a naturally data-driven way to model the dynamics of a sophisticated system, video prediction has demonstrated tremendous potential value in many downstream applications (Pathak et al. 2017; Kurutach et al. 2018; Nair et al. 2018), such as model-based reinforcement learning, driving path planning, and robot manipulation. Given several consecutive frames as input, the goal of video prediction is to generate the raw pixels of future frames. Unlike conventional video semantic prediction, whose output is a relatively low-dimensional label vector, this task requires pixel-level prediction, usually over multiple timestamps. Pixel-wise prediction makes the solution space grow exponentially with the spatial and temporal size of the predicted frames. Conventional methods with recurrent, deterministic architectures (Finn et al. 2016; Jia et al. 2016; Denton and Birodkar 2017) often fail to predict high-quality video frames; shape deformation and prediction mismatch are typical issues yet to be solved. All these problems relate closely to one fundamental challenge of this task, namely error accumulation (Pathak et al. 2017; Kurutach et al. 2018; Nair et al. 2018). The main reason for this challenge is that long-term prediction leads to a highly complex and multi-modal distribution of future frames, but a deterministic architecture can capture only a single mode of that distribution.
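As a back-of-the-envelope illustration of this growth (with assumed toy sizes, not values from the paper): predicting T frames of H × W RGB pixels means regressing T·H·W·C values, and the number of distinct 8-bit pixel outputs is 256^(T·H·W·C).

import math

# Toy illustration (assumed sizes) of how the prediction space explodes
# with the spatial and temporal extent of the predicted frames.
T, H, W, C = 10, 64, 64, 3       # prediction horizon and frame size (assumed)
dims = T * H * W * C             # values to regress per predicted sequence
print(dims)                      # 122880
print(f"~10^{int(dims * math.log10(256))} distinct 8-bit pixel outputs")

Even at this modest resolution the output has over a hundred thousand dimensions, which is why error accumulation across timestamps is so damaging.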