High-Quality Video Generation from Static Structural Annotations


Lu Sheng1 · Junting Pan2 · Jiaming Guo3 · Jing Shao4 · Chen Change Loy5

Received: 15 May 2019 / Accepted: 24 April 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract This paper proposes a novel unsupervised video generation approach conditioned on a single structural annotation map, which, in contrast to prior conditioned video generation approaches, provides a good balance between motion flexibility and visual quality in the generation process. Unlike end-to-end approaches that model scene appearance and dynamics in a single shot, we decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion, and this decomposition improves the overall results. The first sub-task is an image-to-image (I2I) translation task that synthesizes a high-quality starting frame from the input structural annotation map. The second sub-task, image-to-video (I2V) generation, uses the synthesized starting frame and the associated structural annotation map to animate the scene dynamics and generate a photorealistic and temporally coherent video. We employ a cycle-consistent flow-based conditioned variational autoencoder to capture long-term motion distributions; the learned bi-directional flows ensure the physical reliability of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves structural awareness in the I2V generation process. Quantitative and qualitative evaluations on autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods. The code has been released: https://github.com/junting/seg2vid.

Keywords Unsupervised learning · Conditioned generative model · Image and video synthesis · Motion prediction and estimation
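To make the role of bi-directional flows in occlusion handling more concrete, the following is a minimal sketch of a forward-backward consistency check, one common way to derive an occlusion mask from a pair of flows. It is an illustration under stated assumptions, not the formulation used in this paper: the function name `occlusion_mask`, the `Flow` type, the nearest-neighbour lookup, and the tolerance `tol` are all hypothetical choices made for the example.

```python
from typing import List, Tuple

# Hypothetical representation: flow[y][x] = (dx, dy) displacement in pixels.
Flow = List[List[Tuple[float, float]]]


def occlusion_mask(forward: Flow, backward: Flow, tol: float = 0.5) -> List[List[bool]]:
    """Return mask[y][x] = True where pixel (x, y) is judged occluded.

    Assumption: a pixel is visible in both frames only if following the
    forward flow and then the backward flow returns close to the start.
    """
    height, width = len(forward), len(forward[0])
    mask: List[List[bool]] = []
    for y in range(height):
        row: List[bool] = []
        for x in range(width):
            dx, dy = forward[y][x]
            # Nearest-neighbour landing position in the target frame.
            tx, ty = round(x + dx), round(y + dy)
            if not (0 <= tx < width and 0 <= ty < height):
                # The pixel leaves the frame: treat it as occluded.
                row.append(True)
                continue
            bx, by = backward[ty][tx]
            # For a consistent pair, forward + backward is close to zero.
            err = ((dx + bx) ** 2 + (dy + by) ** 2) ** 0.5
            row.append(err > tol)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # A 1x3 frame whose content shifts one pixel to the right.
    fwd = [[(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]]
    bwd = [[(-1.0, 0.0), (-1.0, 0.0), (-1.0, 0.0)]]
    print(occlusion_mask(fwd, bwd))
```

In a learned model the flows would be dense network outputs and the lookup would typically use differentiable bilinear sampling; the nearest-neighbour version here only keeps the sketch short.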

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, MingYu Liu, Jan Kautz, Antonio Torralba.

1 College of Software, Beihang University, Beijing, China

2 CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Hong Kong SAR, China

1 Introduction

Mapping visual abstractions to actual pixels in a video is an important inverse task of video understanding. It offers another viewpoint on visual dynamics and is essential to the development of intelligent agents that perceive the visual world as human beings do. The task is also useful for a wide range of applications in computer vision, computer graphics, robotics and even artistic creation. Examples include preparing training data for data-demanding vision (Zheng et al. 2017) or reinforcement learning (Arulkumaran et al. 2017) tasks whose ground-truth annotations are too cumbersome to obtain, photorealistic rendering of abstract graphic content, and, inversely, artistic rendering of everyday videos according to a controllable style abstraction. There has been much progress in