High-Quality Video Generation from Static Structural Annotations



Lu Sheng1 · Junting Pan2 · Jiaming Guo3 · Jing Shao4 · Chen Change Loy5

Received: 15 May 2019 / Accepted: 24 April 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract  This paper proposes a novel unsupervised video generation framework conditioned on a single structural annotation map, which, in contrast to prior conditioned video generation approaches, strikes a good balance between motion flexibility and visual quality in the generation process. Different from end-to-end approaches that model the scene appearance and dynamics in a single shot, we decompose this difficult task into two easier sub-tasks in a divide-and-conquer fashion, achieving remarkable results overall. The first sub-task is an image-to-image (I2I) translation task that synthesizes a high-quality starting frame from the input structural annotation map. The second sub-task, image-to-video (I2V) generation, takes the synthesized starting frame and the associated structural annotation map and animates the scene dynamics to generate a photorealistic and temporally coherent video. We employ a cycle-consistent, flow-based conditioned variational autoencoder to capture the long-term motion distributions; the learned bi-directional flows ensure the physical reliability of the predicted motions and provide explicit occlusion handling in a principled manner. Integrating structural annotations into the flow prediction also improves structural awareness in the I2V generation process. Quantitative and qualitative evaluations on autonomous driving and human action datasets demonstrate the effectiveness of the proposed approach over state-of-the-art methods. The code has been released: https://github.com/junting/seg2vid.

Keywords  Unsupervised learning · Conditioned generative model · Image and video synthesis · Motion prediction and estimation
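To make the two-stage decomposition concrete, the sketch below is a minimal, hypothetical PyTorch illustration rather than the authors' released implementation (see the repository above); the module names, layer sizes, number of frames, and the 20-class annotation map are illustrative assumptions. Stage one translates the structural annotation map into a starting frame; stage two is a conditioned variational autoencoder that samples a latent motion code and decodes bi-directional flow fields conditioned on that frame and the same annotation map.

```python
import torch
import torch.nn as nn

class Image2Image(nn.Module):
    """Stage 1 (illustrative): translate a structural annotation map
    (e.g. a semantic segmentation map) into a starting RGB frame."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, seg):
        return self.net(seg)

class FlowCVAE(nn.Module):
    """Stage 2 (illustrative): a conditioned VAE that samples a latent
    motion code and decodes bi-directional flow fields, which would be
    used to warp the starting frame into future frames."""
    def __init__(self, num_classes=20, z_dim=16, num_frames=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + num_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2 * z_dim),
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(3 + num_classes + z_dim, 64, 3, padding=1), nn.ReLU(),
            # 2 channels (dx, dy) per frame, for forward and backward flows
            nn.Conv2d(64, 2 * 2 * num_frames, 3, padding=1),
        )

    def forward(self, frame, seg):
        b, _, h, w = frame.shape
        # Encode frame + annotation map into a latent motion distribution
        mu, logvar = self.encoder(torch.cat([frame, seg], dim=1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        z_map = z.view(b, -1, 1, 1).expand(b, z.size(1), h, w)
        flows = self.decoder(torch.cat([frame, seg, z_map], dim=1))
        fwd, bwd = flows.chunk(2, dim=1)  # bi-directional flow fields
        return fwd, bwd, mu, logvar

# Usage: annotation map -> starting frame -> bi-directional motion flows.
seg = torch.randn(1, 20, 64, 64)                 # stand-in annotation map
frame0 = Image2Image()(seg)                      # stage 1: starting frame
fwd, bwd, mu, logvar = FlowCVAE()(frame0, seg)   # stage 2: sampled flows
```

In the full method, the predicted forward flows would warp the starting frame into future frames, while the backward flows support the cycle-consistency and occlusion reasoning described in the abstract; those steps are omitted here for brevity.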

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

1 College of Software, Beihang University, Beijing, China

2 CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong, Hong Kong SAR, China

1 Introduction

Mapping visual abstractions to actual pixels in a video is an important inverse task of video understanding, which offers another viewpoint on visual dynamics and is essential to advancing intelligent agents that perceive the visual world as human beings do. This task is also interesting and useful for a wide range of applications in computer vision, computer graphics, robotics, and even artistic creation, such as preparing training data for data-demanding vision (Zheng et al. 2017) or reinforcement learning (Arulkumaran et al. 2017) tasks whose ground-truth annotations are too cumbersome to obtain, photorealistic rendering of abstract graphical content, or, inversely, artistic rendering of everyday videos under controllable style abstraction. There has been much progress in
