Towards Image-to-Video Translation: A Structure-Aware Approach via Multi-stage Generative Adversarial Networks



Long Zhao¹ · Xi Peng² · Yu Tian¹ · Mubbasir Kapadia¹ · Dimitris N. Metaxas¹

¹ Rutgers University, Piscataway, NJ 08854, USA
² University of Delaware, Newark, DE 19716, USA

Received: 28 April 2019 / Accepted: 4 April 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Abstract
In this paper, we consider the problem of image-to-video translation, where one or a set of input images is translated into an output video containing motions of a single object. In particular, we focus on predicting motions conditioned on high-level structures, such as facial expression and human pose. Recent approaches are either condition-driven or temporal-based. Condition-driven approaches typically train transformation networks to generate future frames conditioned on the predicted structural sequence. Temporal-based approaches, on the other hand, have shown that short high-quality motions can be generated using 3D convolutional networks with temporal knowledge learned from massive training data. In this work, we combine the benefits of both approaches and propose a two-stage generative framework in which videos are first forecast from the structural sequence and then refined by temporal signals. To model motions more efficiently in the forecasting stage, we train networks with dense connections to learn residual motions between the current and future frames, which avoids learning motion-irrelevant details. To ensure temporal consistency in the refining stage, we adopt a ranking loss for adversarial training. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over the state of the art on both tasks demonstrate the effectiveness of our approach.

Keywords Image-to-video translation · Video generation · Multi-stage GANs · Motion prediction · Residual learning
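As a rough illustration of the residual-motion idea summarized above, the following PyTorch-style sketch is a simplification and not the paper's actual architecture: the class name ResidualMotionGenerator, the plain convolutional stack standing in for dense connections, and all tensor shapes are assumptions made only for this example. The generator predicts the change between the current frame and a future frame, and the future frame is reconstructed by adding that residual back to the input, so motion-irrelevant appearance details are carried over unchanged.

# Minimal, illustrative sketch of residual motion learning (assumed names and
# shapes, not the paper's implementation): the generator outputs a residual
# that is added back to the input frame to obtain the forecast frame.
import torch
import torch.nn as nn

class ResidualMotionGenerator(nn.Module):
    def __init__(self, in_channels=3, cond_channels=1, hidden=64):
        super().__init__()
        # A plain conv stack stands in for the densely connected blocks here;
        # the input is the current frame concatenated with a structure map
        # (e.g., a pose heatmap or facial landmarks) for the target time step.
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + cond_channels, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, in_channels, 3, padding=1),
            nn.Tanh(),  # residual motion bounded to [-1, 1]
        )

    def forward(self, frame, structure):
        residual = self.net(torch.cat([frame, structure], dim=1))
        # Future frame = current frame + predicted residual motion.
        return frame + residual, residual

# Usage: forecast one future frame from the current frame and a target pose map.
gen = ResidualMotionGenerator()
frame = torch.randn(1, 3, 64, 64)   # current RGB frame
pose = torch.randn(1, 1, 64, 64)    # target structure (e.g., pose heatmap)
future_frame, motion = gen(frame, pose)
print(future_frame.shape)           # torch.Size([1, 3, 64, 64])

In the full two-stage framework described in the abstract, such a forecast frame would subsequently be passed to the temporal refinement stage; the snippet only illustrates the residual prediction step.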

1 Introduction

Generative modeling of images and videos is a fundamental but challenging problem in computer vision. Previous methods, such as Variational Auto-Encoders (VAEs) (Kingma et al. 2014; Rezende et al. 2014), adopt probabilistic graphical models to maximize a lower bound on the data likelihood. Other methods, such as PixelRNN (van den Oord et al. 2016), aim to model the conditional distribution over the pixel space for image generation. Recent progress made in this field with Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) has attracted a lot of research interest. During the training of a GAN, a generator and a discriminator play a zero-sum game: the generator aims to produce samples that match the true data distribution in order to fool the discriminator, while the di