

ORIGINAL ARTICLE

Unsupervised video-to-video translation with preservation of frame modification tendency

Huajun Liu1 · Chao Li1 · Dian Lei1 · Qing Zhu2

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Tremendous advances have been achieved in image translation through the use of generative adversarial networks (GANs). For video-to-video translation, a similar idea has been leveraged in various studies, some of which focus on the associations among relevant frames. However, existing GAN-based video-synthesis methods do not fully exploit the spatial-temporal information in videos, especially across consecutive frames. In this paper, we propose an efficient video translation method that preserves the frame modification trends in the sequential frames of the original video and smooths the variations between the generated frames. To constrain the consistency of this tendency between the generated video and the original one, we propose a tendency-invariant loss that encourages further exploitation of spatial-temporal information. Experiments show that our method learns richer information from adjacent frames and generates more desirable videos than the baselines, i.e., Recycle-GAN and CycleGAN.

Keywords Video translation · Generative adversarial networks · Unsupervised · Spatial-temporal information
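The abstract describes the tendency-invariant loss only at a high level. As a rough illustration, the sketch below assumes it penalizes the mismatch between the frame-to-frame change of the source clip and that of the translated clip; the function name, interface, and L1 form are our own assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def tendency_invariant_loss(x_t, x_t1, y_t, y_t1):
    """Illustrative sketch only: penalize the gap between how the source
    clip changes (x_t -> x_t1) and how the translated clip changes
    (y_t -> y_t1). The L1 form is an assumption, not the paper's formula."""
    source_change = x_t1 - x_t          # frame modification tendency of the input
    translated_change = y_t1 - y_t      # tendency of the generated frames
    return F.l1_loss(translated_change, source_change)

# Hypothetical usage with a frame generator G and a clip x of shape (B, T, C, H, W):
# loss = tendency_invariant_loss(x[:, 0], x[:, 1], G(x[:, 0]), G(x[:, 1]))
```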

1 Introduction

Nowadays, image translation is widely used to synthesize vivid pictures in required styles; for example, we can render realistic photographs in the styles of Van Gogh and Monet paintings. Likewise, a similar idea can be used to translate videos, which contain far more abundant spatial-temporal information than static image data: images are independent of each other, whereas the frames of a video are tightly correlated. As a result, methods for static image translation can hardly meet the requirements of video synthesis, as they do not take the spatial-temporal continuity of video frames into consideration.


Huajun Liu [email protected]
Chao Li [email protected]
Dian Lei [email protected]
Qing Zhu [email protected]

1 School of Computer Science, Wuhan University, Wuhan, China
2 Faculty of Geosciences and Environmental Engineering, Southwest Jiaotong University, Chengdu, China

Many approaches attempt to synthesize videos in new styles using generative adversarial networks (GANs) [12]. Wang et al. [37] propose a video-to-video synthesis approach that uses a GAN framework together with a spatial-temporal adversarial objective to synthesize high-resolution and temporally coherent videos, but it requires paired data as input. For unpaired videos, Bansal et al. [2] combine spatial-temporal information with adversarial losses for content translation and style preservation. Both methods exploit the frame continuity in videos and perform better than previous methods that only utilize the information in single frames [34]. Nonetheless, they tend to concent
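For context on how such spatial-temporal terms are typically coupled with adversarial training, the sketch below illustrates a Recycle-GAN-style recycle term: translate consecutive source frames to the target domain, predict the next target-domain frame, map it back, and compare it with the true next source frame. The generator and predictor interfaces (G_XY, G_YX, P_Y) and the choice of L1 distance are our assumptions for illustration, not the exact objective of Bansal et al. [2].

```python
import torch.nn.functional as F

def recycle_term(x_t, x_t1, x_t2, G_XY, G_YX, P_Y):
    """Illustrative Recycle-GAN-style term: translate to the target domain,
    predict the next target-domain frame from the two translated frames,
    map the prediction back, and compare with the true next source frame.
    All interfaces here are hypothetical."""
    y_t, y_t1 = G_XY(x_t), G_XY(x_t1)   # translate two consecutive frames
    y_t2_hat = P_Y(y_t, y_t1)           # temporal prediction in the target domain
    x_t2_hat = G_YX(y_t2_hat)           # map the prediction back to the source domain
    return F.l1_loss(x_t2_hat, x_t2)

# Such a term is typically added to the usual adversarial losses for both
# mapping directions, so temporal structure and style are learned jointly.
```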