Train Sparsely, Generate Densely: Memory-Efficient Unsupervised Training of High-Resolution Temporal GAN

Masaki Saito · Shunta Saito · Masanori Koyama · Sosuke Kobayashi

Received: 15 May 2019 / Accepted: 21 April 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
Training a generative adversarial network (GAN) on a video dataset is challenging because of the sheer size of the dataset and the complexity of each observation. In general, the computational cost of training a GAN scales exponentially with the resolution. In this study, we present a novel memory-efficient method of unsupervised learning for high-resolution video datasets whose computational cost scales only linearly with the resolution. We achieve this by designing the generator as a stack of small sub-generators and training the model in a specific way: we train each sub-generator with its own discriminator, and at training time we introduce between each pair of consecutive sub-generators an auxiliary subsampling layer that reduces the frame rate by a certain ratio. This procedure allows each sub-generator to learn the distribution of the video at a different level of resolution. As a result, we need only a few GPUs to train a highly complex generator that far outperforms its predecessor in terms of inception score.

Keywords Generative adversarial network · Video generation · Subsampling layer
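To make the training scheme described in the abstract concrete, the following is a minimal sketch in PyTorch of a stacked generator with training-time frame subsampling. It is only an illustration under our own assumptions: the module names, channel widths, and subsampling ratio are hypothetical, and the sketch omits details of the actual architecture (e.g., mapping each stage's output to RGB before feeding its discriminator).

import torch
import torch.nn as nn

class SubGenerator(nn.Module):
    """One stage: upsamples a video tensor (N, C, T, H, W) by 2x in space."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(in_ch, out_ch, kernel_size=(1, 4, 4),
                               stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)

def subsample_frames(x, ratio):
    """Auxiliary training-time layer: keep every `ratio`-th frame along T."""
    return x[:, :, ::ratio]

class StackedGenerator(nn.Module):
    def __init__(self, channels=(256, 128, 64, 32)):
        super().__init__()
        self.stages = nn.ModuleList(
            [SubGenerator(c_in, c_out)
             for c_in, c_out in zip(channels[:-1], channels[1:])]
        )

    def forward(self, z, train=True, ratio=2):
        # z: a coarse latent video, e.g. of shape (N, 256, 16, 4, 4).
        x = z
        outputs = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            # Each stage's output is scored by that stage's own discriminator.
            outputs.append(x)
            if train and i < len(self.stages) - 1:
                # Reduce the frame rate before the next, higher-resolution
                # stage, so the tensor volume (and hence memory) grows only
                # linearly with the spatial resolution.
                x = subsample_frames(x, ratio)
        return outputs

At inference time the subsampling layers are simply skipped (train=False), so every stage runs at the full frame rate: the model trains on sparse frames but generates dense video.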

1 Introduction

Generative adversarial networks (GANs) are a powerful family of unsupervised learning methods, and various versions of GANs have been developed to date for different types of datasets, including image and audio datasets. GANs have been particularly successful in their application to image datasets (Radford et al. 2016; Miyato et al. 2018; Brock et al. 2018). In this study, we present a novel method of unsupervised learning for video datasets, an important type of data with numerous applications such as autonomous vehicles, creative tasks, video compression, and frame interpolation.

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Masaki Saito [email protected]
Shunta Saito [email protected]
Masanori Koyama [email protected]
Sosuke Kobayashi [email protected]

There are two major challenges in training a generative model on a video dataset. The first challenge comes from the sheer complexity of each observation. Video has a time dimension in addition to width and height, and the correlation between each pair of time frames is usually governed by complex dynamics underlying the system. Moreover, many applications of video generation methods, including those pertaining to industrial projects, require every frame of the generated video to be photo-realistic. This is a challenging problem in its own right, because photo-realistic image generation has become possible only recently with the invention of techniques that stabilize the training of GANs on large datasets (Karras et al. 2018; Miyato et al. 2018; Mescheder et al. 2018). One must not only prepare a m