Multimodal Image Synthesis with Conditional Implicit Maximum Likelihood Estimation

Ke Li1 · Shichong Peng2 · Tianhao Zhang3 · Jitendra Malik1

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Many tasks in computer vision and graphics fall within the framework of conditional image synthesis. In recent years, generative adversarial nets have delivered impressive advances in the quality of synthesized images. However, it remains a challenge to generate images that are both diverse and plausible for the same input, due to the problem of mode collapse. In this paper, we develop a new generic multimodal conditional image synthesis method based on implicit maximum likelihood estimation and demonstrate improved multimodal image synthesis performance on two tasks: single-image super-resolution and image synthesis from scene layouts. We make our implementation publicly available.

Keywords Conditional image synthesis · Multimodal image synthesis · Deep generative models · Implicit maximum likelihood estimation

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Ke Li, Shichong Peng and Tianhao Zhang have contributed equally to this work.

Code for super-resolution is available at https://github.com/niopeng/SRIM-pytorch and code for image synthesis from scene layout is available at https://github.com/zth667/Diverse-Image-Synthesis-from-Semantic-Layout.

Ke Li
[email protected]

Shichong Peng
[email protected]

Tianhao Zhang
[email protected]

Jitendra Malik
[email protected]

1 University of California, Berkeley, USA

2 University of Toronto, Toronto, Canada

3 Nanjing University, Nanjing, China

1 Introduction

In conditional image synthesis, the goal is to generate an image from some input, which can influence the image that is generated. It encompasses a broad range of tasks; examples include super-resolution, which aims to generate high-resolution images from low-resolution inputs, and image synthesis from scene layout, which aims to generate images from semantic segmentation maps. Deep learning has increasingly been used for image synthesis in recent years. Deep generative models, such as generative adversarial nets (GANs) (Goodfellow et al. 2014; Gutmann et al. 2014), have emerged as one of the most popular approaches and have delivered impressive advances in image quality. Predominant approaches focus on the setting of generating a single image for each input image, which we will refer to as the unimodal prediction problem. Relatively less attention has been devoted to the more general and challenging problem of multimodal prediction, which aims to generate multiple equally plausible images for the same input image (examples of multimodal image synthesis problems and a preview of the results are shown in Figs. 1 and 2). Why is the latter important? Conditional image synthesis is, by its very nature, ill-posed. That is, the information in the input is not enough to fully constrain the degrees of freedom in the output, and
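To make the idea concrete, the following is a minimal sketch of one conditional-IMLE training step in PyTorch (the language of the released code). The generator interface generator(x, z), the latent dimension, the number of candidate samples, and the plain pixel-space squared-L2 distance are all illustrative assumptions for this sketch, not the paper's exact implementation.

import torch

def cimle_step(generator, optimizer, x, y, num_samples=10, z_dim=64):
    # One conditional-IMLE step (sketch, with assumed interfaces): for each
    # input x, draw several candidate outputs by varying the latent code,
    # find the candidate closest to the ground truth y, and pull it toward y.
    batch = x.size(0)
    with torch.no_grad():
        zs = torch.randn(num_samples, batch, z_dim, device=x.device)
        # Squared L2 distance of every candidate to the ground truth
        # (pixel-space distance is an assumption of this sketch).
        dists = torch.stack(
            [((generator(x, z) - y) ** 2).flatten(1).sum(1) for z in zs]
        )                                      # shape: (num_samples, batch)
        nearest = dists.argmin(dim=0)          # closest candidate per input
    # Recompute only the selected samples with gradients enabled and
    # minimize their distance to the ground truth.
    z_star = zs[nearest, torch.arange(batch, device=x.device)]
    loss = ((generator(x, z_star) - y) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because only the nearest candidate is pulled toward each ground-truth image, no sample is penalized for being different from the data, which is what allows multiple modes to survive training; at test time, multiple plausible outputs for the same input are obtained simply by sampling different latent codes.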