Compositional GAN: Learning Image-Conditional Binary Composition



Samaneh Azadi · Deepak Pathak · Sayna Ebrahimi · Trevor Darrell
University of California, Berkeley, USA

Received: 20 April 2019 / Accepted: 30 April 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Abstract
Generative Adversarial Networks can produce images of remarkable complexity and realism, but they are generally structured to sample from a single latent source, ignoring the explicit spatial interactions between the multiple entities that could be present in a scene. Capturing such complex interactions between different objects in the world, including their relative scaling, spatial layout, occlusion, or viewpoint transformation, is a challenging problem. In this work, we propose a novel self-consistent Composition-by-Decomposition network to compose a pair of objects. Given object images from two distinct distributions, our model can generate a realistic composite image from their joint distribution, following the texture and shape of the input objects. We evaluate our approach through qualitative experiments and user evaluations. Our results indicate that the learned model captures potential interactions between the two object domains and generates realistic composed scenes at test time.

Keywords: Conditional Generative Adversarial Network · Composition · Decomposition
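As background for the conditional GAN setting named in the keywords, and formalized in the works cited in the introduction below (Mirza and Osindero 2014; Isola et al. 2017), the standard image-conditional objective is the minimax game

\min_G \max_D \; \mathbb{E}_{x,y}\big[\log D(x, y)\big] \;+\; \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big],

where x is the conditioning input (an image, text phrase, or class label), y a real sample from the target distribution, z a noise vector, and G and D the generator and discriminator. This is generic cGAN background, not the specific composition objective proposed in this paper.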

1 Introduction

Conditional Generative Adversarial Networks (cGANs) have emerged as a powerful method for generating images conditioned on a given input. The input cue could be in the form of an image (Isola et al. 2017; Zhu et al. 2017a; Liu et al. 2017; Azadi et al. 2017; Wang et al. 2017; Pathak et al. 2016), a text phrase (Zhang et al. 2017; Reed et al. 2016b, a; Johnson et al. 2018), or a class label or layout (Mirza and Osindero 2014; Odena et al. 2016; Antoniou et al. 2017). The goal in most of these GAN instances is to learn a mapping that translates a given sample from the source distribution into a sample from the output distribution. This primarily involves either transforming a single object of interest (apples to oranges, horses to zebras, label to image, etc.) or changing the style and texture of the input image (day to night, etc.). However, these direct transformations do not capture the fact that a natural image is a 2D projection of a composition of multiple objects interacting in a 3D visual world. Here, we explore the role of compositionality in GAN frameworks and propose a new method that learns a function mapping images of different objects, sampled from their marginal distributions (e.g., chair and table), into a combined sample (table–chair) that captures the joint distribution of object pairs. In this paper, we specifically focus on the composition of a pair of objects. Modeli
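To make the composition-by-decomposition self-consistency described in the abstract concrete, the minimal sketch below pairs a composition generator (two object images in, one composite out) with a decomposition generator (composite in, two object images out) and ties them together with a reconstruction loss. All module names, layer sizes, and the plain L1 loss are illustrative assumptions; the paper's actual architecture and full training objective (including its adversarial terms) are not reproduced here.

# Minimal, hypothetical sketch of composition-by-decomposition self-consistency.
# Everything below (network shapes, losses, hyperparameters) is illustrative,
# not the paper's implementation.
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Toy encoder-decoder standing in for a full image-to-image generator."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

# Composition net: concatenated object images (6 channels) -> composite (3 channels).
compose = TinyGenerator(in_ch=6, out_ch=3)
# Decomposition net: composite (3 channels) -> both objects back (6 channels).
decompose = TinyGenerator(in_ch=3, out_ch=6)

opt = torch.optim.Adam(
    list(compose.parameters()) + list(decompose.parameters()), lr=2e-4
)
l1 = nn.L1Loss()

# One illustrative training step on random stand-in images
# (e.g., x_a could be chairs and x_b tables).
x_a = torch.rand(4, 3, 64, 64)
x_b = torch.rand(4, 3, 64, 64)

composite = compose(torch.cat([x_a, x_b], dim=1))
recon = decompose(composite)
recon_a, recon_b = recon[:, :3], recon[:, 3:]

# Self-consistency: decomposing the composite should recover both inputs.
loss = l1(recon_a, x_a) + l1(recon_b, x_b)
opt.zero_grad()
loss.backward()
opt.step()
print(f"self-consistency loss: {loss.item():.4f}")

In a full GAN setup, the composite would additionally be scored by a discriminator so that it resembles a realistic joint scene rather than merely reconstructing its parts.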