Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation
Sai Rajeswar1,2 · Fahim Mannan3 · Florian Golemo1 · Jérôme Parent-Lévesque1 · David Vazquez2 · Derek Nowrouzezahrai4 · Aaron Courville1

Received: 15 May 2019 / Accepted: 10 March 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates, from the latent code, an explicit 2.5D surfel-based reconstruction of the scene, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, such as voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation and understanding tasks.

Keywords Computer vision · Differentiable rendering · 3D understanding · Adversarial training
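To make the four-component pipeline concrete, the sketch below shows one way the encoder, surfel decoder, differentiable renderer and critic could be wired together in PyTorch. The module architectures, the per-pixel depth/normal/albedo surfel parameterization, and the Lambertian shading used as a stand-in for the differentiable renderer are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Pix2Shape pipeline described in the abstract.
# All layer sizes, the surfel parameterization and the toy renderer are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """(i) Infers a latent 3D scene code z from a single RGB image."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, image):
        return self.net(image)


class SurfelDecoder(nn.Module):
    """(ii) Decodes z into a view-based 2.5D surfel map: per-pixel depth (1),
    normal (3) and albedo (3) channels (an assumed parameterization)."""
    def __init__(self, latent_dim=256, resolution=64):
        super().__init__()
        self.resolution = resolution
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 7 * resolution * resolution),
        )

    def forward(self, z):
        out = self.net(z).view(-1, 7, self.resolution, self.resolution)
        depth, normal, albedo = out[:, :1], out[:, 1:4], out[:, 4:]
        return depth, F.normalize(normal, dim=1), albedo


def render(depth, normal, albedo, light_dir=(0.0, 0.0, 1.0)):
    """(iii) Toy differentiable shading stand-in for the surfel renderer:
    Lambertian shading of each surfel under one directional light."""
    light = torch.tensor(light_dir).view(1, 3, 1, 1).to(normal)
    shading = (normal * light).sum(dim=1, keepdim=True).clamp(min=0.0)
    return albedo * shading


class Critic(nn.Module):
    """(iv) Scores images as real (training data) vs. rendered."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, image):
        return self.net(image)


if __name__ == "__main__":
    images = torch.randn(2, 3, 64, 64)            # stand-in training batch
    encoder, decoder, critic = Encoder(), SurfelDecoder(), Critic()
    z = encoder(images)                           # image -> latent 3D code
    depth, normal, albedo = decoder(z)            # latent -> 2.5D surfels
    fake = render(depth, normal, albedo)          # surfels -> 2D image
    print(critic(fake).shape)                     # (2, 1) critic scores
```

In the adversarial setup described above, the critic's scores on rendered versus real images would drive the training signal for the encoder and decoder, since no 3D ground truth is available.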
Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Corresponding author: Sai Rajeswar ([email protected])

1 Université de Montreal, Montreal, Canada
2 Element AI, Montreal, Canada
3 Algolux, Montreal, Canada
4 McGill University, Montreal, Canada

1 Introduction

Humans sense, plan and act in a 3D world despite only directly observing 2D projections of their 3D environment. Automatic 3D understanding seeks to recover a realistic underlying 3D structure of a scene using only 2D image projection(s). This long-standing challenge in computer vision has recently admitted learning-based solutions. Many such approaches leverage 3D supervision, such as images annotated with ground-truth 3D shape information (Girdhar et al. 2016; Wu et al. 2015, 2016b; Choy et al. 2016). Other recent approaches rely on different forms of 3D supervision, such as multiple views of the same object (Yan et al. 2016; Tulsiani et al. 2017; Li et al. 2019), 2.5D supervision (Wu et al. 2016a, 2017), key-point (Kar et al. 2014; Novotný et al. 2019) and silhouette annotations (Wiles and Z