Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation
Sai Rajeswar1,2 · Fahim Mannan3 · Florian Golemo1 · Jérôme Parent-Lévesque1 · David Vazquez2 · Derek Nowrouzezahrai4 · Aaron Courville1

Received: 15 May 2019 / Accepted: 10 March 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates, from the latent code, an explicit 2.5D surfel-based reconstruction of the scene, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, such as voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation and understanding tasks.

Keywords Computer vision · Differentiable rendering · 3D understanding · Adversarial training
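To make the four-component pipeline concrete, the sketch below shows one way the encoder, surfel decoder, differentiable renderer and critic could be wired together in PyTorch. The module architectures, the per-pixel depth/normal/albedo surfel parameterization, and the Lambertian shading used as a stand-in for the differentiable renderer are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Pix2Shape pipeline described in the abstract.
# All layer sizes, the surfel parameterization and the toy renderer are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """(i) Infers a latent 3D scene code z from a single RGB image."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, image):
        return self.net(image)


class SurfelDecoder(nn.Module):
    """(ii) Decodes z into a view-based 2.5D surfel map: per-pixel depth (1),
    normal (3) and albedo (3) channels (an assumed parameterization)."""
    def __init__(self, latent_dim=256, resolution=64):
        super().__init__()
        self.resolution = resolution
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 7 * resolution * resolution),
        )

    def forward(self, z):
        out = self.net(z).view(-1, 7, self.resolution, self.resolution)
        depth, normal, albedo = out[:, :1], out[:, 1:4], out[:, 4:]
        return depth, F.normalize(normal, dim=1), albedo


def render(depth, normal, albedo, light_dir=(0.0, 0.0, 1.0)):
    """(iii) Toy differentiable shading stand-in for the surfel renderer:
    Lambertian shading of each surfel under one directional light."""
    light = torch.tensor(light_dir).view(1, 3, 1, 1).to(normal)
    shading = (normal * light).sum(dim=1, keepdim=True).clamp(min=0.0)
    return albedo * shading


class Critic(nn.Module):
    """(iv) Scores images as real (training data) vs. rendered."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, image):
        return self.net(image)


if __name__ == "__main__":
    images = torch.randn(2, 3, 64, 64)            # stand-in training batch
    encoder, decoder, critic = Encoder(), SurfelDecoder(), Critic()
    z = encoder(images)                           # image -> latent 3D code
    depth, normal, albedo = decoder(z)            # latent -> 2.5D surfels
    fake = render(depth, normal, albedo)          # surfels -> 2D image
    print(critic(fake).shape)                     # (2, 1) critic scores
```

In the adversarial setup described above, the critic's scores on rendered versus real images would drive the training signal for the encoder and decoder, since no 3D ground truth is available.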
Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Corresponding author: Sai Rajeswar ([email protected])

1 Université de Montreal, Montreal, Canada
2 Element AI, Montreal, Canada
3 Algolux, Montreal, Canada
4 McGill University, Montreal, Canada

1 Introduction

Humans sense, plan and act in a 3D world despite only directly observing 2D projections of their 3D environment. Automatic 3D understanding seeks to recover a realistic underlying 3D structure of a scene using only 2D image projection(s). This long-standing challenge in computer vision has recently admitted learning-based solutions. Many such approaches leverage 3D supervision, such as images annotated with ground-truth 3D shape information (Girdhar et al. 2016; Wu et al. 2015, 2016b; Choy et al. 2016). Other recent approaches rely on different forms of 3D supervision, such as multiple views of the same object (Yan et al. 2016; Tulsiani et al. 2017; Li et al. 2019), 2.5D supervision (Wu et al. 2016a, 2017), key-point (Kar et al. 2014; Novotný et al. 2019) and silhouette annotations (Wiles and Z