Multi-view 3D Models from Single Images with a Convolutional Network

Abstract. We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh. The network is trained on renderings of synthetic 3D models of cars and chairs. It successfully deals with objects on cluttered backgrounds and generates reasonable predictions for real images of cars.

Keywords: 3D from single image · Deep learning · Convolutional networks

1 Introduction

The ability to infer a 3D model of an object from a single image is necessary for human-level scene understanding. Despite the large success of deep learning in computer vision and the diversity of tasks being approached, 3D representations are not yet a focus of deep networks. Can we make deep networks learn such 3D representations?

In this paper, we present a simple and elegant encoder-decoder network that infers a 3D model of an object from a single image of this object, see Fig. 1. We represent the object by what we call a “multi-view 3D model” – the set of all its views and the corresponding depth maps. Given an arbitrary viewpoint, the proposed network generates an RGB image of the object and the corresponding depth map. This representation contains rich information about the 3D geometry of the object, but allows for a more efficient implementation than voxel-based 3D models. By fusing several views from our multi-view representation we obtain a full 3D point cloud of the object, including parts invisible in the original input image.

While the task technically comes with many ambiguities, humans are known to be good at using their prior knowledge about similar objects to guess the missing information.
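To make the architecture concrete, a minimal encoder-decoder sketch is given below (in PyTorch). This is an illustration rather than the authors' exact network: all layer sizes, the 4-dimensional viewpoint encoding (e.g. sines and cosines of azimuth and elevation), and the name MultiViewNet are our own assumptions.

# Minimal sketch (not the authors' exact architecture) of an encoder-decoder
# that maps an input image plus a target viewpoint to an RGB image and a
# depth map of that view. Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class MultiViewNet(nn.Module):
    def __init__(self, code_dim=512, view_dim=4):
        super().__init__()
        # Encoder: compress the 128x128 input image into a latent code.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, code_dim), nn.ReLU(),
        )
        # Fuse the image code with the desired viewpoint encoding
        # before decoding.
        self.fuse = nn.Sequential(
            nn.Linear(code_dim + view_dim, 256 * 8 * 8), nn.ReLU(),
        )
        # Decoder: upconvolutions back to image resolution; the final layer
        # emits 4 channels = 3 for RGB + 1 for depth.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 4, 4, stride=2, padding=1),
        )

    def forward(self, image, view):
        code = self.encoder(image)                     # (B, code_dim)
        h = self.fuse(torch.cat([code, view], dim=1))  # (B, 256*8*8)
        out = self.decoder(h.view(-1, 256, 8, 8))      # (B, 4, 128, 128)
        return out[:, :3], out[:, 3:]                  # RGB, depth

Querying the same latent code with many different viewpoint vectors yields the set of views and depth maps that constitutes the multi-view 3D model.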


Fig. 1. Our network infers an object’s 3D representation from a single input image. It then predicts unseen views of this object and their depth maps. Multiple such views are fused into a full 3D point cloud, which is further optimized to obtain a mesh.

The same is achieved by the proposed network: when the input image does not allow the network to infer some part of the object – for example, because the input only shows the front view of a car and carries no information about its back – it fantasizes the most probable shape consistent with the presented data (for example, a standard sedan car).

The network is trained end-to-end on renderings of 3D models from the ShapeNet dataset [1]. We render images on the fly during network training, w
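The fusion step mentioned above is plain geometry: each predicted depth map is unprojected with its known viewpoint into a common world frame, and the resulting points are concatenated. The numpy sketch below illustrates this; the pinhole intrinsics K, the world-to-camera poses (R, t), and the helper names unproject/fuse are illustrative assumptions, not details taken from the paper.

# Sketch of the fusion step: each predicted depth map is unprojected with
# its (known) camera pose into a shared world frame; concatenating the
# results yields a full point cloud. Intrinsics and poses are assumptions.
import numpy as np

def unproject(depth, K, R, t):
    """Lift an HxW depth map to 3D world points.

    depth: HxW array of depths along the camera z-axis.
    K:     3x3 camera intrinsics; R, t: world-to-camera rotation/translation.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T        # camera-frame rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)  # scale rays by predicted depth
    valid = depth.reshape(-1) > 0          # drop background pixels
    return (pts_cam[valid] - t) @ R        # camera -> world: R^T (x - t)

def fuse(depth_maps, K, poses):
    """Merge depth maps predicted for several viewpoints into one cloud."""
    return np.concatenate(
        [unproject(d, K, R, t) for d, (R, t) in zip(depth_maps, poses)])

Since the viewpoints are the ones we query the network with, their poses are known exactly; the fused cloud can then be converted into a surface mesh with any standard surface-reconstruction method.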