Deep Learning Based 3D Vision



In So Kweon and Francois Rameau
KAIST, Daejeon, South Korea

Synonyms

Machine learning for 3D vision

Related Concepts

3D Reconstruction; CNN; Deep Learning; Multi-view Stereo; Stereo Matching; Structure-From-Motion; Visual Localization; Visual Odometry

Definition

The field of 3D vision covers a wide range of techniques developed to estimate 3D information from one or multiple images, such as the absolute or relative pose of a camera and the 3D structure of the scene. For instance, among 3D vision problems, visual odometry and visual localization consist in estimating the pose of a camera in its environment, while stereo matching and multi-view stereo aim to reconstruct the 3D structure of the scene from different viewpoints. Conventional techniques for these tasks traditionally rely on low-level image features, such as sparse keypoints or dense photometric matching. These strategies are efficient under favorable conditions but tend to be sensitive to poorly textured scenes and usually do not encapsulate contextual or semantic information. Conversely, deep learning techniques are known for their ability to extract high-level features carrying relevant and useful information. Deep learning based 3D vision takes advantage of deep neural network architectures to improve the robustness and reliability of 3D vision tasks in challenging contexts.

© Springer Nature Switzerland AG 2020
K. Ikeuchi (ed.), Computer Vision, https://doi.org/10.1007/978-3-030-03243-2_848-1
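To make the "dense photometric matching" used by conventional stereo concrete, the following is a minimal winner-takes-all matcher with a per-pixel squared-difference cost on a rectified image pair. This is a simplified sketch, not a production algorithm: real pipelines aggregate the cost over local windows and handle occlusions, which is exactly where poorly textured regions cause failures.

```python
import numpy as np

def ssd_disparity(left, right, max_disp):
    """Winner-takes-all stereo matching with a per-pixel SSD photometric cost.

    left, right: rectified grayscale images as 2D float arrays.
    Returns an integer disparity map. A pixel (y, x) in the left image is
    compared against candidates (y, x - d) in the right image for
    d = 0 .. max_disp, and the cheapest candidate wins.
    """
    h, w = left.shape
    cost = np.full((h, w, max_disp + 1), np.inf)
    for d in range(max_disp + 1):
        # left pixel at column x matches right pixel at column x - d
        cost[:, d:, d] = (left[:, d:] - right[:, : w - d]) ** 2
    return np.argmin(cost, axis=2)
```

On textured input the true disparity gives a near-zero cost and is recovered; on constant-intensity regions every candidate has the same cost, which illustrates why such photometric strategies degrade in texture-less scenes.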

Deep Visual Odometry/SLAM

Visual odometry (VO) consists in the joint estimation of the camera motion and the local structure of the scene using one or multiple cameras. Taking advantage of deep learning in the context of VO is particularly relevant for improving its robustness and applicability. Indeed, conventional techniques remain limited in their ability to reconstruct the 3D scene at metric scale (in the monocular case) and are particularly inefficient in texture-less environments, where semantic cues can play a crucial role. The first techniques applying deep learning to the relative pose estimation problem focus on motion and depth estimation between two images. This is, for instance, the case of DeMoN [1], which iteratively refines the optical flow and the motion estimate of the sensor through a succession of encoder-decoder networks. This work led to the development of multiple analogous approaches dedicated to solving motion and depth estimation in an end-to-end manner. While these techniques are efficient, they are not specifically designed for visual odometry and tend to accumulate significant drift when a long sequence of images is processed. More ad hoc solutions have been designed to cope with this drift by imposing temporal constraints in the motion estimation process. The first attempt to incorporate such an a priori is DeepVO [2], which employs a Recurrent Convolutional Neural Network (RCNN) to learn the motion dynamics of the sensor and to impose sequential in
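The drift discussed above arises because frame-to-frame VO integrates a chain of relative motion estimates, so any per-step error is propagated to every subsequent pose. The following is a minimal sketch of this dead-reckoning integration for planar (2D) poses; the function name and the (dx, dy, dtheta) parameterization are illustrative choices, not part of any cited system.

```python
import numpy as np

def integrate_relative_motions(rel_motions):
    """Chain body-frame relative motions (dx, dy, dtheta) into a
    world-frame trajectory, as a VO front end does.

    Because each step is composed onto the previous pose, an error in a
    single relative estimate shifts every later pose: this compounding
    is the drift that sequence-level constraints aim to suppress.
    """
    x, y, th = 0.0, 0.0, 0.0
    trajectory = [(x, y, th)]
    for dx, dy, dth in rel_motions:
        c, s = np.cos(th), np.sin(th)
        x += c * dx - s * dy  # rotate the body-frame step into the world frame
        y += s * dx + c * dy
        th += dth
        trajectory.append((x, y, th))
    return np.array(trajectory)
```

Methods such as DeepVO attack this by processing the image sequence with a recurrent network, so that each relative motion estimate is conditioned on the preceding frames rather than computed independently.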