Stereo Frustums: a Siamese Pipeline for 3D Object Detection
Xi Mo¹ · Usman Sajid¹ · Guanghui Wang²

¹ Department of Electrical Engineering and Computer Science, School of Engineering, University of Kansas, Lawrence, KS 66045, USA
² Department of Computer Science, Ryerson University, Toronto, ON M5B 2K3, Canada

Received: 19 July 2020 / Accepted: 27 October 2020
© Springer Nature B.V. 2020
Abstract
The paper proposes a lightweight stereo frustums matching module for 3D object detection. The proposed framework takes advantage of a high-performance 2D detector and a point cloud segmentation network to regress 3D bounding boxes for autonomous driving vehicles. Instead of performing traditional stereo matching to compute disparities, the module directly takes the 2D proposals from both the left and the right views as input. Based on the epipolar constraints recovered from the well-calibrated stereo cameras, we propose four matching algorithms to search for the best match for each proposal between the stereo image pairs. Each matching pair proposes a segmentation of the scene, which is then fed into a 3D bounding box regression network. Results of extensive experiments on the KITTI dataset demonstrate that the proposed Siamese pipeline outperforms the state-of-the-art stereo-based 3D bounding box regression methods.

Keywords Stereopsis · LiDAR · Stereo matching · Epipolar constraint · Segmentation · Amodal regression
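As a rough illustration of the epipolar-constrained proposal matching described in the abstract (a minimal sketch, not one of the paper's four matching algorithms), the snippet below greedily pairs left- and right-view 2D boxes on a rectified stereo rig: a valid match must share nearly the same vertical extent, and the right-view box must lie at a smaller horizontal position (non-negative disparity). The box format and the `max_y_shift` tolerance are assumptions for this sketch.

```python
import numpy as np

def match_stereo_proposals(left_boxes, right_boxes, max_y_shift=5.0):
    """Greedily match 2D proposals between rectified stereo views.

    Boxes are (x1, y1, x2, y2) in pixels. After rectification the
    epipolar lines are horizontal, so a true match shares nearly the
    same vertical extent, and the right-view box sits at a smaller
    (or equal) x than its left-view counterpart (non-negative
    disparity). `max_y_shift` is a hypothetical pixel tolerance.
    """
    matches, used = [], set()
    for i, lb in enumerate(left_boxes):
        best_j, best_cost = None, np.inf
        for j, rb in enumerate(right_boxes):
            if j in used:
                continue
            # Epipolar constraint: top and bottom edges must agree.
            dy = abs(lb[1] - rb[1]) + abs(lb[3] - rb[3])
            if dy > 2 * max_y_shift:
                continue
            # Non-negative disparity: right box cannot lie to the right.
            if rb[0] > lb[0] + max_y_shift:
                continue
            if dy < best_cost:
                best_j, best_cost = j, dy
        if best_j is not None:
            used.add(best_j)
            matches.append((i, best_j))
    return matches
```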
1 Introduction

How to regress accurate 3D bounding boxes (bbox) for autonomous driving vehicles has recently become a pivotal topic. This technique can also benefit mobile robots and unmanned aerial vehicles with regard to scene understanding and reasoning. In this paper, we propose a Siamese pipeline for 3D object detection. Given a pair of stereo images and the point cloud collected by a Velodyne LiDAR [5], many deep-learning-based approaches have been proposed to generate 3D bboxes, which can also be projected to a bird's-eye view (BEV) of the LiDAR data for localization evaluation. According to the number of image views utilized, these approaches can be divided into three categories: monocular-view [3, 8, 11, 16, 21, 23, 27, 28], binocular-view [2, 7, 10, 19, 26], and non-view approaches [13, 22, 25, 30–32, 35] that process only point clouds. Mono-view approaches focus on fusing camera and LiDAR sensors in either a global or a local manner, while non-view approaches extract point cloud features from hand-crafted voxels or raw coordinates. Compared to the extensive development in both categories mentioned above, there are fewer stereo-based and stereopsis-LiDAR-fusion works for 3D object detection. Regarding the runtime of stereo matching, a coarse disparity map generated by fast stereo matching with GPU acceleration achieves real-time frame rates, yet yields less accurate 3D detection results [7] than a coarse-to-fine disparity map [26]. However, it usually takes considerably longer to compute a coarse-to-fine disparity map.
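To make the BEV evaluation mentioned above concrete, here is a minimal sketch, assuming the KITTI camera convention (x right, y down, z forward), that projects a 3D box's footprint onto the ground plane; the function name and signature are illustrative, not part of the paper's pipeline.

```python
import numpy as np

def box3d_to_bev_corners(cx, cz, length, width, ry):
    """Footprint of a 3D box on the BEV (x-z ground) plane.

    Assumes the KITTI camera frame (x right, y down, z forward);
    `ry` is the yaw around the y axis. Returns a (4, 2) array of
    (x, z) corner coordinates.
    """
    half_l, half_w = length / 2.0, width / 2.0
    # Axis-aligned corners of the footprint around the box center.
    corners = np.array([
        [ half_l,  half_w],
        [ half_l, -half_w],
        [-half_l, -half_w],
        [-half_l,  half_w],
    ])
    c, s = np.cos(ry), np.sin(ry)
    rot = np.array([[c,  s],
                    [-s, c]])  # yaw about the (downward) y axis
    return corners @ rot.T + np.array([cx, cz])
```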