Patch attention network with generative adversarial model for semi-supervised binocular disparity prediction


ORIGINAL ARTICLE

Zhibo Rao1 · Mingyi He1 · Yuchao Dai1 · Zhelun Shen2

Accepted: 16 October 2020 © Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract In this paper, we address two challenges in binocular disparity estimation: (1) unsatisfactory results in occluded regions when a warping function is used in unsupervised learning; and (2) inefficiency in running time and parameter count caused by the many 3D convolutions in the feature matching module. To address these drawbacks, we propose a patch attention network for semi-supervised stereo matching. First, we employ a channel-attention mechanism that aggregates the cost volume by selecting its different surfaces, which reduces the number of 3D convolutions; we call this the patch attention network (PA-Net). Second, we use PA-Net as a generator and combine it with a traditional unsupervised learning loss and an adversarial learning model to construct a semi-supervised learning framework that improves performance in occluded areas. We train PA-Net in supervised, semi-supervised, and unsupervised settings. Extensive experiments show that (1) our semi-supervised learning framework can overcome the drawbacks of unsupervised learning and significantly improve performance in ill-posed regions using only a few or inaccurate ground truths; and (2) in supervised learning, PA-Net can outperform other state-of-the-art approaches while using fewer parameters.

Keywords Binocular disparity estimation · Semi-supervised learning · Patch attention mechanism · Generative adversarial model

Zhibo Rao [email protected] · Mingyi He [email protected]

1 Northwestern Polytechnical University, Xian 710129, China
2 Peking University, Beijing 100871, China

1 Introduction

Stereo matching is a fundamental problem in computer vision, with applications such as autonomous driving [26,41], robot navigation [3,31,38], and 3D reconstruction [13,20]. It aims to estimate a disparity map by matching pixels between a pair of rectified images [22]. Following the groundbreaking work in deep learning, current state-of-the-art stereo matching methods employ deep convolutional neural networks (CNNs) to regress a dense disparity map [4,8,18]. From the perspective of network structure, such a model can be decomposed into three modules: feature extraction, feature matching, and disparity regression [5,50]. Among them, the feature matching module is crucial for obtaining accurate disparity estimates. In recent years, 3D convolution has often been used to model the relationship among the disparity, height, width, and feature dimensions [4,18]. The reported results indicate that 3D convolution can enhance the geometry learning ability and improve matching accuracy in occluded regions. However, it also raises the computation cost, because a large number of 3D convolution operations are used to down-sampli