Gated Bi-directional CNN for Object Detection
1 The Chinese University of Hong Kong, Hong Kong, China {xyzeng,wlouyang,xgwang}@ee.cuhk.edu.hk
2 Sensetime Group Limited, Sha Tin, Hong Kong {yangbin,yanjunjie}@sensetime.com
Abstract. The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. How to effectively integrate local and contextual visual cues from these regions has become a fundamental problem in object detection. Most existing works simply concatenate features or scores obtained from support regions. In this paper, we propose a novel gated bi-directional CNN (GBD-Net) to pass messages between features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate the existence of each other by learning their nonlinear relationships, and their close interactions are modeled in a much more complex way. It is also shown that message passing is not always helpful, depending on individual samples. Gated functions are therefore introduced to control message transmission, and their on-or-off states are controlled by extra visual evidence from the input sample. GBD-Net is implemented under the Fast RCNN detection framework. Its effectiveness is shown through experiments on three object detection datasets: ImageNet, Pascal VOC2007 and Microsoft COCO.
1 Introduction
Object detection is one of the fundamental vision problems. It provides basic information for semantic understanding of images and videos and has attracted a lot of attention. Detection is regarded as a problem of classifying candidate boxes. Due to large variations in viewpoints, poses, occlusions, lighting conditions and background, object detection is challenging. Recently, convolutional neural networks (CNNs) have been proved to be effective for object detection [1–4] because of their power in learning features. In object detection, a candidate box is counted as a true positive for an object category if the intersection-over-union (IOU) between the candidate box and the ground-truth box is greater than a threshold. When a candidate box covers only a part of the ground-truth region, there are some potential problems.
- Visual cues in this candidate box may not be sufficient to distinguish object categories. Take the candidate boxes in Fig. 1(a) for example: they cover parts of bodies and have similar visual cues, but have different ground-truth class labels. It is hard to distinguish their class labels without information from larger surrounding regions of the candidate boxes.
- Classification on the candidate boxes depends on the occlusion status, which has to be inferred from larger surrounding regions. Because of occlusion, the cand
© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VII, LNCS 9911, pp. 354–369, 2016. DOI: 10.1007/978-3-319-46478-7_22
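The true-positive criterion stated in the Introduction (IOU between a candidate box and the ground-truth box greater than a threshold) can be made concrete with a minimal sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and a threshold of 0.5. Both are illustrative assumptions, since the text only says "a threshold", and the names `iou` and `is_true_positive` are hypothetical helpers, not part of GBD-Net.

```python
from typing import Tuple

# Assumed box convention (not specified in the text): (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the overlap rectangle; clamp to zero if the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


def is_true_positive(candidate: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """True if the IoU exceeds the threshold (0.5 is an assumed value)."""
    return iou(candidate, ground_truth) > threshold


ground_truth = (0.0, 0.0, 100.0, 200.0)
partial_box = (0.0, 0.0, 100.0, 150.0)  # covers only the upper three quarters
half_box = (0.0, 0.0, 100.0, 100.0)     # covers only the upper half
print(iou(partial_box, ground_truth), is_true_positive(partial_box, ground_truth))
print(iou(half_box, ground_truth), is_true_positive(half_box, ground_truth))
```

Under these assumptions, the candidate that covers three quarters of the ground-truth box still counts as a positive even though part of the object lies outside it. This is the situation the Introduction discusses, where the cues inside the box alone may be insufficient.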