A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection

SVCL, UC San Diego, San Diego, USA {zwcai,nuno}@ucsd.edu
IBM T. J. Watson Research, Yorktown Heights, USA {qfan,rsferi}@us.ibm.com

Abstract. A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is proposed for fast multi-scale object detection. The MS-CNN consists of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to produce a strong multi-scale object detector. The unified network is learned end-to-end, by optimizing a multi-task loss. Feature upsampling by deconvolution is also explored, as an alternative to input upsampling, to reduce the memory and computation costs. State-of-the-art object detection performance, at up to 15 fps, is reported on datasets, such as KITTI and Caltech, containing a substantial number of small objects.

Keywords: Object detection · Multi-scale · Unified neural network

1 Introduction

Classical object detectors, based on the sliding window paradigm, search for objects at multiple scales and aspect ratios. While real-time detectors are available for certain classes of objects, e.g. faces or pedestrians [1,2], it has proven difficult to build detectors of multiple object classes under this paradigm. Recently, there has been interest in detectors derived from deep convolutional neural networks (CNNs) [3–7]. While these have shown much greater ability to address the multiclass problem, less progress has been made towards the detection of objects at multiple scales. The R-CNN [3] samples object proposals at multiple scales, using a preliminary attention stage [8], and then warps these proposals to the size (e.g. 224 × 224) supported by the CNN. This is, however, very inefficient from a computational standpoint. The development of an effective and computationally efficient region proposal mechanism is still an open problem.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part IV, LNCS 9908, pp. 354–370, 2016. DOI: 10.1007/978-3-319-46493-0_22

Fig. 1. In natural images, objects can appear at very different scales, as illustrated by the yellow bounding boxes. A single receptive field, such as that of the RPN [9] (shown in the shaded area), cannot match this variability.

The more recent Faster-RCNN [9] addresses the issue with a region proposal network (RPN), which enables end-to-end training. However, the RPN generates proposals of multiple scales by sliding a fixed set of filters over a fixed set of convolutional feature maps. This creates an inconsistency between the sizes of objects, which are variable, and filter receptive fields, which are fixed. As shown in Fig. 1, a fixed receptive field cannot cover the multiple scales at which objects appear in natural scenes. This compromises detection performance, which tends to be particularly poor for small objects, like that in the center of Fig. 1. In