SSD: Single Shot MultiBox Detector

We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. A

  • PDF / 10,276,204 Bytes
  • 17 Pages / 439.37 x 666.142 pts Page_size
  • 72 Downloads / 189 Views

DOWNLOAD

REPORT


4

UNC Chapel Hill, Chapel Hill, USA {wliu,cyfu,aberg}@cs.unc.edu 2 Zoox Inc., Palo Alto, USA [email protected] 3 Google Inc., Mountain View, USA {dumitru,szegedy}@google.com University of Michigan, Ann-Arbor, USA [email protected]

Abstract. We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves 76.9 % mAP, outperforming a comparable state of the art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at https://github.com/weiliu89/caffe/ tree/ssd.

Keywords: Real-time object detection

1

· Convolutional neural network

Introduction

Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and c Springer International Publishing AG 2016  B. Leibe et al. (Eds.): ECCV 2016, Part I, LNCS 9905, pp. 21–37, 2016. DOI: 10.1007/978-3-319-46448-0 2

22

W. Liu et al.

apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work [1] through the current leading results on PASCAL VOC, COCO, and ILSVRC detection all based on Faster R-CNN [2] albeit with deeper features such as [3]. While accurate, these approaches have been too computationally intensive for embedded systems and, even with highend hardware, too slow for real-time applications. Often detection speed for these approaches is measured in frames per second, and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline (see related work in Sect. 4), but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy. Thi