MARS: A Video Benchmark for Large-Scale Person Re-Identification


1 Tsinghua University, Beijing, China [email protected], [email protected]
2 Microsoft Research, Beijing, China
3 UTSA, San Antonio, USA
4 Peking University, Beijing, China

Abstract. This paper considers person re-identification (re-id) in videos. We introduce a new video re-id dataset, named Motion Analysis and Re-identification Set (MARS), a video extension of the Market-1501 dataset. To our knowledge, MARS is the largest video re-id dataset to date. Containing 1,261 IDs and around 20,000 tracklets, it provides rich visual information compared to image-based datasets. MARS is also a step closer to practice: the tracklets are automatically generated using the Deformable Part Model (DPM) as the pedestrian detector and the GMMCP tracker. A number of false detection/tracking results are also included as distractors, which would exist predominantly in practical video databases. We present an extensive evaluation of state-of-the-art methods, including space-time descriptors and CNN. We show that a CNN in classification mode can be trained from scratch using the consecutive bounding boxes of each identity. The learned CNN embedding outperforms other competing methods considerably and generalizes well to other video re-id datasets upon fine-tuning.

Keywords: Video person re-identification · Motion features · CNN

1 Introduction

Person re-identification, as a promising way towards automatic VIDEO surveillance, has been mostly studied in pre-defined IMAGE bounding boxes (bbox). Impressive progress has been observed with image-based re-id. However, the rich information contained in video sequences (or tracklets) remains under-explored. When generating a video database, pedestrian detectors [11] and offline trackers [7] are readily available, so it is natural to extract tracklets instead of single (or multiple) bboxes. This paper, among a few contemporary works [25,29,36,38,41], makes an initial attempt at video-based re-identification.

The dataset and code are available at http://www.liangzheng.com.cn.

© Springer International Publishing AG 2016. B. Leibe et al. (Eds.): ECCV 2016, Part VI, LNCS 9910, pp. 868–884, 2016. DOI: 10.1007/978-3-319-46466-4_52


With respect to the “probe-to-gallery” pattern, there are four re-id strategies: image-to-image, image-to-video, video-to-image, and video-to-video. Among them, the first mode is the most studied in the literature, and previous image-based re-id methods [5,24,35] are designed to cope with the limited amount of training data. The second mode can be viewed as a special case of “multi-shot” re-id, and the third involves multiple queries. Intuitively, the video-to-video pattern, which is our focus in this paper, is more favorable because both probe and gallery units contain much richer visual information than single images. Empirical evidence confirms that the video-to-video strategy is superior to the others (Fig. 3). Currently, a few video re-id datasets exist [4,15,28,36]. They are limited