Unsupervised Deep Representation Learning for Real-Time Tracking



Ning Wang1 · Wengang Zhou1,2 · Yibing Song3 · Chao Ma4 · Wei Liu3 · Houqiang Li1,2

Received: 17 December 2019 / Accepted: 9 July 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Deep learning models have continuously advanced visual tracking. Typically, these models are trained with supervised learning on expensive labeled data. To reduce the workload of manual annotation and to learn to track arbitrary objects, we propose an unsupervised learning method for visual tracking. Our motivation is that a robust tracker should be effective in bidirectional tracking: the tracker is able to forward localize a target object in successive frames and backtrace to the target's initial position in the first frame. Based on this motivation, in the training process we measure the consistency between the forward and backward trajectories to learn a robust tracker from scratch using only unlabeled videos. We build our framework on a Siamese correlation filter network, and propose a multi-frame validation scheme and a cost-sensitive loss to facilitate unsupervised learning. Without bells and whistles, the proposed unsupervised tracker achieves the baseline accuracy of classic fully supervised trackers while running at a real-time speed. Furthermore, our unsupervised framework shows potential for leveraging more unlabeled or weakly labeled data to further improve tracking accuracy.

Keywords Visual tracking · Unsupervised learning · Correlation filter · Siamese network
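The forward-backward consistency idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (which operates on a Siamese correlation filter network); it is a conceptual toy in which `step` stands for a hypothetical single-step tracker that predicts the target position in a frame given the previous position:

```python
import numpy as np

def forward_backward_loss(initial_pos, step, frames):
    """Conceptual sketch of the forward-backward consistency objective.

    `step(frame, pos)` is a hypothetical single-step tracker. We track
    forward from the first frame to the last, backtrace to the first
    frame, and penalize the squared distance between the backtraced
    position and the initial one. Minimizing this loss over unlabeled
    videos is the core idea of the unsupervised training signal.
    """
    pos = np.asarray(initial_pos, dtype=float)
    # Forward tracking through the successive frames.
    for frame in frames[1:]:
        pos = step(frame, pos)
    # Backward tracking to the first frame.
    for frame in reversed(frames[:-1]):
        pos = step(frame, pos)
    # Consistency loss: round-trip position vs. initial position.
    return float(np.sum((pos - np.asarray(initial_pos, dtype=float)) ** 2))
```

A perfectly consistent tracker (e.g. the identity map on a static target) incurs zero loss, whereas a tracker that drifts by a constant offset per step accumulates error over the round trip, which is exactly the signal the unsupervised objective exploits.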

1 Introduction

Communicated by Mei Chen, Cha Zhang and Katsushi Ikeuchi.

Wengang Zhou [email protected]
Houqiang Li [email protected]
Ning Wang [email protected]
Yibing Song [email protected]
Chao Ma [email protected]
Wei Liu [email protected]

1 The CAS Key Laboratory of GIPAS, University of Science and Technology of China, Hefei, China
2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei, China
3 Tencent AI Lab, Shenzhen, China
4 The MOE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Visual object tracking is a fundamental task in computer vision with numerous applications, including video surveillance, autonomous driving, augmented reality, and human-computer interaction. It aims to localize a moving object annotated with a bounding box in the initial frame. Recently, deep models have improved tracking accuracy by strengthening the feature representations (Ma et al. 2015; Danelljan et al. 2016, 2017) or optimizing networks end-to-end (Bertinetto et al. 2016; Li et al. 2018; Nam and Han 2016; Valmadre et al. 2017). These models are offline pretrained with full supervision, which requires a large number of annotated ground-truth labels during the training stage. Manual annotations are always expensive and time-consuming, whereas a huge number of unlabeled videos are readily