Siamese network for real-time tracking with action-selection



ORIGINAL RESEARCH PAPER

Zhuoyi Zhang1 · Yifeng Zhang1,2,3 · Xu Cheng4 · Ke Li1

Received: 22 May 2019 / Accepted: 10 October 2019
© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract

Most deep-learning-based trackers locate targets accurately but at the cost of a time-consuming training phase. In this paper we present a new tracker based on the Siamese network that can be implemented with low computational resources. The proposed tracker tracks targets accurately using a fine-tuned model that is convenient to train. During tracking, we apply a new sampling method, called action-selection, that is independent of training; it samples selectively and flexibly, step by step with a variable stride, which yields bounding boxes with varied aspect ratios. Evaluation on online tracking benchmarks shows that our tracker achieves higher accuracy than most traditional trackers. In addition, our tracker operates at frame rates beyond real time.

Keywords Computer vision · Object tracking · Siamese network

* Zhuoyi Zhang [email protected]

1 School of Information Science and Engineering, Southeast University, Nanjing 210096, China
2 Nanjing Institute of Communications Technologies, Nanjing 211100, China
3 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
4 School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China

1 Introduction

The aim of tracking is to infer the target’s position in incoming frames given its position in the first frame. During tracking, many factors may interfere with this inference, such as occlusion, scale change, uneven illumination, and camera motion [1, 2]. To track objects accurately, it is therefore necessary to build a tracker that can cope with these factors. Since we are given the object’s location only in the first frame, and its position in all following frames is predicted, we regard only the initial-frame target location as completely accurate. Thus, we use only the object’s bounding box in the initial frame as the template for subsequent tracking and match it against the candidates in the following frames. A matching function that can reflect the similarity between the candidates and the template is therefore needed. In recent years, convolutional neural networks have become popular for end-to-end learning of image representations, from which we can obtain semantic information about targets [3–5]. As a Siamese network is insensitive to appearance changes, we use it as a matching function based on convolutional neural networks [6–8]. Some Siamese-network-based trackers perform excellently at capturing objects, but they either consume too much time in training [9, 10] or need abundant data to train [11, 12]. As a result, i