Fully-Convolutional Siamese Networks for Object Tracking
Abstract. The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
Keywords: Object-tracking · Siamese-network · Similarity-learning · Deep-learning

1 Introduction
We consider the problem of tracking an arbitrary object in video, where the object is identified solely by a rectangle in the first frame. Since the algorithm may be requested to track any arbitrary object, it is impossible to have already gathered data and trained a specific detector. For several years, the most successful paradigm for this scenario has been to learn a model of the object’s appearance in an online fashion using examples extracted from the video itself [1]. This owes in large part to the demonstrated ability of methods like TLD [2], Struck [3] and KCF [4]. However, a clear deficiency of using data derived exclusively from the current video is that only comparatively simple models can be learnt. While other problems in computer vision have seen an increasingly pervasive adoption of deep convolutional networks (conv-nets) trained from large supervised datasets, the scarcity of supervised data and the constraint of real-time operation prevent the naive application of deep learning within this paradigm of learning a detector per video.

Several recent works have aimed to overcome this limitation using a pretrained deep conv-net that was learnt for a different but related task. These approaches either apply “shallow” methods (e.g. correlation filters) using the network’s internal representation as features [5,6] or perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network [7–9]. While the use of shallow methods does not take full advantage of the benefits of end-to-end learning, methods that apply SGD during tracking to achieve state-of-the-art results have not been able to operate in real-time. We advocate an alternative app

The first two authors contributed equally, and are listed in alphabetical order.
© Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part II, LNCS 9914, pp. 850–865, 2016. DOI: 10.1007/978-3-319-48881-3_56