A strong feature representation for siamese network tracker
Zhipeng Zhou 1,2 · Rui Zhang 1,2 · Dong Yin 1,2
Received: 15 August 2019 / Revised: 1 June 2020 / Accepted: 4 June 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Because AlexNet is too shallow to form a strong feature representation, trackers based on the Siamese network show an accuracy gap compared with state-of-the-art algorithms. Both deep features and appearance features benefit tracking accuracy. To combine these two kinds of features, first, a modified pre-trained VGG16 network is fine-tuned as one branch of the backbone network. Secondly, an AlexNet branch is attached after the third convolutional layer of VGG16, so that the response maps from both branches are merged to form a preliminary strong feature representation with deep features and shallow appearance features. Thirdly, a new mean Peak-to-side ratio (mPSR) loss is designed to help the network learn target features adaptively. A channel attention block and the Average-Peak-to-Correlation Energy (APCE) are designed to help select contributing features and suppress distractors. The proposed tracker, SiamPF, uses only ILSVRC2015-VID as its training dataset, but it achieves excellent performance on OTB-2013, OTB-2015, VOT2015, VOT2016, and VOT2017 while maintaining real-time performance of 41 FPS on a GTX 1080Ti.
Keywords Siamese network · Feature representation · mPSR
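The abstract names APCE without defining it in this excerpt. As a minimal sketch, assuming the definition commonly used in the tracking literature, APCE = |F_max - F_min|^2 / mean((F - F_min)^2) over a response map F, the score can be computed as below. The function name `apce` and the toy response maps are illustrative assumptions, not the authors' code, and how SiamPF applies the score is not shown here.

```python
def apce(response):
    """Average peak-to-correlation energy of a 2-D response map.

    `response` is a list of rows (hypothetical input format for this sketch).
    Assumed definition: |F_max - F_min|^2 / mean((F - F_min)^2).
    """
    values = [v for row in response for v in row]
    f_max, f_min = max(values), min(values)
    # Mean squared deviation of every response value from the minimum.
    energy = sum((v - f_min) ** 2 for v in values) / len(values)
    if energy == 0:
        return 0.0  # completely flat map: no peak to measure
    return (f_max - f_min) ** 2 / energy


# Toy maps: one clean peak versus a peak plus an equally strong distractor.
single_peak = [[0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0]]
with_distractor = [[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]]

print(apce(single_peak), apce(with_distractor))
```

Under this definition the single-peak map scores higher than the map with a distractor, which is why a score of this kind is often used as a confidence measure for a response map.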
Dong Yin [email protected]
1 School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, China
2 Key Laboratory of Electromagnetic Space Information of CAS, Hefei, Anhui 230027, China
Multimedia Tools and Applications
1 Introduction Visual tracking is a fundamental topic in computer vision. It can be divided into two subtopics based on the target: single object tracking and multiple object tracking [31]. Many single object tracking methods have been studied in recent years. They are mainly based on either the correlation filter framework or the deep learning framework. The correlation filter was introduced to computer vision by David S. Bolme [3], who proposed a correlation-filter-based tracker named MOSSE. J. F. Henriques proposed a method called CSK [19], which built on MOSSE by introducing dense sampling and the kernel trick. Furthermore, he incorporated multi-channel HOG features into KCF [20], an enhanced version of CSK. Similarly, M. Danelljan [5] extended CSK with multi-channel color names (CN) features. Due to their good performance, HOG and CN have become the most popular hand-crafted features in recent years. However, hand-crafted features are not suitable for all targets, which limits the performance of these trackers. Thus, leveraging data-driven features seems to be a better way to represent targets. By incorporating features extracted from CNNs, correlation-filter-based methods such as DeepSRDCF [6], C-COT [9], and ECO [10] achieve higher accuracy. On the other hand, the trackers mentioned above require complex setup and heavy computation that could hardly meet