CNN-based single object detection and tracking in videos and its application to drone detection
Dong-Hyun Lee

Received: 21 December 2019 / Revised: 30 June 2020 / Accepted: 17 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

This paper presents convolutional neural network (CNN)-based single object detection and tracking algorithms. CNN-based object detection methods are directly applicable to static images, but not to videos. On the other hand, model-free visual object tracking methods cannot detect an object until a ground truth bounding box of the target is provided. Moreover, many annotated video datasets of the target object are required to train both object detectors and visual trackers. In this work, three simple yet effective object detection and tracking algorithms for videos are proposed. They efficiently combine a state-of-the-art object detector and a visual tracker for circumstances in which only a few static images of the target are available for training. The proposed algorithms are tested on a drone detection task, and the experimental results demonstrate their effectiveness.

Keywords Object detection · Object tracking · Convolutional neural network · Drone detection
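To illustrate why a detector and a tracker complement each other, the following minimal sketch shows one generic detect-then-track loop. It is not one of the three algorithms proposed in this paper. The names `Detection`, `detect_then_track`, `detect`, `make_tracker`, and `min_score` are hypothetical and chosen only for illustration, and the detector and tracker are assumed to be supplied by the caller as plain callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Optional, Tuple

# Bounding box as (x, y, width, height); this layout is an assumption.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    """A single detector output: a box and its confidence score."""
    box: Box
    score: float


def detect_then_track(
    frames: Iterable[Any],
    detect: Callable[[Any], Optional[Detection]],
    make_tracker: Callable[[Any, Box], Callable[[Any], Optional[Box]]],
    min_score: float = 0.5,
) -> Iterator[Optional[Box]]:
    """Yield one box (or None) per frame.

    The detector runs only while no tracker is active. A confident
    detection initializes a tracker, which then follows the target
    until it reports a loss by returning None.
    """
    tracker = None
    for frame in frames:
        box = None
        if tracker is not None:
            box = tracker(frame)
            if box is None:
                tracker = None  # target lost: fall back to detection
        if tracker is None:
            det = detect(frame)
            if det is not None and det.score >= min_score:
                box = det.box
                tracker = make_tracker(frame, det.box)
        yield box
```

In this sketch the detector supplies the initial bounding box that a model-free tracker would otherwise need from a human annotator, and the tracker carries the target across frames in which the detector is not run.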
Dong-Hyun Lee
Department of IT Convergence Engineering, Electronic Engineering, Kumoh National Institute of Technology, Gumi, Gyeongbuk, Korea

Multimedia Tools and Applications

1 Introduction

Image processing and computer vision are core technologies for various applications such as visual inspection, surveillance, self-driving vehicles, and robotic systems [8, 9, 20, 21]. In particular, object detection in static images and object tracking in videos have recently gained attention in computer vision due to the emergence of deep convolutional neural networks (CNNs). Deep CNN-based object detection algorithms, such as VGG, GoogLeNet, and YOLOv3, have recently shown successful results in object detection tasks [26, 31, 33]. VGG stacks a large number of weight layers with small receptive fields and more nonlinear rectification layers, which decreases the number of parameters and increases non-linearity [31]. GoogLeNet introduces the inception module with different kernel sizes to reduce computational expense and extract various kinds of feature maps [33]. YOLOv3, an improved version of YOLOv1 and v2, uses Darknet-53 and residual skip connections [24–26]. It uses multiple feature maps at three different scales for better detection of objects of various sizes, which is useful in size-varying object detection applications such as anti-drone systems [30].

These deep CNN-based object detection algorithms are widely used in many applications such as action and gesture recognition [17, 37]. They provide excellent results for object detection in static images, but not in videos, as they are not trained to deal with motion blur, illumination changes, and temporal information in videos. The model-free visual object tracking (VOT) task is another challenging area that aim