A part-based spatial and temporal aggregation method for dynamic scene recognition


S.I.: DICTA 2019

A part-based spatial and temporal aggregation method for dynamic scene recognition

Xiaoming Peng1,2 • Abdesselam Bouzerdoum1,3 • Son Lam Phung1

Received: 10 February 2020 / Accepted: 5 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Existing methods for dynamic scene recognition mostly use global features extracted from the entire video frame or from a video segment. In this paper, a part-based method is proposed to aggregate local features from video frames. A pre-trained Fast R-CNN model is used to extract local convolutional features from the regions of interest of training images. These features are clustered to locate representative parts. A set cover problem is then formulated to select the discriminative parts, which are further refined by fine-tuning the Fast R-CNN model. Local features from a video segment are extracted at different layers of the fine-tuned Fast R-CNN model and aggregated both spatially and temporally. Extensive experimental results show that the proposed method is very competitive with state-of-the-art approaches.

Keywords Dynamic scene recognition • Feature aggregation • Deep neural networks • Part-based models
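
To make the part-discovery step summarized above concrete, the Python sketch below clusters local ROI features into candidate parts and then picks a discriminative subset with a greedy set-cover heuristic. It is a minimal illustration under assumed details, not the authors' implementation: the feature dimensionality, the cosine-similarity coverage criterion, the threshold sim_thresh, and the use of scikit-learn's KMeans are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans


def discover_candidate_parts(roi_features, n_candidates=50, seed=0):
    """Cluster local ROI features; each cluster centre is a candidate part."""
    km = KMeans(n_clusters=n_candidates, n_init=10, random_state=seed)
    km.fit(roi_features)
    return km.cluster_centers_


def greedy_set_cover(parts, image_features, sim_thresh=0.2):
    """Greedily select parts so that every training image is 'covered' by
    (i.e., sufficiently similar to) at least one selected part."""
    # L2-normalise so that the dot product equals the cosine similarity.
    p = parts / np.linalg.norm(parts, axis=1, keepdims=True)
    f = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    covers = (p @ f.T) >= sim_thresh  # covers[i, j]: part i covers image j
    uncovered = np.ones(f.shape[0], dtype=bool)
    selected = []
    while uncovered.any():
        gains = (covers & uncovered).sum(axis=1)  # newly covered images per part
        best = int(gains.argmax())
        if gains[best] == 0:  # remaining images cannot be covered by any part
            break
        selected.append(best)
        uncovered &= ~covers[best]
    return parts[selected]


# Toy usage with random vectors standing in for Fast R-CNN features.
rng = np.random.default_rng(0)
rois = rng.normal(size=(1000, 128))  # local features from training ROIs
imgs = rng.normal(size=(200, 128))   # one pooled feature per training image
parts = greedy_set_cover(discover_candidate_parts(rois), imgs)
print(f"selected {len(parts)} discriminative parts")

The greedy strategy is the standard approximation algorithm for set cover; in the paper, the selected parts are further refined by fine-tuning the Fast R-CNN model.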

This submission is the extended version of a paper published at the 2019 International Conference on Digital Image Computing: Techniques and Applications (DICTA'19). Compared with the conference paper, this submission describes the related work and the proposed method in much greater detail and provides much more extensive experimental results.

✉ Xiaoming Peng
[email protected]

Abdesselam Bouzerdoum
[email protected]

Son Lam Phung
[email protected]

1 School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, Australia

2 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

3 Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar

1 Introduction

The task of scene recognition (also called categorization or classification) aims to recognize the semantic label of a given scene, e.g., a beach, an indoor environment, or a city street. Significant effort has been devoted to scene recognition using static images [1–3]. Recently, more attention has turned to video-based scene recognition, with the view that the temporal information embedded in videos can assist visual recognition [4–15]. Here, we distinguish dynamic scene recognition from dynamic texture recognition [16] and human activity recognition [17]. Dynamic scenes differ from dynamic textures in two respects [12]. First, dynamic scenes are natural scenes evolving over time, whereas dynamic textures contain richer texture information. Second, videos of dynamic scenes may contain significant camera movements, while videos of dynamic textures are usually more stable. Dynamic scene recognition also differs from human activity recognition in that the aim of the former is to recognize the scene category itself rather than the human activities taking place in it.