A part-based spatial and temporal aggregation method for dynamic scene recognition


S.I.: DICTA 2019

A part-based spatial and temporal aggregation method for dynamic scene recognition

Xiaoming Peng1,2 • Abdesselam Bouzerdoum1,3 • Son Lam Phung1

Received: 10 February 2020 / Accepted: 5 October 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

Abstract

Existing methods for dynamic scene recognition mostly use global features extracted from the entire video frame or from a video segment. In this paper, a part-based method is proposed to aggregate local features from video frames. A pre-trained Fast R-CNN model is used to extract local convolutional features from the regions of interest of training images. These features are clustered to locate representative parts. A set cover problem is then formulated to select the discriminative parts, which are further refined by fine-tuning the Fast R-CNN model. Local features from a video segment are extracted at different layers of the fine-tuned Fast R-CNN model and aggregated both spatially and temporally. Extensive experimental results show that the proposed method is very competitive with state-of-the-art approaches.

Keywords Dynamic scene recognition • Feature aggregation • Deep neural networks • Part-based models
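
To make the part-discovery step summarized above concrete, the Python sketch below clusters local ROI features into candidate parts and then picks a discriminative subset with a greedy set-cover heuristic. It is a minimal illustration under assumed details, not the authors' implementation: the feature dimensionality, the cosine-similarity coverage criterion, the threshold sim_thresh, and the use of scikit-learn's KMeans are all illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans


def discover_candidate_parts(roi_features, n_candidates=50, seed=0):
    """Cluster local ROI features; each cluster centre is a candidate part."""
    km = KMeans(n_clusters=n_candidates, n_init=10, random_state=seed)
    km.fit(roi_features)
    return km.cluster_centers_


def greedy_set_cover(parts, image_features, sim_thresh=0.2):
    """Greedily select parts so that every training image is 'covered' by
    (i.e., sufficiently similar to) at least one selected part."""
    # L2-normalise so that the dot product equals the cosine similarity.
    p = parts / np.linalg.norm(parts, axis=1, keepdims=True)
    f = image_features / np.linalg.norm(image_features, axis=1, keepdims=True)
    covers = (p @ f.T) >= sim_thresh  # covers[i, j]: part i covers image j
    uncovered = np.ones(f.shape[0], dtype=bool)
    selected = []
    while uncovered.any():
        gains = (covers & uncovered).sum(axis=1)  # newly covered images per part
        best = int(gains.argmax())
        if gains[best] == 0:  # remaining images cannot be covered by any part
            break
        selected.append(best)
        uncovered &= ~covers[best]
    return parts[selected]


# Toy usage with random vectors standing in for Fast R-CNN features.
rng = np.random.default_rng(0)
rois = rng.normal(size=(1000, 128))  # local features from training ROIs
imgs = rng.normal(size=(200, 128))   # one pooled feature per training image
parts = greedy_set_cover(discover_candidate_parts(rois), imgs)
print(f"selected {len(parts)} discriminative parts")

The greedy strategy is the standard approximation algorithm for set cover; in the paper, the selected parts are further refined by fine-tuning the Fast R-CNN model.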

This submission is the extended version of a paper published at the 2019 International Conference on Digital Image Computing: Techniques and Applications (DICTA'19). Compared with the conference paper, this submission describes the related work and the proposed method in much greater detail and provides much more extensive experimental results.

✉ Xiaoming Peng
[email protected]

Abdesselam Bouzerdoum
[email protected]

Son Lam Phung
[email protected]

1 School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Wollongong, Australia

2 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

3 Division of Information and Computing Technology, College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar

1 Introduction

The task of scene recognition (also called categorization or classification) aims to recognize the semantic label of a given scene, e.g., a beach, an indoor environment, or a city street. Significant effort has been devoted to scene recognition using static images [1–3]. Recently, more attention has turned to video-based scene recognition, with the view that the temporal information embedded in videos can assist visual recognition [4–15]. Here, we distinguish dynamic scene recognition from dynamic texture recognition [16] and human activity recognition [17]. Dynamic scenes differ from dynamic textures in two respects [12]. First, dynamic scenes are natural scenes evolving over time, whereas dynamic textures contain richer texture information. Second, videos of dynamic scenes may contain significant camera movements, while videos of dynamic textures are usually more stable. Dynamic scene recognition also differs from human activity recognition in that the aim of the former is to recognize the scene category itself rather than the human activities taking place in it.