DeepVS2.0: A Saliency-Structured Deep Learning Method for Predicting Dynamic Visual Attention
Lai Jiang¹,² · Mai Xu¹ · Zulin Wang¹ · Leonid Sigal²

Received: 30 July 2018 / Accepted: 12 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Deep neural networks (DNNs) have achieved great success in image saliency prediction, yet few works apply DNNs to saliency prediction for generic videos. In this paper, we propose a novel DNN-based video saliency prediction method, called DeepVS2.0. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train DNN models for predicting video saliency. Through a statistical analysis of LEDOV, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) in DeepVS2.0 that learns spatio-temporal features for predicting intra-frame saliency by exploiting information about both objectness and object motion. We further find from our database that human attention is temporally correlated, with smooth saliency transitions across video frames. Therefore, a saliency-structured convolutional long short-term memory network (SS-ConvLSTM) is developed in DeepVS2.0 to predict inter-frame saliency, using the features extracted by OM-CNN as input. Moreover, center-bias dropout and a sparsity-weighted loss are embedded in SS-ConvLSTM to account for the center bias and sparsity of human attention maps. Finally, experimental results show that our DeepVS2.0 method advances the state of the art in video saliency prediction.

Keywords Deep neural networks · Saliency prediction · Convolutional LSTM · Eye-tracking database · Video · Video database
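The abstract describes a two-stage architecture: OM-CNN extracts per-frame spatio-temporal features, and SS-ConvLSTM consumes those features to model the smooth inter-frame saliency transition. The following PyTorch sketch is a rough illustration only: it wires a generic per-frame feature extractor (a stand-in for OM-CNN) into a standard ConvLSTM cell. All module names, channel sizes, and the readout are assumptions, and the paper's center-bias dropout and sparsity-weighted loss are omitted.

```python
# Minimal sketch of a per-frame CNN feeding a ConvLSTM for video saliency.
# Illustrative assumptions throughout; not the paper's exact architecture.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard convolutional LSTM cell: all four gates from one conv."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# Stand-in for OM-CNN: any per-frame spatio-temporal feature extractor.
om_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
cell = ConvLSTMCell(in_ch=32, hid_ch=32)
readout = nn.Conv2d(32, 1, 1)  # hidden state -> single-channel saliency map

frames = torch.rand(8, 1, 3, 64, 64)  # T x B x C x H x W video clip
h = c = torch.zeros(1, 32, 64, 64)
saliency = []
for frame in frames:                            # inter-frame recurrence
    feats = om_cnn(frame)                       # intra-frame features
    h, (h, c) = cell(feats, (h, c))
    saliency.append(torch.sigmoid(readout(h)))  # one map per frame
```

The recurrence over frames is what lets the predicted maps evolve smoothly in time, matching the temporal correlation of attention that the abstract reports.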
Communicated by Antonio Torralba.

This work was supported by NSFC Projects 61922009, 61876013, and 61573037.

Electronic supplementary material The online version of this article (https://doi.org/10.1007/s11263-020-01371-6) contains supplementary material, which is available to authorized users.
Corresponding author: Mai Xu ([email protected])
Lai Jiang ([email protected])
Zulin Wang ([email protected])
Leonid Sigal ([email protected])

¹ School of Electronic and Information Engineering, Beihang University, Beijing, China
² Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

1 Introduction
The foveation mechanism (Matin 1974) of the human visual system (HVS) implies that only a small foveal region is perceived at high resolution and captures most visual attention, while the peripheral regions are perceived at low resolution and receive little attention. To predict human attention, saliency detection has been widely studied in recent years, with applications (Borji and Itti 2013) in object recognition, object segmentation, action recognition, image captioning, and image/video compression, among others. Typically, saliency detection can be classified into saliency prediction (Itti et al. 1998) and salient object detection.
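The foveation mechanism also motivates the well-known center bias of human fixations, which is commonly modeled as an isotropic 2D Gaussian centered on the frame. The snippet below shows this common baseline; it is a generic illustration with an assumed width parameter, not the center-bias dropout used in DeepVS2.0.

```python
# A common center-bias baseline: an isotropic 2D Gaussian prior over the
# frame, often blended with a predicted saliency map. Generic illustration;
# the sigma_ratio value is an assumption.
import numpy as np

def center_bias_map(h: int, w: int, sigma_ratio: float = 0.25) -> np.ndarray:
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = sigma_ratio * min(h, w)
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    return g / g.max()  # peak of 1 at the frame center

prior = center_bias_map(64, 64)  # e.g., blend: 0.8 * saliency + 0.2 * prior
```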