STA-Net: spatial-temporal attention network for video salient object detection



Hong-Bo Bi1 · Di Lu1 · Hui-Hui Zhu1 · Li-Na Yang1 · Hua-Ping Guan2

Accepted: 18 September 2020 © Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

This paper conducts a systematic study on the role of spatial and temporal attention mechanisms in the video salient object detection (VSOD) task. We present a two-stage spatial-temporal attention network, named STA-Net, which makes two major contributions. In the first stage, we devise a Multi-Scale-Spatial-Attention (MSSA) module that reduces the calculation cost on non-salient regions while exploiting multi-scale saliency information. Such a sliced attention method offers an efficient way to exploit the high-level features of the network with an enlarged receptive field. In the second stage, we propose a Pyramid-Saliency-Shift-Aware (PSSA) module, which emphasizes dynamic object information, since object motion offers a valid shift cue for confirming salient objects and capturing temporal information. Such a temporal detection module encourages precise salient region detection. Exhaustive experiments show that the proposed STA-Net is effective for the video salient object detection task and achieves compelling performance in comparison with state-of-the-art methods.

Keywords Multi-scale · Video salient object detection · Attention · Pyramid
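The MSSA module itself is not specified in this excerpt; purely as an illustration of the general idea it names — gating a feature map with attention scores pooled at several spatial scales — the following NumPy sketch may help. The function name, the sigmoid gate, and the scale set are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_scale_spatial_attention(feat, scales=(1, 2, 4)):
    """Toy multi-scale spatial attention over a (C, H, W) feature map.

    For each scale s, average-pool the map by a factor of s, derive a
    per-position attention score from the pooled channel energy, upsample
    the score back to (H, W), and average the scores over all scales.
    The input is then gated by the resulting attention map.
    Assumes H and W are divisible by every scale.
    """
    C, H, W = feat.shape
    att = np.zeros((H, W))
    for s in scales:
        # non-overlapping s x s average pooling
        pooled = feat.reshape(C, H // s, s, W // s, s).mean(axis=(2, 4))
        # sigmoid gate on the channel-mean energy at the coarse resolution
        score = sigmoid(pooled.mean(axis=0))
        # nearest-neighbour upsample back to (H, W)
        att += np.kron(score, np.ones((s, s)))
    att /= len(scales)
    return feat * att  # gated features, same shape as the input
```

In a real network the pooled scores would come from learned convolutions rather than a fixed channel mean, but the multi-scale pool/gate/upsample structure is the same.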

1 Introduction

1.1 Traditional VSOD models

Video salient object detection (VSOD) aims at finding ‘where’ the visually outstanding object instances are in a given video. This paper focuses on VSOD, which underpins related visual tasks such as video segmentation [1], video tracking [2], video captioning [3], and video compression [4]. Existing methods are usually grouped into two categories: traditional models and deep learning models. For a long time, low-level features and their variants were the dominant tools of traditional VSOD. With the rapid progress of artificial intelligence, VSOD models based on deep convolutional neural networks have yielded remarkable results and become the new trend.

Traditional VSOD methods [5–7] usually employ hand-crafted low-level features and other mathematical models to automatically locate salient foreground objects. Such hand-crafted features cannot effectively describe the integral structure of salient objects, which leads to false and missed detections. Therefore, how to quickly and accurately capture salient objects has attracted researchers’ attention. Traditional methods for video salient object detection usually originate from background priors [8], center-surround contrast [9], feature integration [10], and cognitive theories of visual attention, combining these theories through different computational mechanisms. However, such methods fall short in real-time performance, are resource-demanding, and cannot completely locate salient objects. For example, a salient object detector based on
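Of the classic cues listed above, center-surround contrast [9] is perhaps the simplest to make concrete: a pixel is salient to the extent that its local neighbourhood differs from a larger surrounding region. The sketch below is a minimal, hypothetical version of that idea on a grayscale image; the window sizes and the plain box filters are illustrative choices, not the cited method.

```python
import numpy as np

def center_surround_contrast(img, center=3, surround=9):
    """Toy center-surround contrast saliency for a 2-D grayscale image.

    Saliency at each pixel is the absolute difference between the mean
    of a small center window and of a larger surround window around it.
    """
    def box_mean(x, k):
        # k x k box filter via padded 2-D cumulative sums
        pad = k // 2
        p = np.pad(x, pad, mode='edge')
        c = np.cumsum(np.cumsum(p, axis=0), axis=1)
        c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column for window sums
        H, W = x.shape
        return (c[k:k + H, k:k + W] - c[:H, k:k + W]
                - c[k:k + H, :W] + c[:H, :W]) / (k * k)

    return np.abs(box_mean(img, center) - box_mean(img, surround))
```

A uniform image yields zero contrast everywhere, while an isolated bright blob scores highly at its location, which is exactly the behaviour a contrast-based prior relies on.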