Multi-level Net: A Visual Saliency Prediction Model



Abstract. State of the art approaches for saliency prediction are based on Fully Convolutional Networks, in which saliency maps are built using the last layer. In contrast, we here present a novel model that predicts saliency maps by exploiting a non-linear combination of features coming from different layers of the network. We also present a new loss function to deal with the imbalance issue on saliency masks. Extensive results on three public datasets demonstrate the robustness of our solution. Our model outperforms the state of the art on SALICON, which is the largest unconstrained dataset available, and obtains competitive results on the MIT300 and CAT2000 benchmarks.

Keywords: Visual saliency · Saliency prediction · Convolutional neural network · Deep learning
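The abstract mentions a loss function designed for the imbalance on saliency masks, where most pixels are non-salient. As a purely illustrative sketch (not the paper's exact formulation), one plausible choice is a per-pixel squared error whose weight grows with the ground-truth saliency, so the few highly salient pixels are not drowned out by the background; the weighting scheme and the constant `alpha` below are assumptions for illustration:

```python
import numpy as np

def weighted_saliency_mse(pred, target, alpha=1.1):
    """Illustrative imbalance-aware loss for saliency prediction.

    Each pixel's squared error is weighted by 1 / (alpha - target),
    with target in [0, 1] and alpha slightly above 1, so errors on
    highly salient (rare) pixels cost more than errors on the
    (abundant) non-salient background.
    """
    weights = 1.0 / (alpha - target)          # salient pixels -> large weight
    return float(np.mean(weights * (pred - target) ** 2))
```

For example, with `alpha = 1.1` a pixel whose ground-truth saliency is 1 is weighted 10, while a background pixel (saliency 0) is weighted roughly 0.9, an eleven-fold difference.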

1 Introduction

For many applications in image and video compression, video re-targeting and object segmentation, estimating where humans look in a scene is an essential step [6,9,22]. Neuroscientists [2], and more recently computer vision researchers [13], have proposed computational saliency models to predict eye fixations over images. Most traditional approaches cope with this task by defining hand-crafted and multi-scale features that capture a large spectrum of stimuli: lower-level features (color, texture, contrast) [11] or higher-level concepts (faces, people, text, horizon) [5]. In addition, since there is a strong tendency to look more frequently around the center of the scene than around the periphery [33], some techniques incorporate hand-crafted priors into saliency maps [19,20,35,36]. Unfortunately, eye fixations depend on several aspects, and this makes it difficult to design appropriate hand-crafted features.

Deep learning techniques, with their ability to automatically learn appropriate features from massive annotated data, have shown impressive results in several vision applications such as image classification [18] and semantic segmentation [24]. First attempts to define saliency models with deep convolutional networks have recently been presented [20,35]. However, due to the small amount of training data in this scenario, researchers have presented networks with few layers or pretrained in other contexts. With the publication of the large SALICON dataset [12], collected thanks to crowd-sourcing techniques, researchers have then increased the number of convolutional layers, reducing the overfitting risk [19,25].

In this paper we present a general deep learning framework to predict saliency maps, called ML-Net. Differently from previous deep learning approaches, which build saliency maps from the last convolutional layer only, we propose a network that is able to combine multiple features coming from different layers of the network. The proposed solution is also able to learn its own prior from the training data.

© Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part II, LNCS 9914, pp. 302–315, 2016. DOI: 10.1007/978-3-319-48881-3_21
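The core idea of combining features from different layers can be sketched in a few lines. The following is a minimal numpy illustration under stated assumptions, not the authors' exact architecture: feature maps from different depths (and hence different spatial resolutions) are resized to a common grid, stacked, and merged by a learned weighted sum (the equivalent of a 1x1 convolution) followed by a ReLU, yielding a single-channel saliency map. The nearest-neighbor resize and the specific merge are illustrative choices:

```python
import numpy as np

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of a 2-D feature map to (out_h, out_w)."""
    h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return fmap[rows][:, cols]

def combine_multilevel_features(feature_maps, weights, bias, out_shape):
    """Non-linearly combine feature maps taken from different layers.

    Each map is resized to a common spatial size, the maps are stacked,
    and a learned per-map weight vector merges them (a 1x1 convolution
    over the stacked channels), followed by a ReLU non-linearity.
    """
    out_h, out_w = out_shape
    stacked = np.stack([resize_nearest(f, out_h, out_w) for f in feature_maps])
    merged = np.tensordot(weights, stacked, axes=1) + bias  # 1x1 conv merge
    return np.maximum(merged, 0.0)                          # ReLU
```

In a real network the weights and bias would be learned end-to-end along with the convolutional layers; the sketch only shows why the output is a non-linear, rather than purely additive, combination of the multi-level features.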