Hierarchical Deep Neural Network for Image Captioning


Yuting Su1 · Yuqian Li1 · Ning Xu1 · An-An Liu1

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract Automatically describing image content with natural language is a fundamental challenge for the computer vision community. General methods use visual information to generate sentences directly. However, visual information alone is not enough to generate fine-grained descriptions for given images. In this paper, we exploit the fusion of visual information and high-level semantic information for image captioning. We propose a hierarchical deep neural network that consists of a bottom layer and a top layer. The former extracts visual and high-level semantic information from the image and its detected regions, respectively, while the latter integrates both with an adaptive attention mechanism for caption generation. The experimental results show competitive performance against state-of-the-art methods on the MSCOCO dataset.

Keywords Regional semantic · Image captioning · Attention mechanism

1 Introduction

The problem of describing images using natural language has garnered significant interest due to the recent development of Recurrent Neural Networks (RNNs) [1,29]. Given an input image, the generated sentence describes the image content, ideally encapsulating its most informative aspects. A wide variety of applications build on such descriptions, ranging from editing, indexing, and search to sharing. Moreover, this work can help visually impaired individuals navigate their surroundings. The task of image captioning has been studied for over a decade, and existing work can be divided into three directions. First, encoder–decoder frameworks [18,37] use a Convolutional Neural Network (CNN) to extract image features, which are then fed into a Recurrent Neural Network (RNN) to generate sentences [1,5,16,29,34]. However, these models generate captions directly from visual information and ignore the high-level semantics within images [42]. Second, attribute-based methods, which capture rich semantic cues, have been verified to be effective for the captioning task [28,38,42].

Corresponding authors:
Ning Xu [email protected]
An-An Liu [email protected]
1 School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China


In other words, capturing fine-grained information is helpful during caption generation. Third, attention-based methods [4,27,29] can improve performance by learning salient regions with deep neural networks. However, existing methods do not bridge semantic and visual information, even though semantic information plays an important role in describing image content. In summary, how to integrate visual and semantic information to generate fine-grained image descriptions has seldom been investigated. In this paper, we propose a hierarchical deep neural network for image captioning.
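To make the first direction concrete, the following is a minimal PyTorch-style sketch of the CNN–RNN encoder–decoder captioning pipeline surveyed above. The ResNet-50 backbone, the LSTM decoder, and all dimensions are illustrative assumptions rather than the configuration used in the cited works.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a fixed-length visual feature vector."""

    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # any CNN backbone would do (assumption)
        # Drop the final classification layer and keep the pooled convolutional features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):               # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                # (B, embed_dim)


class RNNDecoder(nn.Module):
    """Generate a caption word by word, conditioned on the visual feature."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, captions):  # visual_feat: (B, E), captions: (B, T)
        word_emb = self.embed(captions)        # (B, T, E)
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([visual_feat.unsqueeze(1), word_emb], dim=1)
        hidden, _ = self.lstm(inputs)          # (B, T+1, hidden_dim)
        return self.out(hidden)                # per-step word logits
```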
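The fusion performed by the proposed top layer could look roughly like the sketch below, where the current decoder state attends separately over visual region features and regional semantic features, and a learned gate balances the two resulting contexts. The module name, the single gating scalar, and the additive-style attention are assumptions for illustration only, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusionAttention(nn.Module):
    """Fuse visual and regional-semantic contexts with an adaptive gate."""

    def __init__(self, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.vis_att = nn.Linear(hidden_dim + feat_dim, 1)  # scores visual regions
        self.sem_att = nn.Linear(hidden_dim + feat_dim, 1)  # scores semantic regions
        self.gate = nn.Linear(hidden_dim, 1)                # balances the two contexts

    def forward(self, hidden, visual_feats, semantic_feats):
        # hidden:         (B, hidden_dim)    current decoder state
        # visual_feats:   (B, Nv, feat_dim)  CNN region features
        # semantic_feats: (B, Ns, feat_dim)  features of detected regions / attributes
        def attend(scorer, feats):
            h = hidden.unsqueeze(1).expand(-1, feats.size(1), -1)
            scores = scorer(torch.cat([h, feats], dim=-1)).squeeze(-1)  # (B, N)
            weights = F.softmax(scores, dim=-1)
            return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)    # (B, feat_dim)

        visual_ctx = attend(self.vis_att, visual_feats)
        semantic_ctx = attend(self.sem_att, semantic_feats)
        beta = torch.sigmoid(self.gate(hidden))             # (B, 1) adaptive gate
        return beta * visual_ctx + (1.0 - beta) * semantic_ctx
```

At each decoding step, the fused context vector would typically be concatenated with the current word embedding and fed to the language model that predicts the next word.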
