Hierarchical Deep Neural Network for Image Captioning


Yuting Su1 · Yuqian Li1 · Ning Xu1 · An-An Liu1

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract Automatically describing image content with natural language is a fundamental challenge for the computer vision community. General methods use visual information to generate sentences directly. However, visual information alone is not enough to generate fine-grained descriptions for given images. In this paper, we exploit the fusion of visual information and high-level semantic information for image captioning. We propose a hierarchical deep neural network that consists of a bottom layer and a top layer. The former extracts visual and high-level semantic information from the image and its detected regions, respectively, while the latter integrates both with an adaptive attention mechanism for caption generation. The experimental results show competitive performance against state-of-the-art methods on the MSCOCO dataset.

Keywords Regional semantic · Image captioning · Attention mechanism

1 Introduction

The problem of describing images using natural language has garnered significant interest due to the recent development of Recurrent Neural Networks (RNNs) [1,29]. Given an input image, the generated sentence describes the image content, ideally encapsulating its most informative aspects. A wide variety of applications build on such descriptions, ranging from editing, indexing, and search to sharing. Moreover, this work can help visually impaired individuals navigate their surroundings. The task of image captioning has been studied for over a decade, and existing work can be divided into three directions. First, encoder–decoder frameworks [18,37] use a Convolutional Neural Network (CNN) to extract image features, which are then fed into a Recurrent Neural Network (RNN) to generate sentences [1,5,16,29,34]. However, these models generate captions directly from visual information and ignore the high-level semantics within images [42]. Second, attribute-based methods, which capture rich semantic cues, have been verified to be effective for the captioning task [28,38,42].

Corresponding authors:
Ning Xu [email protected]
An-An Liu [email protected]
1 School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China


In other words, capturing fine-grained information is helpful during caption generation. Third, attention-based methods [4,27,29] can improve performance by learning salient regions with deep neural networks. However, existing methods do not bridge semantic and visual information, even though semantic information plays an important role in describing image content. In summary, how to integrate visual and semantic information to generate fine-grained image descriptions has seldom been investigated. In this paper, we propose a hierarchical deep neural network for image captioning.
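To make the first direction concrete, the following is a minimal PyTorch-style sketch of the CNN–RNN encoder–decoder captioning pipeline surveyed above. The ResNet-50 backbone, the LSTM decoder, and all dimensions are illustrative assumptions rather than the configuration used in the cited works.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    """Encode an image into a fixed-length visual feature vector."""

    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # any CNN backbone would do (assumption)
        # Drop the final classification layer and keep the pooled convolutional features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.fc = nn.Linear(backbone.fc.in_features, embed_dim)

    def forward(self, images):               # images: (B, 3, H, W)
        feats = self.cnn(images).flatten(1)  # (B, 2048)
        return self.fc(feats)                # (B, embed_dim)


class RNNDecoder(nn.Module):
    """Generate a caption word by word, conditioned on the visual feature."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, visual_feat, captions):  # visual_feat: (B, E), captions: (B, T)
        word_emb = self.embed(captions)        # (B, T, E)
        # Prepend the image feature as the first "token" of the input sequence.
        inputs = torch.cat([visual_feat.unsqueeze(1), word_emb], dim=1)
        hidden, _ = self.lstm(inputs)          # (B, T+1, hidden_dim)
        return self.out(hidden)                # per-step word logits
```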
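The fusion performed by the proposed top layer could look roughly like the sketch below, where the current decoder state attends separately over visual region features and regional semantic features, and a learned gate balances the two resulting contexts. The module name, the single gating scalar, and the additive-style attention are assumptions for illustration only, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveFusionAttention(nn.Module):
    """Fuse visual and regional-semantic contexts with an adaptive gate."""

    def __init__(self, hidden_dim=512, feat_dim=512):
        super().__init__()
        self.vis_att = nn.Linear(hidden_dim + feat_dim, 1)  # scores visual regions
        self.sem_att = nn.Linear(hidden_dim + feat_dim, 1)  # scores semantic regions
        self.gate = nn.Linear(hidden_dim, 1)                # balances the two contexts

    def forward(self, hidden, visual_feats, semantic_feats):
        # hidden:         (B, hidden_dim)    current decoder state
        # visual_feats:   (B, Nv, feat_dim)  CNN region features
        # semantic_feats: (B, Ns, feat_dim)  features of detected regions / attributes
        def attend(scorer, feats):
            h = hidden.unsqueeze(1).expand(-1, feats.size(1), -1)
            scores = scorer(torch.cat([h, feats], dim=-1)).squeeze(-1)  # (B, N)
            weights = F.softmax(scores, dim=-1)
            return torch.bmm(weights.unsqueeze(1), feats).squeeze(1)    # (B, feat_dim)

        visual_ctx = attend(self.vis_att, visual_feats)
        semantic_ctx = attend(self.sem_att, semantic_feats)
        beta = torch.sigmoid(self.gate(hidden))             # (B, 1) adaptive gate
        return beta * visual_ctx + (1.0 - beta) * semantic_ctx
```

At each decoding step, the fused context vector would typically be concatenated with the current word embedding and fed to the language model that predicts the next word.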
