Boosting image caption generation with feature fusion module


Pengfei Xia1 · Jingsong He1 · Jin Yin1

Received: 12 June 2019 / Revised: 5 April 2020 / Accepted: 27 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract Image caption generation has been considered a key issue in vision-to-language tasks. Using classification models such as AlexNet, VGG, and ResNet as the encoder to extract image features is common in previous work. However, there is an explicit gap between the image features required by the caption task and those required by the classification task, and this gap has not received wide attention. In this paper, we propose a novel custom structure, named the feature fusion module (FFM), to make the features extracted by the encoder more suitable for the caption task. We evaluate the proposed module with two typical models, NIC (Neural Image Caption) and SA (Soft Attention), on two popular benchmarks, MS COCO and Flickr30k. We consistently observe that FFM boosts performance and outperforms state-of-the-art methods on five metrics.

Keywords Image caption · Feature fusion module · Encoder-decoder model
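To make the setting concrete, the sketch below (PyTorch, not taken from the paper) shows the usual encoder-decoder captioning pipeline into which such a module would be inserted: a classification CNN backbone supplies the feature map, a fusion block transforms those features before they reach the decoder, and an NIC-style LSTM decoder produces word logits. The `SimpleFusion` block, vocabulary size, and dimensions are illustrative placeholders, not the FFM design described in this paper.

```python
# Minimal sketch of a CNN encoder + pluggable fusion block + LSTM caption decoder.
# SimpleFusion is a stand-in for the paper's FFM; its internals are assumed here.
import torch
import torch.nn as nn
import torchvision.models as models


class Encoder(nn.Module):
    """ResNet backbone used as a feature extractor (spatial map kept)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=None)          # pretrained weights optional
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool, fc

    def forward(self, images):                            # (B, 3, H, W)
        return self.cnn(images)                           # (B, 2048, h, w)


class SimpleFusion(nn.Module):
    """Placeholder fusion block: mixes the raw feature map with a 1x1-conv copy.
    Only illustrates where an FFM-like module sits, not how the paper builds it."""
    def __init__(self, channels=2048):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feats):
        return torch.relu(feats + self.transform(feats))


class Decoder(nn.Module):
    """NIC-style LSTM decoder conditioned on the globally pooled image feature."""
    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)
        self.init_c = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):                   # feats: (B, 2048, h, w)
        pooled = feats.mean(dim=(2, 3))                   # global average pooling
        h0 = self.init_h(pooled).unsqueeze(0)             # (1, B, hidden)
        c0 = self.init_c(pooled).unsqueeze(0)
        emb = self.embed(captions)                        # (B, T, embed)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                           # (B, T, vocab) word logits


if __name__ == "__main__":
    encoder, fusion, decoder = Encoder(), SimpleFusion(), Decoder()
    images = torch.randn(2, 3, 224, 224)
    captions = torch.randint(0, 10000, (2, 12))
    logits = decoder(fusion(encoder(images)), captions)
    print(logits.shape)                                   # torch.Size([2, 12, 10000])
```

In this arrangement the fusion block is the only part that changes when a different feature adaptation is tried, which is how a module like FFM can be evaluated with both NIC-style and attention-based decoders.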

✉ Jingsong He
[email protected]

Pengfei Xia
[email protected]

Jin Yin
[email protected]

1 University of Science and Technology of China, Hefei, China

1 Introduction

In recent years, vision-to-language problems, which require a combination of linguistic and visual information [7], have attracted a great deal of attention. Among the many such tasks, e.g., visual question answering [3] and video interpreting [61], image caption generation has been considered a key issue, whose purpose lies in generating meaningful sentences for a given image. Describing the content of an image is straightforward for a human because of our remarkable ability to refine information. However, it is difficult for machines, since the generation model must not only capture, identify, and understand objects, but also be powerful enough to describe their relationships in natural language [57]. Image captioning has therefore been a challenging task in computer vision and deep learning.

There are many practical applications of image captioning. Generating descriptions of the surroundings can help visually impaired people perceive information as well as sighted people can. Besides, an image caption system can help people or companies manage the increasing amounts of multimedia data. Visual understanding and description are also a crucial part of robot interaction.

Many methods have been applied to this task, and they can be roughly classified into three categories, i.e., retrieval-based methods [21, 33, 49], template-based methods [17, 32, 43], and neural-based methods [27, 53, 57]. Nowadays, due to the development of deep learning, neural-based methods have become the most popular solutions and have achieved state-of-the-art performance. Inspired by sequence-to-sequence learning in machine translation [13, 50], neural-based methods regard the task as a special translation fr