Image captions: global-local and joint signals attention model (GL-JSAM)



Nuzhat Naqvi 1 & ZhongFu Ye 1

Received: 30 May 2019 / Revised: 11 May 2020 / Accepted: 27 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

For automated visual captioning, existing neural encoder-decoder methods commonly use either a simple sequence-to-sequence or an attention-based mechanism. Attention-based models attend to specific visual areas or objects, using a single heat map that indicates which portion of the image is most important rather than treating all objects within the image equally. These models are usually a combination of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures. CNNs generally extract global visual signals that provide only global information about the main objects, their attributes, and their relationships, but fail to provide local (regional) information within objects, such as lines, corners, curves, and edges. On the one hand, missing some of the information and detail carried by local visual signals may lead to misprediction, misidentification of objects, or missing the main object(s) entirely. On the other hand, superfluous visual signals, whether from foreground or background objects, produce meaningless and irrelevant descriptions. To address these concerns, we propose a new joint signals attention image captioning model for global and local signals that is adaptive by nature. The proposed model first extracts global visual signals at the image level and local visual signals at the object level. The joint signal attention model (JSAM) plays a dual role in visual signal extraction and non-visual signal prediction. Initially, JSAM selects meaningful global and regional visual signals, discards irrelevant ones, and integrates the selected signals. Subsequently, in the language model, JSAM decides at each time step whether to attend to visual or non-visual signals in order to generate accurate, descriptive, and fluent sentences. Finally, we examine the efficiency and superiority of the proposed model over recent image captioning models by conducting extensive experiments on the MS-COCO dataset.

Keywords Image captioning · Global and local signals · Soft and hard visual attention · CNN · RNN · LSTM · Faster-RCNN
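The joint attention over global, local, and non-visual signals described above can be made concrete with a minimal sketch. The following PyTorch code is not the authors' implementation: the layer sizes, the shared scoring layer, and the way a non-visual (sentinel) signal is folded into a single softmax are illustrative assumptions, shown only to convey how an image-level CNN feature, Faster R-CNN region features, and a language-side signal might be weighted jointly at each decoding step.

```python
# Minimal sketch (not the authors' code): joint attention over one global CNN
# feature and Faster R-CNN region features, plus an adaptive gate that lets the
# decoder fall back on a non-visual (language) signal at each time step.
# All dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSignalAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)    # project visual signals
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.proj_s = nn.Linear(hidden_dim, attn_dim)  # project non-visual sentinel
        self.score = nn.Linear(attn_dim, 1)            # scalar attention score

    def forward(self, global_feat, region_feats, hidden, sentinel):
        # global_feat:  (B, feat_dim)     image-level signal from a CNN
        # region_feats: (B, R, feat_dim)  object-level signals from Faster R-CNN
        # hidden:       (B, hidden_dim)   current LSTM hidden state
        # sentinel:     (B, hidden_dim)   non-visual "language" signal
        visual = torch.cat([global_feat.unsqueeze(1), region_feats], dim=1)     # (B, 1+R, D)
        h = self.proj_h(hidden).unsqueeze(1)                                    # (B, 1, A)
        v_scores = self.score(torch.tanh(self.proj_v(visual) + h)).squeeze(-1)  # (B, 1+R)
        s_score = self.score(torch.tanh(self.proj_s(sentinel) + h.squeeze(1)))  # (B, 1)
        # One softmax over global feature, regions, and the non-visual signal.
        alpha = F.softmax(torch.cat([v_scores, s_score], dim=1), dim=1)
        beta = alpha[:, -1:]                                 # weight on the non-visual signal
        context = (alpha[:, :-1].unsqueeze(-1) * self.proj_v(visual)).sum(1)
        return beta * self.proj_s(sentinel) + context, alpha

if __name__ == "__main__":
    B, R = 2, 36
    attn = JointSignalAttention()
    ctx, alpha = attn(torch.randn(B, 2048), torch.randn(B, R, 2048),
                      torch.randn(B, 512), torch.randn(B, 512))
    print(ctx.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 38])
```

Folding the sentinel score into the same softmax as the visual scores is one common way to let the decoder trade visual context against language context at each time step; the paper's own formulation may differ in detail.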

* Nuzhat Naqvi
  [email protected]

Extended author information available on the last page of the article


1 Introduction

Automatically generating a caption for a picture is a fundamental problem in computer vision and natural language processing. Translating visual content into natural language with correct grammatical structure [10, 26] is another major challenge. A meaningful visual description requires an algorithm that not only recognizes the objects within an image but also identifies the relationships among them. Correct identification of activities and attributes helps describe the semantic information through natural language [2]. Typical image c