Image captions: global-local and joint signals attention model (GL-JSAM)



Nuzhat Naqvi 1 & ZhongFu Ye 1

Received: 30 May 2019 / Revised: 11 May 2020 / Accepted: 27 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

For automated visual captioning, existing neural encoder-decoder methods commonly use either a simple sequence-to-sequence or an attention-based mechanism. Attention-based models attend to specific visual areas or objects, using a single heat map that indicates which portion of the image is most important rather than treating all objects within the image equally. These models are usually a combination of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) architectures. CNNs generally extract global visual signals that provide only global information about the main objects, their attributes, and their relationships, but fail to provide local (regional) information within objects, such as lines, corners, curves, and edges. On the one hand, missing some of the information and detail carried by local visual signals may lead to misprediction, misidentification of objects, or missing the main object(s) entirely. On the other hand, superfluous visual signals, whether from foreground or background objects, produce meaningless and irrelevant descriptions. To address these concerns, we propose a new joint signals attention image captioning model for global and local signals that is adaptive by nature. The proposed model first extracts global visual signals at the image level and local visual signals at the object level. The joint signal attention model (JSAM) plays a dual role in visual signal extraction and non-visual signal prediction. Initially, JSAM selects meaningful global and regional visual signals, discards irrelevant ones, and integrates the selected signals. Subsequently, in the language model, JSAM decides at each time step whether to attend to visual or non-visual signals in order to generate accurate, descriptive, and fluent sentences. Finally, we examine the efficiency and superiority of the proposed model over recent image captioning models by conducting extensive experiments on the MS-COCO dataset.

Keywords Image captioning · Global and local signals · Soft and hard visual attention · CNN · RNN · LSTM · Faster-RCNN
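The joint attention over global, local, and non-visual signals described above can be made concrete with a minimal sketch. The following PyTorch code is not the authors' implementation: the layer sizes, the shared scoring layer, and the way a non-visual (sentinel) signal is folded into a single softmax are illustrative assumptions, shown only to convey how an image-level CNN feature, Faster R-CNN region features, and a language-side signal might be weighted jointly at each decoding step.

```python
# Minimal sketch (not the authors' code): joint attention over one global CNN
# feature and Faster R-CNN region features, plus an adaptive gate that lets the
# decoder fall back on a non-visual (language) signal at each time step.
# All dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSignalAttention(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, attn_dim)    # project visual signals
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.proj_s = nn.Linear(hidden_dim, attn_dim)  # project non-visual sentinel
        self.score = nn.Linear(attn_dim, 1)            # scalar attention score

    def forward(self, global_feat, region_feats, hidden, sentinel):
        # global_feat:  (B, feat_dim)     image-level signal from a CNN
        # region_feats: (B, R, feat_dim)  object-level signals from Faster R-CNN
        # hidden:       (B, hidden_dim)   current LSTM hidden state
        # sentinel:     (B, hidden_dim)   non-visual "language" signal
        visual = torch.cat([global_feat.unsqueeze(1), region_feats], dim=1)     # (B, 1+R, D)
        h = self.proj_h(hidden).unsqueeze(1)                                    # (B, 1, A)
        v_scores = self.score(torch.tanh(self.proj_v(visual) + h)).squeeze(-1)  # (B, 1+R)
        s_score = self.score(torch.tanh(self.proj_s(sentinel) + h.squeeze(1)))  # (B, 1)
        # One softmax over global feature, regions, and the non-visual signal.
        alpha = F.softmax(torch.cat([v_scores, s_score], dim=1), dim=1)
        beta = alpha[:, -1:]                                 # weight on the non-visual signal
        context = (alpha[:, :-1].unsqueeze(-1) * self.proj_v(visual)).sum(1)
        return beta * self.proj_s(sentinel) + context, alpha

if __name__ == "__main__":
    B, R = 2, 36
    attn = JointSignalAttention()
    ctx, alpha = attn(torch.randn(B, 2048), torch.randn(B, R, 2048),
                      torch.randn(B, 512), torch.randn(B, 512))
    print(ctx.shape, alpha.shape)  # torch.Size([2, 512]) torch.Size([2, 38])
```

Folding the sentinel score into the same softmax as the visual scores is one common way to let the decoder trade visual context against language context at each time step; the paper's own formulation may differ in detail.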

* Nuzhat Naqvi
  [email protected]

Extended author information available on the last page of the article


1 Introduction

Automatically generating a caption for a picture is a fundamental problem in computer vision and natural language processing. Translating visual content into natural language with correct grammatical structure [10, 26] is another major challenge. A meaningful visual description requires an algorithm that not only recognizes the objects within an image but also identifies the relationships among them. Correct identification of activities and attributes helps describe the semantic information through natural language [2]. Typical image c