Reference-based model using multimodal gated recurrent units for image captioning
Tiago do Carmo Nogueira 1 · Cássio Dener Noronha Vinhal 1 · Gélson da Cruz Júnior 1 · Matheus Rudolfo Diedrich Ullmann 1

Received: 29 January 2020 / Revised: 9 July 2020 / Accepted: 4 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Multimedia Tools and Applications
Abstract
Describing images through natural language is a challenging task in the field of computer vision. Image captioning consists of creating image descriptions, which can be accomplished via deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, traditional RNNs suffer from exploding and vanishing gradients and often perform poorly, generating non-descriptive sentences. To address these issues, we propose a model based on the encoder–decoder structure that uses a CNN to extract the image features and multimodal gated recurrent units (GRUs) to generate the descriptions. The model incorporates part-of-speech (PoS) tags and a likelihood function for weight generation in the GRU. It also performs knowledge transfer during a validation phase using the k-nearest neighbors (kNN) technique. Experimental results on the Flickr30k and MSCOCO datasets demonstrate that the proposed PoS-based model achieves scores competitive with state-of-the-art models. The system predicts more descriptive captions, and both the predicted captions and those selected via kNN closely approximate the expected ones.

Keywords: Gated recurrent units · Caption generation references · Convolutional neural network
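To make the PoS-based weighting concrete, the sketch below shows a GRU cell whose update gate is scaled by a per-token PoS weight. This is a minimal illustration assuming a simple multiplicative scaling; the cell name, layer sizes, and the exact way the PoS likelihood enters the gates are assumptions, not the authors' published formulation.

```python
import torch
import torch.nn as nn

class PoSWeightedGRUCell(nn.Module):
    """Hypothetical GRU cell whose update gate is scaled by a
    part-of-speech (PoS) weight. The multiplicative scaling below is
    an assumption for illustration, not the paper's exact method."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.x2h = nn.Linear(input_size, 3 * hidden_size)
        self.h2h = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x, h, pos_weight):
        x_r, x_z, x_n = self.x2h(x).chunk(3, dim=-1)
        h_r, h_z, h_n = self.h2h(h).chunk(3, dim=-1)
        r = torch.sigmoid(x_r + h_r)               # reset gate
        z = torch.sigmoid(x_z + h_z) * pos_weight  # update gate, PoS-scaled (assumed)
        n = torch.tanh(x_n + r * h_n)              # candidate hidden state
        return (1 - z) * h + z * n                 # interpolate (Cho et al. convention)

# One decoding step with an illustrative PoS-derived weight.
cell = PoSWeightedGRUCell(input_size=300, hidden_size=512)
x = torch.randn(1, 300)         # current word embedding
h = torch.zeros(1, 512)         # previous hidden state
h = cell(x, h, pos_weight=0.8)  # 0.8 is a made-up weight for, e.g., a noun
```

Scaling the update gate this way lets tokens with informative PoS tags (e.g., nouns and verbs) contribute more new content to the hidden state, which is one plausible reading of the weighting described in the abstract.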
* Tiago do Carmo Nogueira
[email protected]

Cássio Dener Noronha Vinhal
[email protected]

Gélson da Cruz Júnior
[email protected]

Matheus Rudolfo Diedrich Ullmann
[email protected]
1 School of Electrical, Mechanical and Computer Engineering (EMC), Federal University of Goiás (UFG), Goiânia, Brazil
1 Introduction

Automatic description of images using natural language sentences has garnered significant attention in the domain of computer vision. Image captioning is the task of describing a specific image through a sequence of words or phrases [29, 39, 53]. Machine learning models can accomplish this task: they encode objects and semantic attributes and infer the relationships among them [21]. Although it is easy for humans to understand how these relationships occur, it remains a challenging task for machines, chiefly because a model must both recognize the objects contained in an image and account for the relationships between them [19, 40, 52].

Several works in the literature have proposed an encoder–decoder structure to accomplish this task [19, 21, 22, 30, 32, 39, 53]. In this structure, a convolutional neural network (CNN) serves as the encoder, extracting image features, while a recurrent neural network (RNN) serves as the decoder, generating the image descriptions [39, 53] (a minimal sketch of this structure follows below). The use of models that combine global and local imag
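For readers unfamiliar with this pipeline, the following is a minimal encoder–decoder sketch in PyTorch: a ResNet-50 stands in for the CNN encoder and a single-layer GRU for the decoder. The backbone choice, layer sizes, and the absence of PoS weighting are all simplifying assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Minimal CNN encoder + RNN decoder, sketching the generic
    encoder-decoder structure described above. All sizes and the
    ResNet-50 backbone are illustrative assumptions."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50()                       # encoder CNN (no pretrained weights, for brevity)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.project = nn.Linear(2048, hidden_dim)    # image features -> initial decoder state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)       # (B, 2048) global image features
        h0 = torch.tanh(self.project(feats)).unsqueeze(0)
        emb = self.embed(captions)                    # (B, T, E) word embeddings
        hidden, _ = self.gru(emb, h0)                 # decode conditioned on the image
        return self.out(hidden)                       # (B, T, vocab) word logits per step

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```

During training, the per-step word logits would be compared against the ground-truth caption with a cross-entropy loss; at inference, the caption is generated one word at a time by greedy sampling or beam search.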