Reference-based model using multimodal gated recurrent units for image captioning
Tiago do Carmo Nogueira 1 · Cássio Dener Noronha Vinhal 1 · Gélson da Cruz Júnior 1 · Matheus Rudolfo Diedrich Ullmann 1

Received: 29 January 2020 / Revised: 9 July 2020 / Accepted: 4 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Multimedia Tools and Applications
Abstract
Describing images through natural language is a challenging task in the field of computer vision. Image captioning consists of creating image descriptions, which can be accomplished via deep learning architectures that combine convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, traditional RNNs suffer from exploding and vanishing gradients and often perform poorly, generating non-descriptive sentences. To address these issues, we propose a model based on the encoder–decoder structure that uses a CNN to extract the image features and multimodal gated recurrent units (GRUs) to generate the descriptions. The model incorporates part-of-speech (PoS) tags and a likelihood function for weight generation in the GRU. It also performs knowledge transfer during a validation phase using the k-nearest neighbors (kNN) technique. Experimental results on the Flickr30k and MSCOCO datasets demonstrate that the proposed PoS-based model achieves scores competitive with state-of-the-art models. The system predicts more descriptive captions, and both the predicted captions and those selected via kNN closely approximate the expected ones.

Keywords: Gated recurrent units · Caption generation references · Convolutional neural network
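To make the PoS-based weighting concrete, the sketch below shows a GRU cell whose update gate is scaled by a per-token PoS weight. This is a minimal illustration assuming a simple multiplicative scaling; the cell name, layer sizes, and the exact way the PoS likelihood enters the gates are assumptions, not the authors' published formulation.

```python
import torch
import torch.nn as nn

class PoSWeightedGRUCell(nn.Module):
    """Hypothetical GRU cell whose update gate is scaled by a
    part-of-speech (PoS) weight. The multiplicative scaling below is
    an assumption for illustration, not the paper's exact method."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.x2h = nn.Linear(input_size, 3 * hidden_size)
        self.h2h = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x, h, pos_weight):
        x_r, x_z, x_n = self.x2h(x).chunk(3, dim=-1)
        h_r, h_z, h_n = self.h2h(h).chunk(3, dim=-1)
        r = torch.sigmoid(x_r + h_r)               # reset gate
        z = torch.sigmoid(x_z + h_z) * pos_weight  # update gate, PoS-scaled (assumed)
        n = torch.tanh(x_n + r * h_n)              # candidate hidden state
        return (1 - z) * h + z * n                 # interpolate (Cho et al. convention)

# One decoding step with an illustrative PoS-derived weight.
cell = PoSWeightedGRUCell(input_size=300, hidden_size=512)
x = torch.randn(1, 300)         # current word embedding
h = torch.zeros(1, 512)         # previous hidden state
h = cell(x, h, pos_weight=0.8)  # 0.8 is a made-up weight for, e.g., a noun
```

Scaling the update gate this way lets tokens with informative PoS tags (e.g., nouns and verbs) contribute more new content to the hidden state, which is one plausible reading of the weighting described in the abstract.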
* Tiago do Carmo Nogueira
[email protected]

Cássio Dener Noronha Vinhal
[email protected]

Gélson da Cruz Júnior
[email protected]

Matheus Rudolfo Diedrich Ullmann
[email protected]
1 School of Electrical, Mechanical and Computer Engineering (EMC), Federal University of Goiás (UFG), Goiânia, Brazil
1 Introduction

Automatic description of images using natural language sentences has garnered significant attention in the domain of computer vision. Image captioning is the task of describing a specific image through a sequence of words or phrases [29, 39, 53]. Machine learning models can accomplish this task: they encode objects and semantic attributes and infer the relationships among them [21]. Although it is easy for humans to understand how these relationships occur, it remains a challenging task for machines, chiefly because a model must both recognize the objects contained in an image and account for the relationships between them [19, 40, 52].

Several works in the literature have proposed an encoder–decoder structure to accomplish this task [19, 21, 22, 30, 32, 39, 53]. In this structure, a convolutional neural network (CNN) serves as the encoder, extracting image features, while a recurrent neural network (RNN) serves as the decoder, generating the image descriptions [39, 53] (a minimal sketch of this structure follows below). The use of models that combine global and local imag
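For readers unfamiliar with this pipeline, the following is a minimal encoder–decoder sketch in PyTorch: a ResNet-50 stands in for the CNN encoder and a single-layer GRU for the decoder. The backbone choice, layer sizes, and the absence of PoS weighting are all simplifying assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    """Minimal CNN encoder + RNN decoder, sketching the generic
    encoder-decoder structure described above. All sizes and the
    ResNet-50 backbone are illustrative assumptions."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50()                       # encoder CNN (no pretrained weights, for brevity)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.project = nn.Linear(2048, hidden_dim)    # image features -> initial decoder state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)       # (B, 2048) global image features
        h0 = torch.tanh(self.project(feats)).unsqueeze(0)
        emb = self.embed(captions)                    # (B, T, E) word embeddings
        hidden, _ = self.gru(emb, h0)                 # decode conditioned on the image
        return self.out(hidden)                       # (B, T, vocab) word logits per step

model = CaptionModel(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
```

During training, the per-step word logits would be compared against the ground-truth caption with a cross-entropy loss; at inference, the caption is generated one word at a time by greedy sampling or beam search.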