Remote sensing image caption generation via transformer and reinforcement learning
Xiangqing Shen1 · Bing Liu1,2,3 · Yong Zhou1,2 · Jiaqi Zhao2
Received: 18 July 2019 / Revised: 24 June 2020 / Accepted: 29 June 2020 / © Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract Image captioning is the task of generating a natural-language description of a given image, and it plays an essential role in enabling machines to understand image content. Remote sensing image captioning is a subfield of this task. Most current remote sensing image captioning models fail to fully exploit the semantic information in images and suffer from overfitting caused by the small size of the available datasets. To address this, we propose a new model that uses the Transformer to decode image features into target sentences. To adapt the Transformer to the remote sensing image captioning task, we additionally employ dropout layers, residual connections, and adaptive feature fusion in the Transformer. Reinforcement learning is then applied to improve the quality of the generated sentences. We demonstrate the validity of the proposed model on three remote sensing image captioning datasets. Our model obtains higher values for all seven scores on the Sydney dataset and the Remote Sensing Image Caption Dataset (RSICD), and for four scores on the UCM dataset, which indicates that the proposed method performs better than previous state-of-the-art models in remote sensing image caption generation.
Keywords Transformer · Remote sensing image captioning · Attention mechanisms · Convolutional neural network · Reinforcement learning
Bing Liu
[email protected]
1 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221116, Jiangsu province, China
2 Mine Digitization Engineering Research Center of Ministry of Education of the People’s Republic of China, Xuzhou, China
3 Institute of Electrics, Chinese Academy of Sciences, Beijing, 100190, China
Multimedia Tools and Applications
1 Introduction Recently, modern remote sensing technologies have made significant progress thanks to improvements in sensors and computing power, which helps people access geospatial information more easily. Remote sensing images are used in a variety of areas, such as national census, water conservancy construction, oil exploration, mapping, and railway and highway location [37, 38, 47]. High-resolution remote sensing images are now commonly available, and deep neural networks are increasingly applied to their analysis. Deep neural networks have achieved satisfactory results in scene classification [8, 30, 31] and object detection [16, 66]. Despite these successes, existing research usually attaches more importance to the feature level of remote sensing images and fails to capture semantic-level information and the correlations between different objects in the images, which are also crucial to the bett