Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval
Qingrong Cheng 1 · Xiaodong Gu 1

Received: 12 October 2019 / Revised: 24 June 2020 / Accepted: 28 July 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Recent years have witnessed the swift development of multimedia devices and multimedia technologies. How to retrieve interesting and highly relevant information from the massive amount of multimedia data has become an urgent and challenging problem. To obtain more accurate retrieval results, researchers naturally turn to finer-grained features when evaluating the similarity among multimedia samples. In this paper, we propose a Deep Attentional Fine-grained Similarity Network (DAFSN) for cross-modal retrieval, which is optimized in an adversarial learning manner. The DAFSN model consists of two subnetworks: an attentional fine-grained similarity network for aligned representation learning and a modal discriminative network. The former adopts a Bi-directional Long Short-Term Memory (LSTM) network and a pre-trained Inception-v3 model to extract text features and image features. In aligned representation learning, we consider not only the sentence-level pair-matching constraint but also the fine-grained similarity between the word-level features of a text description and the sub-regional features of an image. The modal discriminative network aims to minimize the "heterogeneity gap" between text features and image features in an adversarial manner. We conduct experiments on several widely used datasets to verify the performance of the proposed DAFSN. The experimental results show that DAFSN obtains better retrieval results under the MAP metric. In addition, result analyses and visual comparisons are presented in the experimental section.

Keywords Attention mechanism · Cross-modal retrieval · Bidirectional LSTM · Fine-grained similarity
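As a rough illustration of the two subnetworks described above, the following PyTorch-style sketch pairs a Bi-directional LSTM text encoder with a word-to-region attention similarity and a small modal discriminator. All class names, dimensions, and the mean-pooling sentence feature are illustrative assumptions rather than the authors' released code, and the sub-regional image features are assumed to be precomputed (e.g. from an Inception-v3 feature map) instead of being extracted inside the snippet.

```python
# Minimal sketch of the DAFSN components described in the abstract (assumed layout).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Bi-directional LSTM over word embeddings -> word-level and sentence-level features."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):                       # tokens: (B, T)
        words = self.bilstm(self.embed(tokens))[0]   # word-level features, (B, T, 2*hidden)
        sentence = words.mean(dim=1)                 # simple pooling as a sentence-level feature
        return words, sentence

def fine_grained_similarity(words, regions):
    """Attention-weighted similarity between word features (B, T, D) and
    precomputed image sub-region features (B, R, D)."""
    words = F.normalize(words, dim=-1)
    regions = F.normalize(regions, dim=-1)
    attn = torch.softmax(torch.bmm(words, regions.transpose(1, 2)), dim=-1)  # (B, T, R)
    attended = torch.bmm(attn, regions)              # each word attends over image regions
    return F.cosine_similarity(words, attended, dim=-1).mean(dim=-1)         # (B,)

class ModalDiscriminator(nn.Module):
    """Predicts whether an embedded feature came from the text or the image branch."""
    def __init__(self, dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feat):
        return self.net(feat)                        # modality logit used for adversarial training
```

In an adversarial training loop of this kind, the encoders would be updated to maximize the fine-grained similarity of matched image-text pairs while fooling the discriminator, which in turn is trained to tell the two modalities apart.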
* Xiaodong Gu, [email protected]
1 Department of Electronic Engineering, Fudan University, Shanghai 200433, China
1 Introduction

With the popularity of mobile devices and the development of next-generation communication technologies such as 5G, multimedia data such as images, text, and voice have grown at an explosive rate. How to effectively acquire content of interest from this massive data has gradually become a challenging issue. Over recent decades, information retrieval has remained a prevalent research direction, covering tasks such as searching for text with an image query or searching for images with an image query. Digital information such as images, text, and voice comprises data of different modalities that may nonetheless carry similar semantic information. Single-modal retrieval [3, 10, 33] searches for information of interest within a single modality, such as retrieving images by image or retrieving texts by text, and earlier retrieval algorithms mainly focused on such single-modal problems. Unlike single-modal retrieval methods, cross-modal retrieval methods [13, 43] aim to