Autoencoder-based self-supervised hashing for cross-modal retrieval
Yifan Li1 · Xuan Wang1 · Lei Cui1 · Jiajia Zhang1 · Chengkai Huang1 · Xuan Luo1 · Shuhan Qi1

Received: 18 December 2019 / Revised: 17 July 2020 / Accepted: 12 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Cross-modal retrieval has gained much attention in the era of the multimedia data explosion. Owing to their low storage cost and fast retrieval speed, hash learning-based methods have become increasingly popular in this field. The crucial bottlenecks of cross-modal retrieval are twofold: the heterogeneous gap between different modalities and the semantic gap among similar data of various modalities. To address these issues, we adopt a self-supervised fashion to bridge the heterogeneous gap by generating cohesive features for instances of different modalities. To mitigate the semantic gap, we use triplet sampling to optimize both the inter-modal and intra-modal semantic losses, which increases the discriminability of our approach. Experiments on two benchmark datasets show the efficiency and robustness of our method, and extended experiments show its scalability.

Keywords Cross-modal retrieval · Hash learning · Autoencoder · Self-supervised
Shuhan Qi
[email protected]

Yifan Li
[email protected]

Xuan Wang
[email protected]

Lei Cui
[email protected]

Jiajia Zhang
[email protected]

Chengkai Huang
[email protected]

Xuan Luo
[email protected]

1 Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Multimedia Tools and Applications
1 Introduction

For humans, it is instinctive to associate an image or a video clip with relevant sentences or audio. This mechanism of the brain helps us understand and remember new things more comprehensively from multiple sources. In recent years, however, the mobile internet has accelerated the spread of information, and it is not practical for anyone to follow all of it. How to use search engines to precisely retrieve what we really need and are interested in is therefore very important. For instance, if a man wants more information about a cat of which he only has a picture, he can use cross-modal retrieval to easily obtain descriptions and videos related to that cat. Cross-modal retrieval is a method for finding similar data across different modalities. It is an important application for intelligent multimedia, and it helps us take full advantage of the vast amount of multimedia data.

Subspace learning-based methods are widely used in cross-modal retrieval, as shown in Fig. 1, including statistical correlation [1, 24, 25] and modal regularization [11, 29, 32]. However, these methods always need to maintain a large database that stores a real-valued vector for every instance, which requires substantial memory and computational resources during the search procedure. With their rapid search speed and lower storage cost, hash learning-based methods have received much more attention in cross-modal retrieval. Jiang et al. [15] propose the de
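The storage and speed advantage of hashing noted above comes from replacing real-valued feature vectors with short binary codes that are compared by Hamming distance (the number of differing bits). The following is a minimal, generic sketch of this retrieval step, not the paper's method; the function name and the toy 4-bit codes are illustrative only:

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query binary code."""
    # Count the bits where each stored code differs from the query.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    # Stable sort so that ties preserve database order.
    return np.argsort(dists, kind="stable")

# Toy 4-bit codes: item 0 matches the query exactly.
db = np.array([[0, 0, 1, 1],
               [1, 1, 1, 1],
               [0, 0, 0, 0]])
query = np.array([0, 0, 1, 1])
print(hamming_rank(query, db))  # → [0 1 2]
```

Because the comparison is a bitwise count rather than a floating-point distance, such codes can be packed into machine words and searched far faster, and stored far more compactly, than the real-valued vectors used by subspace methods.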