Autoencoder-based self-supervised hashing for cross-modal retrieval
Yifan Li1 · Xuan Wang1 · Lei Cui1 · Jiajia Zhang1 · Chengkai Huang1 · Xuan Luo1 · Shuhan Qi1

Received: 18 December 2019 / Revised: 17 July 2020 / Accepted: 12 August 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Cross-modal retrieval has gained much attention in the era of the multimedia data explosion. Owing to their low storage cost and fast retrieval speed, hash learning-based methods have become increasingly popular in this field. The crucial bottlenecks of cross-modal retrieval are twofold: the heterogeneous gap between different modalities and the semantic gap among similar data of various modalities. To address these issues, we adopt a self-supervised fashion to bridge the heterogeneous gap by generating cohesive features for instances of different modalities. To mitigate the semantic gap, we use triplet sampling to optimize both the inter-modal and intra-modal semantic losses, which increases the discriminability of our approach. Experiments on two benchmark datasets show the efficiency and robustness of our method, and extended experiments show its scalability.

Keywords Cross-modal retrieval · Hash learning · Autoencoder · Self-supervised
Shuhan Qi
[email protected]

Yifan Li
[email protected]

Xuan Wang
[email protected]

Lei Cui
[email protected]

Jiajia Zhang
[email protected]

Chengkai Huang
[email protected]

Xuan Luo
[email protected]

1 Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Multimedia Tools and Applications
1 Introduction

For humans, it is instinctive to associate an image or a video clip with relevant sentences or audio. This mechanism of the brain helps us understand and remember new things more comprehensively from multiple sources. In recent years, however, the mobile internet has accelerated the spread of information, and it is not practical for anyone to follow all of it. How to use search engines to precisely retrieve what we really need and are interested in is therefore very important. For instance, if a man wants more information about a cat of which he only has a picture, he can use cross-modal retrieval to easily obtain descriptions and videos related to that cat. Cross-modal retrieval is a method for finding similar data across different modalities. It is an important application for intelligent multimedia, and it helps us take full advantage of the vast amount of multimedia data.

Subspace learning-based methods are widely used in cross-modal retrieval, as shown in Fig. 1, including statistical correlation [1, 24, 25] and modal regularization [11, 29, 32]. However, these methods always need to maintain a large database that stores a real-valued vector for every instance, which requires substantial memory and computational resources during the search procedure. With their rapid search speed and lower storage cost, hash learning-based methods have received much more attention in cross-modal retrieval. Jiang et al. [15] propose the de
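The storage and speed advantage of hashing noted above comes from replacing real-valued feature vectors with short binary codes that are compared by Hamming distance (the number of differing bits). The following is a minimal, generic sketch of this retrieval step, not the paper's method; the function name and the toy 4-bit codes are illustrative only:

```python
import numpy as np

def hamming_rank(query_code, db_codes):
    """Rank database items by Hamming distance to a query binary code."""
    # Count the bits where each stored code differs from the query.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    # Stable sort so that ties preserve database order.
    return np.argsort(dists, kind="stable")

# Toy 4-bit codes: item 0 matches the query exactly.
db = np.array([[0, 0, 1, 1],
               [1, 1, 1, 1],
               [0, 0, 0, 0]])
query = np.array([0, 0, 1, 1])
print(hamming_rank(query, db))  # → [0 1 2]
```

Because the comparison is a bitwise count rather than a floating-point distance, such codes can be packed into machine words and searched far faster, and stored far more compactly, than the real-valued vectors used by subspace methods.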