Semantics-preserving hashing based on multi-scale fusion for cross-modal retrieval

Hong Zhang 1,2 · Min Pan 1,2

Received: 13 March 2020 / Revised: 18 August 2020 / Accepted: 11 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract

Hash-based cross-modal retrieval has become a research hotspot in the field of content-based multimedia retrieval. Most deep cross-modal hashing methods consider only the inter-modal loss, which preserves the local information of the training data, and ignore the loss within data samples of the same modality, which preserves the global information of the dataset. They also ignore the fact that different scales of single-modal data carry different semantic information, which affects the representation of data features. In this paper, we propose a semantics-preserving hashing method based on multi-scale fusion. More concretely, a multi-scale fusion pooling model is introduced into both the image feature training network and the text feature training network, so that we can extract multi-scale features of the image dataset and alleviate the sparsity of the text bag-of-words (BOW) vectors. When constructing the loss function, we consider the intra-modal loss alongside the inter-modal loss, so the output hash codes retain both the global and the local underlying semantic correlations when the image and text feature training networks are trained. Experimental results on NUS-WIDE and MIRFlickr-25K show that our algorithm improves cross-modal retrieval accuracy over existing methods.

Keywords Cross-modal retrieval · Multi-scale fusion · Hash learning · Semantics preserving · Deep learning
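To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, assuming a DCMH-style pairwise likelihood objective; the module structure, pooling scales, dimensions, and the weight alpha are hypothetical illustrations, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusionPooling(nn.Module):
    """Pools a feature map at several grid sizes and fuses the results.

    Hypothetical module: the scales, channel count, and fusion layer are
    illustrative, not taken from the paper.
    """
    def __init__(self, in_channels=512, scales=(1, 2, 4), out_dim=1024):
        super().__init__()
        self.scales = scales
        fused_dim = in_channels * sum(s * s for s in scales)
        self.fuse = nn.Linear(fused_dim, out_dim)

    def forward(self, x):                          # x: (B, C, H, W)
        pooled = [F.adaptive_avg_pool2d(x, s).flatten(1) for s in self.scales]
        return self.fuse(torch.cat(pooled, dim=1))

def pairwise_nll(f, g, S):
    """Negative log-likelihood of the pairwise similarity S (1 = shared label)."""
    theta = 0.5 * f @ g.t()                        # pairwise inner products
    return (F.softplus(theta) - S * theta).mean()  # -log sigmoid likelihood

def total_loss(img_feat, txt_feat, S, alpha=1.0):
    inter = pairwise_nll(img_feat, txt_feat, S)       # image <-> text (local)
    intra = (pairwise_nll(img_feat, img_feat, S)
             + pairwise_nll(txt_feat, txt_feat, S))   # within each modality (global)
    return inter + alpha * intra
```

The intra-modal terms constrain pairs drawn from the same modality, which is one way to inject the dataset-level (global) structure the abstract refers to alongside the cross-modal (local) pairwise constraints.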

1 Introduction

Development in information technology has led to the explosive growth of multimedia data. At the same time, people's demand for information search that returns diverse results is increasing.

* Hong Zhang
  [email protected]

1 College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan 430081, China

2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China


Therefore, research on multimedia data analysis and cross-modal retrieval technology [18, 23, 33, 19, 31, 20, 32, 15] continues to grow. Cross-modal retrieval means that all relevant data of other modalities are accurately and quickly retrieved using the data of one modality as the query. Hash learning is widely used in cross-modal retrieval models [27, 21, 29, 1] because of its low storage cost and efficient retrieval. Over the past few decades, many hashing methods have been developed for single-modal retrieval [25, 22, 16, 14, 8, 13, 35]. However, these methods are not suitable for cross-modal hash retrieval because of the semantic gap between data of different modalities. Most existing cross-modal hashing methods [34] bridge the semantic gap by mining the correlations among data of different modalities. The main cross-modal hashing methods can be divided into two categories: deep cross-modal hashing
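As a toy illustration of why hash codes give low storage and efficient retrieval (generic to hashing-based retrieval, not any particular method cited above): a 64-bit code stores an item in 8 bytes, and ranking a database reduces to cheap Hamming-distance comparisons. All names and sizes below are hypothetical.

```python
import numpy as np

# Rank database items by Hamming distance between binary hash codes.
def hamming_distances(query_code, db_codes):
    """query_code: (k,) 0/1 array; db_codes: (N, k) 0/1 array."""
    return np.count_nonzero(db_codes != query_code, axis=1)

rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(100_000, 64), dtype=np.uint8)  # 100k items, 64-bit codes
query = rng.integers(0, 2, size=64, dtype=np.uint8)
top10 = np.argsort(hamming_distances(query, db_codes))[:10]        # ten nearest codes
```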