Cross-modal subspace learning via kernel correlation maximization and discriminative structure-preserving



Multimedia Tools and Applications

Jun Yu¹,² · Xiao-Jun Wu¹,² (✉)

Received: 23 April 2019 / Revised: 10 December 2019 / Accepted: 23 April 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

¹ The School of Artificial Intelligence and Computer Science, Jiangnan University, 214122 Wuxi, China
² The Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, 214122 Wuxi, China

Abstract

How to measure the distance between heterogeneous data is still an open problem. Many works have been developed to learn a common subspace in which the similarity between different modalities can be calculated directly. However, most existing works focus on learning a latent subspace while the semantically structural information is not well preserved, so these approaches cannot achieve the desired results. In this paper, we propose a novel framework, termed Cross-modal subspace learning via Kernel correlation maximization and Discriminative structure-preserving (CKD), to solve this problem in two respects. First, we construct a shared semantic graph so that the data of each modality preserve the semantic neighbor relationship. Second, we introduce the Hilbert-Schmidt Independence Criterion (HSIC) to ensure the consistency between the feature similarity and the semantic similarity of samples. Our model not only considers the inter-modality correlation by maximizing the kernel correlation but also preserves the semantically structural information within each modality. Extensive experiments on three public datasets demonstrate that the proposed CKD is competitive with classic subspace learning methods.

Keywords Cross-modal retrieval · Subspace learning · Kernel correlation · Discriminative · HSIC
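For reference, the HSIC mentioned in the abstract has a standard (biased) empirical estimator, HSIC(K, L) = (n − 1)⁻² tr(KHLH), where K and L are kernel Gram matrices of the two views and H = I − (1/n)11ᵀ is the centering matrix (Gretton et al., 2005). The following is a minimal sketch of that estimator; the RBF kernel, its width, and the toy data are illustrative assumptions, not the configuration used in this paper.

import numpy as np

def rbf_kernel(X, sigma=1.0):
    # Gram matrix of the Gaussian (RBF) kernel over the rows of X.
    sq = np.sum(X**2, axis=1, keepdims=True) + np.sum(X**2, axis=1) - 2.0 * X @ X.T
    return np.exp(-sq / (2.0 * sigma**2))

def hsic(K, L):
    # Biased empirical HSIC: tr(K H L H) / (n - 1)^2, with H = I - (1/n) 1 1^T.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Toy check on hypothetical data: HSIC grows with statistical dependence
# between two views and stays near zero for independent views.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # e.g. image features
Y_dep = X @ rng.normal(size=(10, 5))     # view dependent on X
Y_ind = rng.normal(size=(200, 5))        # view independent of X
K = rbf_kernel(X, sigma=3.0)
print(hsic(K, rbf_kernel(Y_dep, sigma=3.0)))   # comparatively large
print(hsic(K, rbf_kernel(Y_ind, sigma=3.0)))   # near zero

Maximizing such a term between a learned representation and a semantic (label) kernel is one way to enforce the feature-similarity/semantic-similarity consistency the abstract describes.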

1 Introduction

Recently, the rapid development of the Internet and the explosive growth of multimedia data, including text, images, video, and audio, have greatly enriched people's lives but have also magnified the challenge of information retrieval. Representative image retrieval methods, such as region-based image retrieval [43], color-based image retrieval [4], the Contour Points Distribution Histogram (CPDH) [28], Inverse Document Frequency (IDF) [44], and content-based image retrieval [19], cannot be applied directly to multimodal retrieval. Multimodal data refers to data of different types that share the same semantic content, for example, the video clips, music, photos, and tweets recording a concert. Cross-modal retrieval, which takes one type of data as the query to return relevant data of another type, has attracted much attention. Cross-modal retrieval methods need to solve a basic problem, i.e., how to measure the relevance between heterogeneous modalities. There are two strategies to solve this problem: one is to directly calculate t