CAESAR: concept augmentation based semantic representation for cross-modal retrieval



Lei Zhu 1,2 · Jiayu Song 1 · Xiangxiang Wei 1,2 · Hao Yu 1 · Jun Long 1,2

Received: 31 December 2019 / Revised: 4 June 2020 / Accepted: 24 September 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
With the increasing amount of multimedia data, cross-modal retrieval has attracted growing attention in the areas of multimedia and computer vision. To bridge the semantic gap between multi-modal data and improve retrieval performance, we propose an effective concept augmentation based method, named CAESAR, which is an end-to-end framework comprising cross-modal correlation learning and concept augmentation based semantic mapping learning. To enhance representation and correlation learning, a novel multi-modal CNN-based CCA model is developed, which captures high-level semantic information during cross-modal feature learning and then captures the maximal nonlinear correlation. In addition, to learn the semantic relationships between multi-modal samples, a concept learning model named CaeNet is proposed, which is realized with word2vec and LDA to capture the close relations between texts and abstract concepts. Reinforced by the abstract concept information, cross-modal semantic mappings are learnt with a semantic alignment strategy. We conduct comprehensive experiments on four benchmark multimedia datasets. The results show that our method achieves strong performance for cross-modal retrieval.

Keywords Cross-modal retrieval · Deep learning · Multi-modal representation learning · Concept augmentation
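The abstract above describes maximizing nonlinear correlation between modalities with a CNN-based CCA model. As a point of reference, the minimal NumPy sketch below computes the sum of canonical correlations between two feature views, which is the quantity that CCA-style objectives maximize; the feature dimensions, regularization constant, and toy data are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cca_correlation(X, Y, reg=1e-4):
    """Sum of canonical correlations between views X (n x dx) and Y (n x dy).

    A minimal linear-CCA sketch of the correlation objective that CCA-style
    models maximize. The regularization value is an illustrative assumption.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)  # center each view
    Yc = Y - Y.mean(axis=0)
    # Regularized within-view covariances and the cross-view covariance
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    # Whiten: T = Sxx^{-1/2} Sxy Syy^{-1/2}; its singular values
    # are the canonical correlations
    Ex, Ux = np.linalg.eigh(Sxx)
    Ey, Uy = np.linalg.eigh(Syy)
    Sxx_inv_sqrt = Ux @ np.diag(Ex ** -0.5) @ Ux.T
    Syy_inv_sqrt = Uy @ np.diag(Ey ** -0.5) @ Uy.T
    T = Sxx_inv_sqrt @ Sxy @ Syy_inv_sqrt
    return np.linalg.svd(T, compute_uv=False).sum()

# Toy usage: correlated random "image" and "text" features from a shared factor
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 10))
X = Z @ rng.normal(size=(10, 64)) + 0.1 * rng.normal(size=(500, 64))
Y = Z @ rng.normal(size=(10, 32)) + 0.1 * rng.normal(size=(500, 32))
print(cca_correlation(X, Y))
```

In a deep CCA setting, X and Y would be the outputs of the image and text networks, and the negative of this correlation would typically serve as the training loss; how CAESAR parameterizes those networks is described in the paper itself.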

Hao Yu
[email protected]

Jun Long
[email protected]

Lei Zhu
[email protected]

Jiayu Song
[email protected]

Xiangxiang Wei
[email protected]

1 School of Computer Science and Engineering, Central South University, Changsha, People’s Republic of China

2 Big Data and Knowledge Engineering Institute, Central South University, Changsha, People’s Republic of China


1 Introduction

With the rapid development of mobile Internet technology and the wide application of multimedia services, massive amounts of multimedia data such as images, texts, audio and videos have been generated and collected by smart devices, and stored and shared on the Internet. For example, as shown in Fig. 1, Wikipedia, one of the largest collaborative online encyclopedias, shares 40 million articles with images in 301 different languages. More than 400 million tweets with texts and images have been generated by the 140 million users of Twitter, the most popular online social network platform. Another famous social networking service, Facebook, reported 350 million photos uploaded daily as of November 2013. Among multimedia sharing services, more than 3.5 million new pictures with textual descriptions were uploaded to Flickr daily as of March 2013. More than 1.9 billion users log into YouTube every month, the largest video sharing website, which stores more than 2 billion videos with descriptions. Other