A unified cycle-consistent neural model for text and image retrieval
Marcella Cornia1 · Lorenzo Baraldi1 · Hamed R. Tavakoli2 · Rita Cucchiara1

1 Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy
2 Nokia Technologies, Espoo, Finland

Received: 3 May 2019 / Revised: 30 April 2020 / Accepted: 24 June 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Text-image retrieval has recently become a hot research topic, thanks to the development of deeply-learnable architectures that can retrieve visual items given textual queries and vice versa. The key idea of many state-of-the-art approaches is to learn a joint multi-modal embedding space into which text and images can be projected and compared. Here we take a different approach and reformulate text-image retrieval as the problem of learning a translation between the textual and visual domains. Our proposal leverages an end-to-end trainable architecture that can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Experimental evaluations for text-to-image and image-to-text retrieval, conducted on small-, medium- and large-scale datasets, show consistent improvements over the baselines, thus confirming the appropriateness of a cycle-consistency constraint for the text-image matching task.

Keywords Text-image cross retrieval · Cycle-consistency · Visual-semantic models
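As a concrete illustration of the idea summarized above, the following PyTorch-style sketch shows one possible way to translate features between the two domains and to regularize the mapping with a cycle-consistency term. This is not the authors' implementation: the module structure, dimensions, and loss weighting (img_dim, txt_dim, hidden, lambda_cyc) are assumptions made for the example.

```python
# Illustrative sketch only: text-to-image and image-to-text feature translators
# regularized with a cycle-consistency term. All names and dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTranslator(nn.Module):
    """Maps features from one modality to the other with a small MLP."""
    def __init__(self, in_dim, out_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

def cycle_consistent_loss(img_feats, txt_feats, t2i, i2t, lambda_cyc=1.0):
    """Translation losses plus a cycle-consistency regularizer.

    img_feats: (B, img_dim) visual features; txt_feats: (B, txt_dim) text features.
    t2i translates text -> image space, i2t translates image -> text space.
    """
    # Direct translation: translated features should match the paired features.
    loss_t2i = F.mse_loss(t2i(txt_feats), img_feats)
    loss_i2t = F.mse_loss(i2t(img_feats), txt_feats)
    # Cycle terms: translating forth and back should recover the original input.
    loss_cyc = (F.mse_loss(i2t(t2i(txt_feats)), txt_feats) +
                F.mse_loss(t2i(i2t(img_feats)), img_feats))
    return loss_t2i + loss_i2t + lambda_cyc * loss_cyc
```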
1 Introduction

Matching visual data and natural language is a challenging problem in multimedia. It is a crucial step towards machine intelligence and thus a hot research topic, as it facilitates
a vast range of different applications, including retrieval [9, 12, 15, 45], visual question answering [31, 35, 51, 55], and image and video captioning [1, 10, 23, 37, 42, 59]. Text-image cross retrieval is one of the core challenges in this regard. The task concerns the retrieval of visual items given textual queries and vice versa, and can be cast as a ranking problem, in which the correct item should be closer to the query than any other element in the dataset (Fig. 1). Since visual and textual data belong to two distinct modalities, previous methods have often relied on the construction of a common multi-modal embedding space [15, 25, 33, 56], with learnable functions that project data from the two modalities into the joint embedding. Retrieval, in this case, is then carried out by measuring distances in the joint space, which should be low for matching text-image pairs and large for non-matching pairs. Although approaches based on a common visual-semantic embedding have led to state-of-the-art results, we here foresee and investigate a different approach. Specifically, we take the problem of retrieving images and c
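To make the joint-embedding baseline described above concrete, the sketch below shows how retrieval reduces to ranking by cosine similarity in a shared space, with a hinge-based triplet ranking loss that pushes matching text-image pairs closer than non-matching ones. This is a minimal illustration under assumptions (in-batch negatives, a margin of 0.2), not the specific formulation used by any of the cited works.

```python
# Minimal sketch of joint-embedding retrieval: rank items by cosine similarity
# in a shared space; train with a sum-of-hinges triplet ranking loss.
import torch
import torch.nn.functional as F

def similarity_matrix(img_emb, txt_emb):
    """Cosine similarities between every image and every caption in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return img_emb @ txt_emb.t()          # shape (B_img, B_txt)

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Hinge ranking loss over in-batch negatives (assumes paired batches)."""
    scores = similarity_matrix(img_emb, txt_emb)
    pos = scores.diag().view(-1, 1)                      # matching pairs
    cost_txt = (margin + scores - pos).clamp(min=0)      # image as query
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption as query
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

# At test time, retrieval ranks all images (or captions) by their similarity
# to the query and returns the top-ranked items.
```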