Mix and Match Networks: Cross-Modal Alignment for Zero-Pair Image-to-Image Translation

Yaxing Wang¹ · Luis Herranz¹ · Joost van de Weijer¹

Received: 1 March 2019 / Accepted: 12 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Communicated by Chen Change Loy.

¹ Computer Vision Center Barcelona, Edifici O, Campus UAB, 08193 Bellaterra, Spain
Correspondence: Yaxing Wang [email protected] · Luis Herranz [email protected] · Joost van de Weijer [email protected]

Abstract

This paper addresses the problem of inferring unseen cross-modal image-to-image translations between multiple modalities. We assume that only some of the pairwise translations have been seen (i.e. trained) and infer the remaining unseen translations (where training pairs are not available). We propose mix and match networks, an approach where multiple encoders and decoders are aligned in such a way that the desired translation can be obtained by simply cascading the source encoder and the target decoder, even when they have not interacted during the training stage (i.e. unseen). The main challenge lies in the alignment of the latent representations at the bottlenecks of encoder–decoder pairs. We propose an architecture with several tools to encourage alignment, including autoencoders and robust side information and latent consistency losses. We show the benefits of our approach in terms of effectiveness and scalability compared with other pairwise image-to-image translation approaches. We also propose zero-pair cross-modal image translation, a challenging setting where the objective is inferring semantic segmentation from depth (and vice versa) without explicit segmentation–depth pairs, and only from two disjoint segmentation–RGB and depth–RGB training sets. We observe that a certain part of the shared information between unseen modalities might not be reachable, so we further propose a variant that leverages pseudo-pairs which allows us to exploit this shared information between the unseen modalities.

Keywords: Image-to-image translation · Multi-domain · Multi-modal · Feature alignment · Mix and match networks · Zero-pair translation · Semantic segmentation · Depth estimation · Deep learning
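To make the cascading idea concrete, the following is a minimal PyTorch sketch (not the authors' released code) of zero-pair inference by composing per-modality encoders and decoders through a shared latent space. The module definitions, channel sizes, class count, and modality names are illustrative assumptions; the paper's full model additionally relies on autoencoders, side information, and latent consistency losses to align the bottlenecks during training.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Maps a modality-specific input to the shared latent space.
    def __init__(self, in_ch, latent_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(True),
            nn.Conv2d(64, latent_ch, 3, stride=2, padding=1), nn.ReLU(True))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    # Maps a shared latent representation back to a target modality.
    def __init__(self, out_ch, latent_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1), nn.ReLU(True),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1))
    def forward(self, z):
        return self.net(z)

# One encoder and one decoder per modality (13 segmentation classes is an
# illustrative assumption). Only the rgb<->depth and rgb<->seg pairs are
# ever trained; depth<->seg remains an unseen (zero-pair) translation.
encoders = {"rgb": Encoder(3), "depth": Encoder(1), "seg": Encoder(13)}
decoders = {"rgb": Decoder(3), "depth": Decoder(1), "seg": Decoder(13)}

def translate(x, src, dst):
    # Zero-pair inference: cascade the source encoder with the target
    # decoder; latent-space alignment is what makes this composition valid.
    return decoders[dst](encoders[src](x))

depth = torch.randn(1, 1, 128, 128)            # dummy depth map
seg_logits = translate(depth, "depth", "seg")  # unseen depth -> seg pair

In this sketch the encoders and decoders are interchangeable precisely because they all map to and from the same latent tensor shape; the training-time losses described in the paper are what make the latent contents, not just the shapes, compatible.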

1 Introduction

For many computer vision applications, the task is to estimate a mapping between an input image and an output image. This family of methods is often known as image-to-image translations (image translations hereinafter). They include transformations between different modalities, such as from RGB to depth (Liu et al. 2016), or domains, such as luminance to color images (Zhang et al. 2016), or editing operations such as artistic style changes (Gatys et al. 2016). These mappings can also include other 2D representations such as semantic segmentations (Long et al. 2015) or surface normals (Eigen and Fergus 2015). One drawback of the initial research on image translations is that the methods required paired data to train the mapping between the domains (Long et al. 2015; Eigen and Fergus 2015; Isola et al. 2017). Another class o