Densifying Supervision for Fine-Grained Visual Comparisons



Aron Yu · Kristen Grauman

Received: 1 May 2019 / Accepted: 23 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.

Abstract

Detecting subtle differences in visual attributes requires inferring which of two images exhibits a property more, e.g., which face is smiling slightly more, or which shoe is slightly more sporty. While valuable for applications ranging from biometrics to online shopping, fine-grained attributes are challenging to learn. Unlike traditional recognition tasks, the supervision is inherently comparative. Thus, the space of all possible training comparisons is vast, and learning algorithms face a sparsity-of-supervision problem: it is difficult to curate enough subtly different image pairs for each attribute of interest. We propose to overcome this problem by densifying the space of training images with attribute-conditioned image generation. The main idea is to create synthetic but realistic training images exhibiting slight modifications of the attribute(s), obtain their comparative labels from human annotators, and use the labeled image pairs to augment real image pairs when training ranking functions for the attributes. We introduce two variants of our idea. The first passively synthesizes training images by "jittering" individual attributes in real training images. Building on this idea, our second model actively synthesizes training image pairs that would confuse the current attribute model, training both the attribute ranking functions and a generation controller simultaneously in an adversarial manner. For both models, we employ a conditional Variational Autoencoder (CVAE) to perform image synthesis. We demonstrate the effectiveness of bootstrapping imperfect image generators to counteract supervision sparsity in learning-to-rank models. Our approach yields state-of-the-art performance on challenging datasets from two distinct domains.

Keywords Fine-grained · Ranking · Image generation · Relative attributes
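To make the learning-to-rank setup concrete, the sketch below trains a pairwise attribute ranker on real comparison pairs augmented with generator-synthesized, human-labeled pairs. It is a minimal illustration under our own naming, not the authors' implementation: AttributeRanker, train_step, real_pairs, and synthetic_pairs are hypothetical, and a standard margin ranking loss stands in for the paper's particular objective.

    # Minimal sketch (assumed names, not the authors' code): a pairwise
    # attribute ranker trained on real comparison pairs plus synthetic
    # "jittered" pairs produced by an attribute-conditioned generator.
    import torch
    import torch.nn as nn

    class AttributeRanker(nn.Module):
        """Scores an image for one attribute; a higher score means the
        attribute is exhibited more strongly."""
        def __init__(self, backbone: nn.Module, feat_dim: int):
            super().__init__()
            self.backbone = backbone             # any CNN feature extractor
            self.score = nn.Linear(feat_dim, 1)  # linear ranking head

        def forward(self, x):
            return self.score(self.backbone(x)).squeeze(-1)

    def train_step(ranker, optimizer, real_pairs, synthetic_pairs):
        # Each pair is (img_a, img_b, y): batched image tensors plus a
        # label tensor, y = +1 if img_a shows the attribute more, else -1.
        criterion = nn.MarginRankingLoss(margin=1.0)
        pairs = list(real_pairs) + list(synthetic_pairs)
        loss = torch.zeros(())
        for img_a, img_b, y in pairs:
            loss = loss + criterion(ranker(img_a), ranker(img_b), y)
        loss = loss / len(pairs)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The point the abstract makes is that synthetic_pairs, though produced by an imperfect generator, densify the otherwise sparse space of subtly different comparisons; the active variant would additionally steer generation toward the pairs that most confuse the current ranker.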

1 Introduction

Attributes are visual properties describable in words, capturing anything from material properties (metallic, furry), shapes (flat, boxy), and expressions (smiling, surprised), to functions (sittable, drinkable). Since their introduction to the recognition community (Farhadi et al. 2009; Kumar et al. 2008; Lampert et al. 2009), attributes have inspired a number of useful applications in image search (Cai et al. 2015; Kovashka and Grauman 2013; Kovashka et al. 2012; Kumar et al. 2008; Siddiquie et al. 2011), biometrics (Chen et al. 2013; Kalayeh et al. 2017; Reid and Nixon 2014), and language-based supervision for recognition (Biswas and Parikh 2013; Demirel et al. 2017; Lampert et al. 2009; Parikh and Grauman 2011; Shrivastava et al. 2012; Yao et al. 2017). Existing attribute models come in one of two forms: categorical