

RESEARCH PAPER

Sci China Inf Sci, February 2021, Vol. 64(2): 120102:1–120102:12, https://doi.org/10.1007/s11432-020-2900-x

Special Focus on Deep Learning for Computer Vision

Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation

Fengling MAO1,2, Bingpeng MA3*, Hong CHANG2,3, Shiguang SHAN2,3,4 & Xilin CHEN2,3

1 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China;
2 Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
3 University of Chinese Academy of Sciences, Beijing 100049, China;
4 CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai 200031, China

Received 21 January 2020/Revised 8 March 2020/Accepted 26 April 2020/Published online 17 November 2020

Abstract  For a given text, previous text-to-image synthesis methods commonly employ a multistage generation model to produce high-resolution images in a coarse-to-fine manner. However, these methods ignore the interaction among stages, and they do not constrain the cross-sample relations of the images generated at different stages to be consistent. These deficiencies result in inefficient generation and discrimination. In this study, we propose an interstage cross-sample similarity distillation model based on a generative adversarial network (GAN) for learning efficient text-to-image synthesis. To strengthen the interaction among stages, we perform interstage knowledge distillation from the refined stage to the coarse stages with novel interstage cross-sample similarity distillation blocks. To enhance the constraint on the cross-sample relations of the images generated at different stages, we conduct cross-sample similarity distillation among the stages. Extensive experiments on the Oxford-102 and Caltech-UCSD Birds-200-2011 (CUB) datasets show that our model generates visually pleasing images and achieves performance quantitatively comparable to state-of-the-art methods.

Keywords  generative adversarial network (GAN), text-to-image synthesis, knowledge distillation

Citation Mao F L, Ma B P, Chang H, et al. Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation. Sci China Inf Sci, 2021, 64(2): 120102, https://doi.org/10.1007/s11432-020-2900-x
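To make the idea in the abstract concrete, the following is a minimal PyTorch sketch of a cross-sample similarity distillation loss: each stage's batch of features is reduced to a pairwise similarity matrix over samples, and the coarse stage is trained to match the refined (teacher) stage's matrix. The function names, the cosine-similarity measure, and the MSE matching loss are our own illustrative assumptions; the paper's exact formulation is given in its method section.

    import torch
    import torch.nn.functional as F

    def cross_sample_similarity(feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) feature maps from one generation stage.
        v = feat.flatten(start_dim=1)   # (B, C*H*W) one vector per sample
        v = F.normalize(v, dim=1)       # unit length, so dot product = cosine
        return v @ v.t()                # (B, B) cross-sample similarity matrix

    def interstage_distill_loss(coarse_feat, refined_feat):
        # The refined-stage (teacher) matrix is detached, so gradients only
        # update the coarse stage: knowledge flows from refined to coarse.
        s_student = cross_sample_similarity(coarse_feat)
        s_teacher = cross_sample_similarity(refined_feat).detach()
        return F.mse_loss(s_student, s_teacher)

In a hypothetical training loop, this term would simply be added to the coarse generator's adversarial objective with a weighting coefficient, e.g. loss = loss_gan + lam * interstage_distill_loss(feat_stage1, feat_stage3).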

1 Introduction

Image generation [1–3] has achieved remarkable progress owing to the flourishing development of deep learning. Many applications of image generation [4–11], such as style transfer [5], video generation [6], image-to-image translation [8,9], image inpainting [7], and text-to-image synthesis [12–17], have attracted increasing attention. Given a text description, the text-to-image synthesis task aims to produce images that are of high quality and semantically consistent with that text.

Several methods [12–17] for text-to-image synthesis have been proposed. Reed et al. [12] proposed the classic single-stage generative adversarial network (GAN) framework based
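For illustration, below is a minimal sketch of such a single-stage text-conditioned generator in PyTorch: a sentence embedding is projected to a conditioning vector, concatenated with a noise vector, and decoded to a 64x64 image. The class name, layer widths, and embedding sizes are our own assumptions, not the architecture of [12].

    import torch
    import torch.nn as nn

    class TextConditionedGenerator(nn.Module):
        def __init__(self, text_dim=1024, cond_dim=128, noise_dim=100):
            super().__init__()
            # Project the pretrained sentence embedding to a compact condition.
            self.embed = nn.Sequential(nn.Linear(text_dim, cond_dim),
                                       nn.LeakyReLU(0.2, inplace=True))
            # Upsample a (cond + noise) vector from 1x1 to a 64x64 RGB image.
            self.decode = nn.Sequential(
                nn.ConvTranspose2d(cond_dim + noise_dim, 512, 4, 1, 0),  # 4x4
                nn.BatchNorm2d(512), nn.ReLU(True),
                nn.ConvTranspose2d(512, 256, 4, 2, 1),                   # 8x8
                nn.BatchNorm2d(256), nn.ReLU(True),
                nn.ConvTranspose2d(256, 128, 4, 2, 1),                   # 16x16
                nn.BatchNorm2d(128), nn.ReLU(True),
                nn.ConvTranspose2d(128, 64, 4, 2, 1),                    # 32x32
                nn.BatchNorm2d(64), nn.ReLU(True),
                nn.ConvTranspose2d(64, 3, 4, 2, 1),                      # 64x64
                nn.Tanh(),
            )

        def forward(self, text_emb, noise):
            cond = self.embed(text_emb)              # (B, cond_dim)
            z = torch.cat([cond, noise], dim=1)      # (B, cond_dim + noise_dim)
            return self.decode(z.unsqueeze(-1).unsqueeze(-1))

Multistage methods extend this scheme by stacking further generators that refine the 64x64 output to higher resolutions, which is the coarse-to-fine pipeline the present paper builds on.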