Multi-center convolutional descriptor aggregation for image retrieval


ORIGINAL ARTICLE

Jie Zhu1 · Shufang Wu2 · Hong Zhu3 · Yan Li4 · Li Zhao3

Received: 15 May 2018 / Accepted: 26 November 2018
© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract  Recent works have demonstrated that convolutional descriptor aggregation can provide state-of-the-art performance for image retrieval. In this paper, we propose a multi-center convolutional descriptor aggregation (MCDA) method to produce a global image representation for image retrieval. We first present a feature map center selection method to eliminate the background information in the feature maps. We then propose channel weighting and spatial weighting schemes based on the centers to boost the effect of the features on the object. Finally, the weighted convolutional descriptors are aggregated to represent images. Experiments demonstrate that MCDA can produce state-of-the-art retrieval performance, and that the generated activation map is also effective for object localization.

Keywords  Multi-center · Descriptor aggregation · Feature map · Feature weighting

* Shufang Wu

1 Department of Information Management, The National Police University for Criminal Justice, Baoding, China
2 College of Management, Hebei University, Baoding, China
3 College of Computer Science and Software Engineering, Shen Zhen University, Shenzhen, China
4 School of Applied Mathematics, Beijing Normal University Zhuhai, Zhuhai, China

1 Introduction

Image retrieval has evolved rapidly over the last decade. Many existing methods adopt low-level descriptors and encode them using bag-of-words (BoW) or other methods. After the seminal work of Krizhevsky et al. [1], deep learning has demonstrated its advantages in many areas of artificial intelligence [2–5]. Many works have applied pre-trained convolutional neural network (CNN) models to extract generic features for image retrieval and obtained excellent performance [5–7]. In all these methods, the activations in the convolutional or pooling layers, which can capture semantic features, are used to represent images. Usually there are three steps. First, the descriptors are extracted and selected. Second, these descriptors are aggregated to represent images. Finally, the retrieval results are obtained by calculating the similarities between images. In addition, the activation map, which is generated by summing the feature maps in the same layer, is effective for describing the object region in the image.

Although CNNs have been successfully applied to image retrieval, a few questions remain unaddressed. First, the positions of the top few highest responses in a CNN activation map usually correspond to different object regions in an image, and previous work [6] demonstrated that the positions with the top few highest responses in some feature maps also correspond to the object regions. Therefore, it is questionable whether it is best to use the responses in the feature maps