VLAD Is not Necessary for CNN

Abstract. Global convolutional neural network (CNN) activations lack geometric invariance. To address this problem, Gong et al. proposed multi-scale orderless pooling (MOP-CNN), which extracts CNN activations for local patches at multiple scale levels and performs orderless VLAD pooling to extract features. However, we find that this method improves performance mainly because it extracts global and local representations simultaneously; VLAD pooling is not necessary, as the representations extracted by the CNN are good enough for classification. In this paper, we propose a new method to extract multi-scale CNN features, leading to a new deep learning structure. The method extracts CNN representations for local patches at multiple scale levels, concatenates the representations at each level separately, and finally concatenates the results of all levels. The CNN is trained on the ImageNet dataset to extract features and is then transferred to other datasets. Experimental results on the MIT Indoor and Caltech-101 datasets show that the proposed method outperforms MOP-CNN.

Keywords: CNN · Multi-scale · Deep learning · VLAD · Transfer learning
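The concatenation scheme described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: `toy_features` is a hypothetical stand-in for the CNN feature extractor, the patches form a non-overlapping grid, and the grid sizes (1, 2, 4) and feature dimension (8) are arbitrary. Only the two-stage concatenation (within each level, then across levels) follows the description above.

```python
import numpy as np


def toy_features(patch, dim=8):
    """Hypothetical stand-in for a CNN feature extractor (assumption).

    Splits the flattened patch into `dim` chunks and returns their means,
    giving a fixed-length vector regardless of patch size.
    """
    chunks = np.array_split(patch.astype(np.float64).ravel(), dim)
    return np.array([chunk.mean() for chunk in chunks])


def level_representation(image, grid, extract=toy_features):
    """Concatenate the features of all grid x grid patches at one level."""
    height, width = image.shape[:2]
    ph, pw = height // grid, width // grid
    feats = []
    for i in range(grid):
        for j in range(grid):
            patch = image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            feats.append(extract(patch))
    return np.concatenate(feats)


def multiscale_representation(image, grids=(1, 2, 4), extract=toy_features):
    """Concatenate the per-level representations of all scale levels."""
    levels = [level_representation(image, g, extract) for g in grids]
    return np.concatenate(levels)


# Example input: a synthetic 64 x 64 single-channel "image".
image = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
rep = multiscale_representation(image)
print(rep.shape)  # (168,), i.e. 8 features x (1 + 4 + 16) patches
```

With a grid of 1 the single patch is the whole image, so the first block of the vector plays the role of the global representation, and the finer grids contribute the local ones. Unlike VLAD pooling, plain concatenation keeps the patch order, and the vector length grows with the number of patches.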

© Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part III, LNCS 9915, pp. 492–499, 2016. DOI: 10.1007/978-3-319-49409-8_41

1 Introduction

Image classification [1–5] is one of the most important research tasks in computer vision and pattern recognition. Choosing the right features plays a key role in a recognition system. There are many feature descriptors, such as SIFT [6] and HOG [7], but they must be carefully designed by hand, which is time-consuming and does not always yield the best features. Many studies show that the best-performing recognition models learn their features from raw data in an unsupervised manner. Recently, deep convolutional neural networks (CNNs) have been considered a powerful class of models for image recognition problems [8–11]. The feature representation learned by these networks achieves state-of-the-art performance not only on the task for which the network was trained, but also on various other classification tasks. Many recent works [12–14] have shown that a feature representation trained on a large dataset can be successfully transferred to other visual tasks, for example classification on Caltech-101 [15] and Caltech-256 [5], and scene recognition on the Pascal VOC 2007 and 2012 [12] databases. However, global CNN activations lack geometric invariance, which limits their performance on highly variable scenes. Gong et al. [16] proposed a simple scheme called multi-scale orderless pooling CNN (MOP-CNN) to solve this problem, which combines activations extracted at multiple local image windows. The main idea of MOP-CNN is to extract features from local patches via a CNN at multiple scales, then adopt the Vector of Locally Aggregated Descriptors (VLAD) [17, 18] to en