Multi-scale Deep Residual Networks for Fine-Grained Image Classification

Fine-grained image classification aims at distinguishing very similar images, i.e., the subcategories in one class. Compared with generic object recognition, fine-grained image classification is much more challenging due to the small inter-class variance.

  • PDF / 2,693,174 Bytes
  • 13 Pages / 439.37 x 666.14 pts Page_size
  • 73 Downloads / 205 Views

DOWNLOAD

REPORT


1

School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China {wangxiangyang,xqzhu}@shu.edu.cn, [email protected], [email protected], [email protected] 2 School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China [email protected]

Abstract. Fine-grained image classification aims at distinguishing very similar images, i.e., the subcategories in one class. Compared with generic object recognition, fine-grained image classification is much more chal‐ lenging due to the small inter-class variance. Deep Residual Networks (ResNet) is a recently proposed deep Convolution Neural Networks (CNN) model, and has achieved the excellent performance on image classification. Though powerful, like other contemporary CNN models, ResNet only exploits the features extracted from the last output layer for classification, which may be insufficient for fine-grained classification. In this paper, we propose a Multi-scale Residual Networks (Multi-scale ResNet) to further improve the fine-grained image classification performance. Based on the ResNet model, we extract features from multiple CNN layers, add these highlevel and mid-level features together with different weights for final classi‐ fication. We compare our proposed model with some state-of-the-art models on two fine-grained image dataset, Stanford Cars and Dogs, and experi‐ mental results validate the efficacy of our method. Keywords: Convolution Neural Networks (CNN) · Residual learning · Multiscale residual networks · Fine-grained image classification

1

Introduction

Fine-grained visual recognition is a challenging problem in computer vision community, which aims at recognizing the subcategories in one object class. For instance, it may be practically useful to recognize the dog or bird species, or car models [1, 2, 6]. Such objects are both semantically and visually similar to each other. Fine-grained image classification has attracted much attention in the past few years [3–5]. Usually, fine-grained recognition is more difficult than common image classifi‐ cation. For example, fine-grained categorization of car models (or bird species) would be a more difficult task than distinguishing people from dogs. Fine-grained recognition © Springer Nature Singapore Pte Ltd. 2017 X. Yang and G. Zhai (Eds.): IFTC 2016, CCIS 685, pp. 205–217, 2017. DOI: 10.1007/978-981-10-4211-9_21

206

X. Wang et al.

is quite challenging because visual differences between the categories are small, and the recognition accuracy may be affected by factors such as pose, viewpoint, or location of the object in the image. Convolutional Neural Networks (CNNs). In recent years, deep Convolutional Neural Networks (CNNs) [7, 32] have gained great success in the area of machine learning and computer vision, especially because of their surprising achievements in image classifi‐ cation tasks [8, 9]. From the AlexNet [8] with 8 layers to the VGG net [10] with 16 or 19 layers, the CNN classification accuracy increase