Online Adaptation for Joint Scene and Object Classification




Abstract. Recent efforts in computer vision consider joint scene and object classification by exploiting mutual relationships (often termed as context) between them to achieve higher accuracy. On the other hand, there is also a lot of interest in online adaptation of recognition models as new data becomes available. In this paper, we address the problem of how models for joint scene and object classification can be learned online. A major motivation for this approach is to exploit the hierarchical relationships between scenes and objects, represented as a graphical model, in an active learning framework. To select the samples on the graph, which need to be labeled by a human, we use an information theoretic approach that reduces the joint entropy of scene and object variables. This leads to a significant reduction in the amount of manual labeling effort for similar or better performance when compared with a model trained with the full dataset. This is demonstrated through rigorous experimentation on three datasets.

Keywords: Scene classification · Object detection · Active learning

1 Introduction

Scene classification and object detection are two challenging problems in computer vision due to high intra-class variance, illumination changes, background clutter, and occlusion. Most existing methods assume that labeled data is available beforehand to train the classification models. With the huge corpus of visual data being generated daily, it becomes infeasible and unrealistic to know all the labels in advance. Moreover, adaptability of the models to incoming data is also crucial for long-term performance guarantees. Currently, the big datasets (e.g., ImageNet [1], SUN [2]) are prepared with intensive human labeling, which is difficult to scale up as more and more new images are generated. So, we pose a question: 'Are all the samples equally important to manually label and learn a model from?' We address this question in the context of joint scene and object classification.

Electronic supplementary material The online version of this chapter (doi:10.1007/978-3-319-46484-8_14) contains supplementary material, which is available to authorized users.

© Springer International Publishing AG 2016
B. Leibe et al. (Eds.): ECCV 2016, Part VIII, LNCS 9912, pp. 227–243, 2016. DOI: 10.1007/978-3-319-46484-8_14


J.H. Bappy et al.

Fig. 1. This figure illustrates the motivation for incorporating relationships among scene and object samples within an image. Here, the scene (S) and objects (O1, O2, ..., O6) are predicted by our initial classifier and detectors with some uncertainty. We formulate a graph exploiting scene-object (S-O) and object-object (O-O) relationships. As shown in the figure, even though the nodes {S, O2, O3, O4, O5, O6} have high uncertainty, manually labeling only 3 of them is sufficient to reduce the uncertainty of all the nodes when the S-O and O-O relationships are considered. Thus, our proposed approach can significantly reduce the manual labeling cost.

Active learning [3