Playing for Data: Ground Truth from Computer Games

Stephan R. Richter1, Vibhav Vineet2, Stefan Roth1, and Vladlen Koltun2
1 TU Darmstadt, Darmstadt, Germany
[email protected]
2 Intel Labs, Santa Clara, USA
(S.R. Richter and V. Vineet contributed equally.)

Abstract. Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy and that the acquired data enables reducing the amount of hand-labeled real-world data: models trained with game data and just 1/3 of the CamVid training set outperform models trained on the complete CamVid training set.
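
To make the propagation idea above concrete, the following is a minimal illustrative sketch rather than the authors' pipeline: it assumes that per-pixel mesh/texture/shader ids (the hypothetical signatures array below) have been recorded from the game's draw calls, and all function and variable names are invented for illustration. Under that assumption, annotating a single patch labels every other patch rendered with the same resource combination, within the same frame and across frames.

    # Illustrative sketch only (Python): propagate semantic labels via shared
    # rendering resources.  A "signature" stands in for the mesh/texture/shader
    # combination observed in the game-to-GPU communication for each pixel.
    from typing import Dict, Tuple

    import numpy as np

    Signature = Tuple[int, int, int]  # (mesh id, texture id, shader id)

    # Labels assigned so far, shared across all frames: signature -> class name.
    label_bank: Dict[Signature, str] = {}

    def annotate_patch(signatures: np.ndarray, y: int, x: int, cls: str) -> None:
        """Record that the patch under pixel (y, x) belongs to class `cls`.

        `signatures` is an (H, W, 3) integer array giving, for every pixel, the
        ids of the mesh, texture, and shader that produced it.  One annotation
        covers every patch rendered with the same combination, in this frame and
        in all others.
        """
        label_bank[tuple(int(v) for v in signatures[y, x])] = cls

    def label_frame(signatures: np.ndarray) -> np.ndarray:
        """Return an (H, W) label map for a frame, reusing all annotations so far.

        Pixels whose signature has not been annotated yet are marked 'unlabeled'.
        """
        h, w, _ = signatures.shape
        labels = np.full((h, w), "unlabeled", dtype=object)
        for sig, cls in label_bank.items():
            labels[np.all(signatures == np.asarray(sig), axis=-1)] = cls
        return labels

    # Usage: annotate a few patches in one frame, then label_frame() on any other
    # frame labels every region drawn with the same game resources.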

1 Introduction

Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Image classification datasets with millions of labeled images support training deep and highly expressive models [24]. Following their success in image classification, these models have recently been adapted for detailed scene understanding tasks such as semantic segmentation [28]. Such semantic segmentation models are initially trained for image classification, for which large datasets are available, and then fine-tuned on semantic segmentation datasets, which have fewer images. We are therefore interested in creating very large datasets with pixel-accurate semantic labels. Such datasets may enable the design of more diverse model architectures that are not constrained by mandatory pre-training on image classification. They may also substantially increase the accuracy of semantic segmentation models, which at present appear to be limited by data rather than capacity. (For example, the top-performing semantic segmentation models on the PASCAL VOC leaderboard all use additional external sources of pixelwise labeled data for training.) Creating large datasets with pixelwise semantic labels is known to be very challenging due to the amount of human effort required to trace accurate object boundaries. High-quality semantic labeling was reported to require 60 min per image for the CamVid dataset.