Clockwork Convnets for Video Semantic Segmentation


Abstract. Recent years have seen tremendous progress in still-image segmentation; however, the naïve application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video. We propose a video recognition framework that relies on two key observations: (1) while pixels may change rapidly from frame to frame, the semantic content of a scene evolves more slowly, and (2) execution can be viewed as an aspect of architecture, yielding purpose-fit computation schedules for networks. We define a novel family of “clockwork” convnets driven by fixed or adaptive clock signals that schedule the processing of different layers at different update rates according to their semantic stability. We design a pipeline schedule to reduce latency for real-time recognition and a fixed-rate schedule to reduce overall computation. Finally, we extend clockwork scheduling to adaptive video processing by incorporating data-driven clocks that can be tuned on unlabeled video. The accuracy and efficiency of clockwork convnets are evaluated on the YouTube-Objects, NYUD, and Cityscapes video datasets.

1 Introduction

Semantic segmentation is a central visual recognition task. End-to-end convolutional network approaches have made progress on the accuracy and execution time of still-image semantic segmentation, but video semantic segmentation has received less attention. Potential applications include UAV navigation, autonomous driving, archival footage recognition, and wearable computing. The computational demands of video processing are a challenge to the simple application of image methods on every frame, while the temporal continuity of video offers an opportunity to reduce this computation. Fully convolutional networks (FCNs) [1–3] have been shown to obtain remarkable results, but the execution time of repeated per-frame processing limits application to video.

Fig. 1. Our adaptive clockwork method illustrated with the famous The Horse in Motion [9], captured by Eadweard Muybridge in 1878 at the Palo Alto racetrack. The clock controls network execution: past the first stage, computation is scheduled only at the time points indicated by the clock symbol. During static scenes cached representations persist, while during dynamic scenes new computations are scheduled and output is combined with cached representations.
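To make the adaptive clock of Fig. 1 concrete, the following is a minimal sketch, not the authors' implementation: the names adaptive_clockwork, shallow, deep, fuse, and theta are hypothetical stand-ins for the network stages and firing threshold, and the mean absolute difference is a simplified proxy for the semantic change measure.

```python
import numpy as np

def relative_change(a, b):
    # A simple proxy for semantic change between consecutive frames
    # (the actual firing criterion is a modeling choice).
    return np.abs(a - b).mean() / (np.abs(b).mean() + 1e-8)

def adaptive_clockwork(frames, shallow, deep, fuse, theta):
    """Compute the shallow stage on every frame; fire the deep stage's
    clock only when the shallow features change by more than theta."""
    prev_feat = None
    cached_deep = None
    outputs = []
    for frame in frames:
        feat = shallow(frame)  # always computed
        if cached_deep is None or relative_change(feat, prev_feat) > theta:
            cached_deep = deep(feat)  # clock fires: refresh deep features
        prev_feat = feat
        outputs.append(fuse(feat, cached_deep))  # fresh shallow + cached deep
    return outputs
```

During static scenes the change measure stays below theta, so the cached deep representation persists and only the cheap shallow stage runs, matching the behavior the caption describes.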

Adapting these networks to make use of the temporal continuity of video reduces inference computation while suffering minimal loss in recognition accuracy. The temporal rate of change of features, or feature “velocity”, varies from layer to layer: deeper layers are more semantically stable and so change more slowly over time.
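This observation suggests a fixed-rate schedule in which deeper stages are updated less often than shallow ones. As a minimal sketch under the assumption that each stage is a callable returning feature maps (the names clockwork_fixed_rate and fuse, and the example rates, are illustrative rather than the paper's exact configuration):

```python
def clockwork_fixed_rate(frames, stages, fuse, rates):
    """Run stage i only every rates[i] frames; otherwise persist its cache.

    stages: callables ordered shallow to deep; rates: update periods such
    as (1, 2, 4), so deeper (more slowly changing) stages run less often.
    """
    cache = [None] * len(stages)
    outputs = []
    for t, frame in enumerate(frames):
        x = frame
        for i, (stage, rate) in enumerate(zip(stages, rates)):
            if t % rate == 0 or cache[i] is None:
                cache[i] = stage(x)  # clock fires: recompute this stage
            x = cache[i]             # input to the next stage, fresh or stale
        outputs.append(fuse(cache))  # fuse per-stage features, as in FCN skips
    return outputs
```

Note that when a deep stage's clock is off, fresh shallow features do not propagate through it, but they still reach the output through the fusion of cached per-stage features, in the spirit of FCN skip connections.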