Point-Wise Mutual Information-Based Video Segmentation with High Temporal Consistency




Abstract. In this paper, we tackle the problem of temporally consistent boundary detection and hierarchical segmentation in videos. While finding the best high-level reasoning of region assignments in videos is the focus of much recent research, temporal consistency in boundary detection has so far only rarely been tackled. We argue that temporally consistent boundaries are a key component to temporally consistent region assignment. The proposed method is based on the point-wise mutual information (PMI) of spatio-temporal voxels. Temporal consistency is established by an evaluation of PMI-based point affinities in the spectral domain over space and time. Thus, the proposed method is independent of any optical flow computation or previously learned motion models. The proposed low-level video segmentation method outperforms the learning-based state of the art in terms of standard region metrics.
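The affinity at the core of the method is the point-wise mutual information (PMI) between feature observations at pairs of spatio-temporal voxels, following the image-segmentation affinity of [18]: PMI_ρ(A, B) = log(P(A, B)^ρ / (P(A)·P(B))). The sketch below is illustrative only; the exponent `rho`, the smoothing constant, and the way the densities are obtained are assumptions here, whereas the paper estimates the joint and marginal densities from the video itself.

```python
import numpy as np

def pmi_affinity(p_joint, p_a, p_b, rho=1.25):
    """PMI-style affinity between two feature observations A and B.

    Implements PMI_rho(A, B) = log( P(A, B)**rho / (P(A) * P(B)) ).
    An exponent rho > 1 down-weights feature pairs that are individually
    very common, so high affinity requires the pair to co-occur more often
    than its marginals alone would suggest.  The value rho=1.25 is an
    illustrative choice, not taken from the paper.
    """
    eps = 1e-12  # guards against log(0) for unseen feature pairs
    return np.log((p_joint ** rho + eps) / (p_a * p_b + eps))
```

For rho = 1 this reduces to classical point-wise mutual information: two features that co-occur ten times more often than chance (e.g. P(A, B) = 0.1 with P(A) = P(B) = 0.1) receive affinity log 10.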

1 Introduction

Accurate video segmentation is an important step in many high-level computer vision tasks. It can provide, for example, window proposals for object detection [12,27] or action tubes for action recognition [13,14]. One of the key challenges in video segmentation is handling the large amount of data. Traditionally, methods either build upon some fine-grained image segmentation [2] or supervoxel [33] method [9,10,21,22,35], or they consist of grouping previously computed point trajectories (e.g. [19,24]) and transforming them into dense segmentations in a postprocessing step [25]. The latter is well suited for motion segmentation applications but has general issues with segmenting non-moving, or only slightly moving, objects. Indeed, image segmentation into small segments forms the basis for many high-level video segmentation methods like [9,22,35]. A key question when employing such preprocessing is the error it introduces. While state-of-the-art image segmentation methods [2,3,18] offer highly precise boundary localization, they usually suffer from low temporal consistency, i.e., the superpixel shapes and sizes can change drastically from one frame to the next. This causes undesired flickering effects in high-level segmentation methods. In this paper, we present a low-level video segmentation method that aims at producing spatio-temporal superpixels with high temporal consistency in a

© Springer International Publishing Switzerland 2016. G. Hua and H. Jégou (Eds.): ECCV 2016 Workshops, Part III, LNCS 9915, pp. 789–803, 2016. DOI: 10.1007/978-3-319-49409-8_65


M. Keuper and T. Brox

Fig. 1. Results of the proposed hierarchical video segmentation method for frames 4, 14, and 24 of the ballet sequence from VSB100 [10]. The segmentation is displayed in a hot color map. Note that corresponding contours have exactly the same value. Segmentations at different thresholds in this contour map are segmentations of the spatio-temporal volume.
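The hierarchical property noted in the caption of Fig. 1, namely that cutting the contour map at different values yields nested segmentations, can be sketched as follows. The toy contour map and threshold values are hypothetical; the paper's contour maps arise from the spectral analysis of PMI affinities, not from this construction.

```python
import numpy as np
from scipy import ndimage

def segment_at_threshold(ucm, thresh):
    """Segment a contour map at one threshold level.

    Pixels whose contour strength falls below `thresh` are grouped into
    4-connected regions.  Raising the threshold removes weaker contours
    first, so regions only ever merge, which yields a nested hierarchy
    of segmentations from a single contour map.
    """
    interior = ucm < thresh              # non-boundary pixels at this level
    labels, n_regions = ndimage.label(interior)
    return labels, n_regions

# Toy example: a single vertical contour of strength 0.8.
ucm = np.zeros((5, 5))
ucm[:, 2] = 0.8
_, n_low = segment_at_threshold(ucm, 0.5)   # contour survives: 2 regions
_, n_high = segment_at_threshold(ucm, 0.9)  # contour removed: 1 region
```

Because corresponding contours carry exactly the same value across frames (as stated in the caption), applying one threshold to the whole spatio-temporal volume produces temporally consistent region boundaries.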

bottom-up way (Fig. 1). To this end, we employ an affinity measure that has recently been proposed for image segmentation [18]. While other, learning-based methods such as [3] slightl