Semantic Video Segmentation From Occlusion Relations Within a Convex Optimization Framework


We describe an approach to incorporate scene topology and semantics into pixel-level object detection and localization. Our method requires video, from which it determines occlusion regions and hence local depth ordering, together with any visual recognition scheme that provides a score at local image regions, for instance object detection probabilities. We set up a cost functional that incorporates occlusion cues induced by object boundaries, label consistency, and recognition priors, and solve it using a convex optimization scheme. We show that our method improves the localization accuracy of existing recognition approaches, or equivalently provides semantic labels for pixel-level localization and segmentation.
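Schematically, a cost functional of this general type (our illustrative notation, not the paper's exact formulation) combines a per-pixel recognition term with a boundary-modulated label-consistency term:

\[
E(u) \;=\; \underbrace{\sum_{x}\sum_{\ell} -\log p_\ell(x)\, u_\ell(x)}_{\text{recognition prior}}
\;+\; \underbrace{\sum_{x} g(x)\,\|\nabla u(x)\|}_{\text{label consistency}},
\qquad u(x) \in \Delta,
\]

where \(p_\ell(x)\) is the classifier score for label \(\ell\) at pixel \(x\), and the edge weight \(g(x)\) is reduced across occlusion-induced object boundaries so that label changes are cheap exactly where the video evidence indicates an object edge. Relaxing the indicator labeling \(u(x)\) to the probability simplex \(\Delta\) makes the minimization convex.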

Framework Overview

A simple approach to semantic video segmentation is to apply a semantic image classifier to each frame and smooth the result (e.g. with a conditional random field). However, this approach makes poor use of video information: it can cause semantic labels to bleed across object boundaries, and it sometimes misses small objects altogether. In contrast, our approach leverages the additional frames to use a motion-based cue (occlusions) for finding object boundaries to aid the semantic segmentation task. We combine a generic image-based classifier with object boundary cues from occlusions to produce both the semantic labeling and a depth-layer labeling (which roughly describes an object's distance from the viewer relative to other objects in the scene). This allows us to not only say where people appear in the image, but also that there are two people, and that one is closer to the viewer than the other.
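As a much-simplified illustration of combining classifier scores with occlusion-weakened smoothing, the sketch below is our own toy example, not the paper's implementation. It relaxes a 1-D two-class labeling to the probability simplex and minimizes (by projected gradient descent) a convex energy: per-pixel classifier costs plus a quadratic label-consistency term whose weight is lowered at a detected occlusion boundary.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of each row of v onto the probability simplex
    # (standard sort-based algorithm).
    u = np.sort(v, axis=1)[:, ::-1]
    css = np.cumsum(u, axis=1) - 1.0
    idx = np.arange(1, v.shape[1] + 1)
    rho = (u - css / idx > 0).sum(axis=1)
    theta = css[np.arange(v.shape[0]), rho - 1] / rho
    return np.maximum(v - theta[:, None], 0.0)

def segment(unary, boundary_weight, steps=500, lr=0.05):
    """Minimize sum_i <u_i, unary_i> + sum_i w_i ||u_{i+1} - u_i||^2
    over per-pixel label distributions u_i on the simplex (convex)."""
    n, k = unary.shape
    u = np.full((n, k), 1.0 / k)
    for _ in range(steps):
        grad = unary.copy()
        d = u[1:] - u[:-1]                      # forward differences
        grad[1:] += 2.0 * boundary_weight[:, None] * d
        grad[:-1] -= 2.0 * boundary_weight[:, None] * d
        u = project_simplex(u - lr * grad)
    return u.argmax(axis=1)

# Toy example: 10 pixels, 2 classes; the classifier is noisy at pixel 2,
# and an occlusion boundary between pixels 4 and 5 lowers the smoothing
# weight there so the label change is cheap.
unary = np.array([[0, 1], [0, 1], [1, 0], [0, 1], [0, 1],
                  [1, 0], [1, 0], [1, 0], [1, 0], [1, 0]], dtype=float)
w = np.ones(9)
w[4] = 0.05                                     # occlusion boundary
print(segment(unary, w).tolist())               # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

The strong smoothing overrules the noisy classifier score at pixel 2, while the weak weight at the occlusion boundary keeps the transition between the two regions sharp. The paper's 2-D formulation and solver are more involved; this only conveys the structure of the trade-off.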

Sample Classification Outcomes

Above are outcomes on the Moseg dataset [1]. Here we tested our system with 3 object classes (person in red, car in blue, and background uncolored). The columns from left to right show the original frame overlaid with the labels from the ground-truth annotation, a quickly-trained TextonBoost classifier [3], a hierarchical CRF superpixel-based classifier [4], and our approach. Since only moving objects are annotated in the ground truth, we excluded static regions (light green) from the evaluation to avoid unfairly penalizing the comparison methods, which operate on single images.

Above are outcomes on the CamVid dataset [2], where we evaluated our system on 11 semantic categories shown in the legend. From left to right we similarly show the labels from the ground truth, TextonBoost [3], the hierarchical classifier [4], and our approach. Since our method leverages a motion-based cue (occlusions) for object detection, we can only improve over single-frame semantic classification approaches on object categories (i.e. car, pedestrian, bicyclist, column-pole, and sign-symbol). We highlight the regions showing improvement on object classification (here, cars) with white boxes.

For further details, please see our paper:


  1. T. Brox and J. Malik. Object segmentation by long term analysis of point trajectories. ECCV 2010

  2. G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. ECCV 2008

  3. J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV 2006

  4. L. Ladicky, C. Russell, P. Kohli, and P. Torr. Associative hierarchical CRFs for object class image segmentation. ICCV 2009

If you have any questions, please contact Brian Taylor.