Discovering Discriminative Action Parts from
Mid-Level Video Representations

Michalis Raptis (Disney Research), Iasonas Kokkinos (Ecole Centrale Paris), Stefano Soatto (UCLA)

Appeared at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012


We describe a mid-level approach for action recognition. From an input video, we extract salient spatio-temporal structures by forming clusters of trajectories that serve as candidates for the parts of an action. The assembly of these clusters into an action class is governed by a graphical model that incorporates appearance and motion constraints for the individual parts and pairwise constraints for the spatio-temporal dependencies among them. During training, we estimate the model parameters discriminatively. During classification, we efficiently match the model to a video using discrete optimization. We validate the model's classification ability on standard benchmark datasets and illustrate its potential to support a fine-grained analysis that not only assigns a label to a video, but also identifies and localizes its constituent parts.
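The matching step described above can be illustrated with a minimal sketch. It assumes that each part has a precomputed unary score for every candidate trajectory cluster, that pairwise scores are supplied by a user-defined function, and that each part must take a distinct cluster. All names (`match_parts`, `unary`, `pairwise`, `edges`) are hypothetical, and exhaustive search stands in for the discrete optimization used in the paper, which is not specified here.

```python
from itertools import permutations

def match_parts(unary, pairwise, edges):
    """Assign one candidate cluster to each part by exhaustive search.

    unary[p][c]          : score for assigning cluster c to part p (assumed given)
    pairwise(p, q, c, d) : score for parts p, q taking clusters c, d (assumed given)
    edges                : list of (p, q) part pairs linked in the model
    Returns (best_score, best_assignment).
    """
    n_parts = len(unary)
    n_clusters = len(unary[0])
    best_score, best_assign = float("-inf"), None
    # Assumption: distinct parts are matched to distinct clusters.
    for assign in permutations(range(n_clusters), n_parts):
        score = sum(unary[p][assign[p]] for p in range(n_parts))
        score += sum(pairwise(p, q, assign[p], assign[q]) for p, q in edges)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign
```

Under these assumptions, a video could be labeled by running the matcher once per action class and keeping the class with the highest score; the winning assignment then indicates which clusters play the role of each part. Exhaustive search is practical only for a few parts and a modest number of candidate clusters.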

Supplementary Video

The following video contains examples from the test sets of the Hollywood1 Human Action dataset (HOHA) and the UCF-Sports dataset. The extracted trajectories are superimposed on the images. Trajectories not associated with a node of the model are plotted only at their current location as white dots. The groups of trajectories associated with the nodes of the proposed model are colored yellow, blue, or green, depending on which of the 3 parts of the action they represent. For clarity of visualization, each frame shows only the 3 most recent samples of each track. The manually annotated bounding boxes are shown in cyan.
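The plotting convention described above can be sketched as follows. This is an illustration only, not the code used to render the video: the data layout (`tracks` as a mapping from track id to per-frame positions, `part_of_track` as a mapping from track id to part index) and the order in which colors are assigned to part indices are assumptions.

```python
# Assumed mapping from part index to color; the actual order is not stated.
PART_COLORS = {0: "yellow", 1: "blue", 2: "green"}

def points_to_draw(tracks, part_of_track, frame, n_recent=3):
    """Select the trajectory samples to draw on a given frame.

    tracks        : dict track_id -> dict frame_index -> (x, y)
    part_of_track : dict track_id -> part index; unassigned tracks are absent
    Returns a list of (color, [points]) entries, one per visible track.
    """
    out = []
    for tid, samples in tracks.items():
        if frame not in samples:
            continue  # track is not alive on this frame
        part = part_of_track.get(tid)
        if part is None:
            # Unassigned tracks: a single white dot at the current location.
            out.append(("white", [samples[frame]]))
        else:
            # Assigned tracks: only the n_recent most recent samples.
            recent = sorted(f for f in samples if f <= frame)[-n_recent:]
            out.append((PART_COLORS[part], [samples[f] for f in recent]))
    return out
```

The returned list can be passed to any drawing routine; bounding-box rendering is left out of the sketch.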

Hollywood Human Action Dataset 1 Annotation


Trajectory Grouping - Video Motion Segmentation - Trajectory Descriptors Code

(Code ver1.1)