Michalis Raptis (Disney Research), Iasonas Kokkinos (Ecole Centrale Paris), Stefano Soatto (UCLA)
Appeared at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
Abstract

We describe a mid-level approach for action recognition. From an input video, we extract salient spatio-temporal structures by forming clusters of trajectories that serve as candidates for the parts of an action. The assembly of these clusters into an action class is governed by a graphical model that incorporates appearance and motion constraints for the individual parts and pairwise constraints for the spatio-temporal dependencies among them. During training, we estimate the model parameters discriminatively. During classification, we efficiently match the model to a video using discrete optimization. We validate the model's classification ability on standard benchmark datasets and illustrate its potential to support a fine-grained analysis that not only gives a label to a video, but also identifies and localizes its constituent parts.
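The matching step described in the abstract, assigning model parts to candidate trajectory clusters by minimizing unary (appearance/motion) and pairwise (spatio-temporal) costs, can be illustrated with a toy sketch. This is not the authors' implementation: the cost tables are random placeholders, and the exhaustive enumeration over assignments stands in for the efficient discrete optimization used in the paper.

```python
from itertools import permutations
import numpy as np

def match_parts(unary, pairwise, n_parts):
    """Assign each of n_parts model parts to a distinct candidate cluster.

    unary[p, c]           : cost of placing part p at cluster c
    pairwise[(p, q)][a, b]: cost of part p at cluster a and part q at cluster b

    Returns the minimum-energy assignment, found here by brute-force
    enumeration (a stand-in for the paper's discrete optimization).
    """
    n_clusters = unary.shape[1]
    best_cost, best_assign = float("inf"), None
    for assign in permutations(range(n_clusters), n_parts):
        cost = sum(unary[p, assign[p]] for p in range(n_parts))
        for (p, q), table in pairwise.items():
            cost += table[assign[p], assign[q]]
        if cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost

# Toy example: 3 action parts, 4 candidate trajectory clusters.
rng = np.random.default_rng(0)
unary = rng.random((3, 4))
pairwise = {(0, 1): rng.random((4, 4)), (1, 2): rng.random((4, 4))}
assign, cost = match_parts(unary, pairwise, n_parts=3)
```

The returned `assign` maps each part to a cluster; in the real system the unary costs come from learned appearance and motion features and the pairwise costs encode the spatio-temporal layout of the parts.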
Supplementary Video

The following video contains examples from the test sets of the Hollywood Human Action dataset (HOHA) and the UCF-Sports dataset. The extracted trajectories are superimposed on the frames. Trajectories not associated with any node of the model are plotted only at their current location as white dots, whereas groups of trajectories associated with the nodes of the proposed model are colored yellow, blue, or green according to which of the 3 parts of the action they represent. Note that for visualization reasons we plot on each frame only the 3 most recent samples of each track. The manually annotated bounding boxes are shown in cyan.
Hollywood Human Action Dataset (HOHA) Annotation
Trajectory Grouping - Video Motion Segmentation - Trajectory Descriptors Code