
Dynamic Texture Modeling

Modeling spatial and temporal stationarity

Description

Dynamic textures are sequences of images of moving scenes that exhibit temporal regularity, such as sea waves, smoke, foliage, and traffic scenes. An important subset of dynamic textures is the one where the sequences exhibit not only temporal regularity but also spatial regularity. We present a characterization of this kind of dynamic texture, and pose the problems of modeling, learning, and synthesis for this type of sequence.
Spatially regular (or homogeneous) dynamic textures are as commonplace in regions of video sequences of natural scenes as 2D textures are in still images. A model that jointly captures the spatial and temporal structure of such sequences is therefore a fundamental step in a variety of applications, ranging from video compression and transmission to video segmentation and, ultimately, recognition.
While it has been observed that the distribution of intensity levels in natural images is highly kurtotic, this is mainly due to occlusions or boundaries delimiting statistically homogeneous regions. Within such regions it therefore makes sense to employ the simplest possible model that can capture at least the second-order statistics. As for temporal regularity, it has been shown that linear Gaussian models of high enough order produce synthetic sequences that are perceptually indistinguishable from the originals, for sequences of natural phenomena that are well approximated by stationary processes.
In this work we assume that the temporal and spatial regularity of video sequences translates into statistical temporal and spatial stationarity of the video signals. We propose to model the spatio-temporal stationarity of video signals with an extension of a simple class of multiscale autoregressive models. We show how the model parameters can be learned efficiently, and how they can be used to synthesize sequences that extend the original ones in both space and time.
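For orientation, a linear Gaussian model of the temporal dynamics is commonly written as a state-space system. The form below is the standard textbook convention; the symbols (state x_t, frame y_t, matrices A and C, noise covariances Q and R, state dimension n) are not defined on this page and are introduced here only as an illustration.

```latex
\begin{aligned}
x_{t+1} &= A\,x_t + v_t, & v_t &\sim \mathcal{N}(0, Q), \\
y_t     &= C\,x_t + w_t, & w_t &\sim \mathcal{N}(0, R),
\end{aligned}
\qquad x_t \in \mathbb{R}^{n},\; y_t \in \mathbb{R}^{m}
```

Here y_t is the vectorized frame at time t, and the "order" of the model corresponds to the state dimension n.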
The main contributions of our approach are:
  • Modeling: we characterize dynamic textures that are spatially homogeneous, and propose a new model that is an extension of a class of multiscale autoregressive models.
  • Learning: we propose to learn the model using maximum likelihood, but also derive a closed-form sub-optimal solution for the efficient computation of the parameters that is based on SVD and least squares.
  • Synthesis: we found that even the simplest model (which in space simulates a second-order Markov random field) captures a wide range of natural phenomena.
  • Compression: we found that modeling spatio-temporal stationarity, instead of temporal stationarity alone, allows an increase in the compression ratio on the order of hundreds.
  • Implementation: our algorithm is simple to implement, efficient to learn, and fast to simulate; it allows one to generate sequences that extend the original ones in both space and time.
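The flavor of a closed-form learning step based on SVD and least squares can be illustrated with a much simpler, temporal-only model: an SVD of the mean-subtracted frames gives an appearance basis, and least squares on the resulting state trajectory gives the dynamics. The sketch below is not the spatio-temporal multiscale algorithm of this project and does not extend sequences in space. The function names, the `order` parameter, and the array layout are all assumptions made for illustration.

```python
import numpy as np

def learn_temporal_model(frames, order):
    """Fit a temporal-only linear model to frames of shape (T, ...).

    Returns (mean, C, A, Q, x0). Hypothetical helper, not the project's code.
    """
    T = frames.shape[0]
    Y = np.asarray(frames, dtype=float).reshape(T, -1).T  # pixels x T
    mean = Y.mean(axis=1, keepdims=True)
    # SVD of the centered data: leading left singular vectors span appearance.
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :order]
    X = s[:order, None] * Vt[:order]                      # states, order x T
    # Least squares for the dynamics: minimize ||X[:, 1:] - A X[:, :-1]||_F.
    A = np.linalg.lstsq(X[:, :-1].T, X[:, 1:].T, rcond=None)[0].T
    resid = X[:, 1:] - A @ X[:, :-1]
    Q = resid @ resid.T / (T - 1)                         # driving-noise covariance
    return mean, C, A, Q, X[:, 0]

def synthesize(mean, C, A, Q, x0, n_frames, frame_shape, rng):
    """Simulate the learned model for n_frames (extends in time only)."""
    # Square root of Q via eigen-decomposition, robust to near-singular Q.
    w, V = np.linalg.eigh(Q)
    B = V * np.sqrt(np.clip(w, 0.0, None))
    x = x0.copy()
    out = np.empty((n_frames,) + tuple(frame_shape))
    for t in range(n_frames):
        out[t] = (mean[:, 0] + C @ x).reshape(frame_shape)
        x = A @ x + B @ rng.standard_normal(x.shape[0])
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    training = rng.standard_normal((40, 8, 8))            # stand-in for real frames
    model = learn_temporal_model(training, order=5)
    video = synthesize(*model, n_frames=120, frame_shape=(8, 8), rng=rng)
    print(video.shape)
```

With real footage one would pass the (normalized) training frames instead of random data and choose `order` by inspecting the decay of the singular values; both choices are left open here.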

Results

The following examples demonstrate the power of our model to extrapolate new video sequences in both space and time. Given a training sequence we apply the learning procedure and extract the parameters of the model. We then simulate the model to synthesize new video sequences.
To better satisfy the model's hypothesis of stationarity in both space and time, we normalize the mean and variance of each sequence before running the learning algorithm. Notice that the training sequences are, of course, not perfectly stationary (especially in space), and the model infers the "average" spatial structure of the original sequence. Also, for portability reasons, the .avi movies are MPEG compressed (video coder V1), and the quality of the synthesized images is degraded accordingly.
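A minimal sketch of such a normalization step follows. The page does not say whether the mean and variance are normalized globally, per pixel, or per color channel; the version below assumes a single global mean and standard deviation per sequence, and the function names are hypothetical.

```python
import numpy as np

def normalize_sequence(frames, eps=1e-8):
    """Scale a sequence to zero mean and (approximately) unit variance.

    Assumption: one global mean/std for the whole sequence.
    """
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean()
    std = frames.std()
    return (frames - mean) / (std + eps), mean, std

def denormalize_sequence(normalized, mean, std, eps=1e-8):
    """Undo normalize_sequence, e.g. on synthesized frames before display."""
    return normalized * (std + eps) + mean
```

The stored `mean` and `std` would then be reapplied to the synthesized frames so that they have the same intensity range as the training sequence.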

Boiling water

This example shows 100 frames of a color training sequence and 300 synthesized frames. As the synthesis results show, the model captures the spatial structure as well as the rapidly fluctuating temporal dynamics. As noted above, the training sequence is not perfectly stationary (especially in space), so the model infers its "average" spatial structure.
Download .avi movie [1.46MB]

Fountain

This example shows 100 frames of a color training sequence and 300 synthesized frames. As the synthesis results show, the model captures the spatial structure as well as the temporal dynamics. In fact, even with the spatial extension, one can clearly perceive the water falling down consistently.
Download .avi movie [1.27MB]

Ocean waves

This example shows 100 frames of a color training sequence and 300 synthesized frames. The synthesis results show that the appearance and movement of the waves are well captured by the model. Notice that the small highlights of the training sequence are filtered out in the synthesized sequence. In fact, one could use our model to denoise video sequences.
Download .avi movie [371KB]

Waterfall

This example shows 100 frames of a color training sequence and 300 synthesized frames. Here too, the spatial structure and dynamics are well captured.
Download .avi movie [540KB]

Fire

In this example too, we synthesize 300 frames from 100 frames of a color training sequence. This example is included to show what happens when the hypothesis of spatial homogeneity is broken. In this fire sequence the spatial stationarity assumption is strongly violated, and the model captures a "homogenized" spatial structure that generates images rather different from those of the training sequence. Moreover, since the learning procedure factorizes the training set by first learning the spatial parameters and then relying on these estimates to infer the temporal parameters, the temporal statistics (temporal correlation) also appear corrupted when compared with those of the original sequence.
Download .avi movie [2.02MB]

Related publications

  • Doretto, G., Jones, E., and Soatto, S.
    Spatially homogeneous dynamic textures. In Proceedings of European Conference on Computer Vision, pp. 591–602, Prague, Czech Republic, May 2004.
    Oral Presentation
  • Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S.
    Dynamic textures. International Journal of Computer Vision, 51(2):91–109, 2003.
