Gianfranco Doretto / Research / Project

Dynamic Texture Modeling

Modeling temporal stationarity


Dynamic textures are sequences of images of moving scenes that exhibit temporal regularity, understood in a statistical sense: sea waves, smoke, foliage, and whirlwinds, but also talking faces, traffic scenes, etc. We present a characterization of dynamic textures that poses the problems of modeling, learning, and synthesis for this type of sequence.
Consider a sequence of images of a moving scene. Each image is an array of positive numbers that depend upon the shape, pose, and motion of the scene, as well as upon its material properties (reflectance) and the light distribution of the environment. It is well known that the joint reconstruction of photometry and geometry is an intrinsically ill-posed problem: from any (finite) number of images it is not possible to uniquely recover all the unknowns (shape, motion, reflectance, and light distribution). Given this arbitrariness in the reconstruction and interpretation of visual scenes, there is no notion of a "true" interpretation, and any criterion for correctness is somewhat arbitrary (with the exception of humans, who can use prior information and other sensory modalities, such as touch).
For this reason, in this project we analyze sequences of images of moving scenes solely as visual signals: interpreting and understanding a signal amounts to inferring a stochastic model that generates it. The goodness of the model can be measured in terms of the total likelihood of the measurements or in terms of its predictive power: a model should be able to give accurate predictions of future signals. In a sense, we look for an explanation of the image data that allows us to recreate and extrapolate it. The model can therefore be thought of as a compressed version, or the essence, of the sequence of images.
By making the assumption that the temporal regularity of video sequences translates into statistical stationarity of the video signal, the general model for a dynamic texture is a dynamical system. We proved that even the simplest instance of the model (i.e., a linear dynamical system) can capture a variety of natural phenomena.
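In symbols, the simplest instance is a first-order linear state-space model driven by Gaussian noise, where x(t) is a low-dimensional hidden state and y(t) is the vectorized image at time t (standard system-identification notation, sketched here for concreteness):

```latex
x(t+1) = A\,x(t) + v(t), \qquad v(t) \sim \mathcal{N}(0, Q)
```
```latex
y(t) = C\,x(t) + w(t), \qquad w(t) \sim \mathcal{N}(0, R)
```

The matrix A encodes the temporal dynamics, while the columns of C encode the spatial appearance of the texture.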
The main contributions of our approach are:
  • Representation: we present a novel definition of dynamic texture that is general (even the simplest instance can capture the second-order statistics of a video signal), and precise (it allows making analytical statements and drawing from the rich literature on system identification).
  • Learning: we propose two criteria, total likelihood and prediction error. For the simplest instance of the model we derive a closed-form solution to the learning problem.
  • Recognition: we found that similar textures tend to cluster in model space, and assessed the potential for recognition of dynamic visual processes.
  • Synthesis: we found that even the simplest model (first-order autoregressive moving average model with Gaussian input) captures a wide range of natural phenomena.
  • Implementation: our algorithm is simple to implement, efficient to learn and fast to simulate; it allows one to generate infinitely long sequences from short input sequences.
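To illustrate how simple the implementation is, the closed-form learning step for the simplest (first-order) instance can be sketched in NumPy as follows. This follows the standard SVD-based subspace procedure; the function name `learn_dynamic_texture` and its interface are our own for this sketch, not code from the project:

```python
import numpy as np

def learn_dynamic_texture(Y, n):
    """Closed-form (suboptimal) learning of a linear dynamic texture.

    Y : (p, tau) matrix whose tau columns are vectorized frames
    n : dimension of the hidden state (n << p)
    Returns C (p, n), A (n, n), Q (n, n), and the states X (n, tau).
    """
    # Best rank-n factorization of the data: Y ~ C X, with C^T C = I
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                      # spatial filters (observation matrix)
    X = np.diag(s[:n]) @ Vt[:n, :]    # estimated state trajectory
    # State dynamics by least squares: X[:, 1:] ~ A X[:, :-1]
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    # Covariance of the driving noise from the one-step residuals
    E = X[:, 1:] - A @ X[:, :-1]
    Q = E @ E.T / (X.shape[1] - 1)
    return C, A, Q, X
```

In practice one would typically subtract the temporal mean frame before learning and add it back at synthesis time.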


The following examples demonstrate the power of our model to extrapolate new video sequences. Given a training sequence we apply the learning procedure and extract the parameters of the model. We then simulate the model to synthesize new video sequences.
Note that the learning procedure has been applied directly to the raw data; no preprocessing has been performed. Also, for portability reasons, the .avi movies are MPEG-compressed (video coder V1), and the quality of the synthesized images has degraded accordingly.
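The simulation step described above amounts to running the learned linear system forward in time, driving it with Gaussian noise. A minimal sketch (the name `synthesize` and its interface are illustrative; C, A, Q denote the learned observation matrix, state-transition matrix, and input-noise covariance):

```python
import numpy as np

def synthesize(C, A, Q, x0, num_frames, seed=None):
    """Simulate the learned linear system to generate new frames.

    C : (p, n) observation matrix, A : (n, n) state transition,
    Q : (n, n) input-noise covariance, x0 : (n,) initial state.
    Returns a (p, num_frames) matrix of vectorized synthetic frames.
    """
    rng = np.random.default_rng(seed)
    # Factor Q so that B @ z has covariance Q for standard normal z
    B = np.linalg.cholesky(Q + 1e-8 * np.eye(Q.shape[0]))
    x = np.asarray(x0, dtype=float).copy()
    frames = []
    for _ in range(num_frames):
        x = A @ x + B @ rng.standard_normal(x.shape[0])  # state update
        frames.append(C @ x)                             # render frame
    return np.stack(frames, axis=1)
```

Because the model is generative, num_frames can be arbitrarily large: this is how sequences much longer than the training clip are extrapolated.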

Grayscale sequences

In the following example, from four training sequences (smoke, fountain, river waves, and curtain) of 100 grayscale frames each, we synthesize 300 frames. (The training sequences have been borrowed from the MIT temporal texture database.)
Download .avi movie [2.3Mb]


This example shows 100 frames of a color training sequence and 200 synthesized frames. Note that we do not require the video sequence to exhibit spatial regularity. In fact, the model aims at modeling temporal correlation only, while spatial correlation can be different at every point of the image plane.
Download .avi movie [1.04Mb]

Ocean waves

This example shows 100 frames of a color training sequence and 200 synthesized frames. Here the video sequence exhibits a great deal of spatial regularity. Notice that the small highlights of the training sequence are filtered out in the synthesized sequence; in fact, one could use our model to perform video sequence denoising.
Download .avi movie [722Kb]
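The denoising remark above can be made concrete: reconstructing the frames through the low-dimensional state (equivalently, projecting them onto the rank-n subspace spanned by the learned spatial filters) suppresses components, such as fleeting highlights, that the model cannot explain. A minimal sketch, with a function name of our own choosing:

```python
import numpy as np

def denoise(Y, n):
    """Rank-n reconstruction of the frames through the learned subspace.

    Y : (p, tau) matrix of vectorized frames; n : state dimension.
    """
    mean = Y.mean(axis=1, keepdims=True)        # temporal mean frame
    U, _, _ = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n]                                # spatial filters, as in learning
    return mean + C @ (C.T @ (Y - mean))        # project and reconstruct
```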


This example shows 100 frames of a color training sequence and 200 synthesized frames. This video sequence is far from Gaussian, yet, although the model is linear, the synthesized outcome still preserves the temporal dynamics very well, and the images look appealing. (The training sequence has been borrowed from the Artbeats Digital Film Library.)
Download .avi movie [890Kb]


This example shows 100 frames of a color training sequence and 200 synthesized frames. Note that the training sequence is full of highlights, which makes the learning procedure much more difficult. Nevertheless, the synthesized video sequence preserves the temporal dynamics very well, while the quality of the images has clearly degraded.
Download .avi movie [1.37Kb]

Related publications

  • Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S.
    Dynamic textures.
    International Journal of Computer Vision, 51(2):91–109, 2003.
    Details   BibTeX   PDF (2.6MB)
  • Soatto, S., Doretto, G., and Wu, Y. N.
    Dynamic textures.
    In Proceedings of IEEE International Conference on Computer Vision, pp. 439–446, Vancouver, BC, Canada, July 2001.
    Oral Presentation
    Details   BibTeX   PDF (929.6kB)
  • Doretto, G., Pundir, P., Wu, Y. N., and Soatto, S.
    Dynamic textures.
    Technical Report TR200032, UCLA Computer Science Department, 2000.
    Details   BibTeX   PDF (586.6kB )