
Dynamic Texture Editing

Plenoptic modeling for viewpoint editing

Description

In this project we consider the following problem: given images of a dynamic scene that exhibits some form of temporal regularity, taken from one or more moving cameras, we want to extract a model that explains the measured data. Specifically, we are interested in inferring models of the temporal statistical properties of visual scenes that can be used to generate synthetic image sequences in which both the temporal statistics and the motion of the vantage point can be edited.
Images of objects with complex shape, motion and material properties are commonplace in our visual world: think of a silk gown, a burning flame, a waterfall, etc. The complexity of these physical phenomena far exceeds that of the images they generate, and it is well known that the inverse problem of visual perception is intrinsically ill-posed. This means that, unless additional prior assumptions are available, one has to give up the goal of inferring “the” model of the physical world and settle for a much poorer representation, one that can merely explain the measured data. We therefore let our task guide our assumptions on the representation, so that they lead to a well-posed inference problem.
We assume that, associated with a dynamic scene that exhibits temporal regularity, there is a light field energy that can be modeled as a stationary process. The images (or filtered versions of the images) can then be modeled as the output of a time-varying dynamical model. The model, together with a stochastic input, represents the dynamic variability of the image sequence. In addition, we explicitly model the vantage point, so that during synthesis we can change it arbitrarily and render sequences from a camera undergoing a virtual motion. The resulting algorithms could be useful for video editing in which the motion of the vantage point is controlled interactively, as well as for stabilized synthetic generation of video sequences.
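As an illustration only, one common way to write a dynamical model of this kind is the linear state-space form used in the "Dynamic textures" paper listed under Related publications, here with time-dependent parameters. All symbols below are assumptions introduced for this sketch and are not taken from the text above:

```latex
\begin{aligned}
x_{t+1} &= A_t\, x_t + v_t, & v_t &\sim \mathcal{N}(0, Q_t),\\
y_t     &= C_t\, x_t + w_t, & w_t &\sim \mathcal{N}(0, R_t).
\end{aligned}
```

Here $y_t$ would be the image (or filtered image) at time $t$, $x_t$ a hidden state, $v_t$ the stochastic input and $w_t$ measurement noise. In this reading, the time dependence of $A_t$ and $C_t$ is what could absorb the motion of the vantage point; with constant $A$ and $C$ the model reduces to a time-invariant system.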
The main contributions of our approach are:
  • We pose the problem of building an Image Based Rendering framework to jointly model temporal dynamics and vantage point of complex dynamic scenes that exhibit temporal regularity.
  • We propose a general model for vantage point and temporal dynamics that is based on a time-variant dynamic system, and derive two simplified versions of it: one that leads to the interpolation of time-invariant dynamic systems, and another one that deals with the simpler case of planar scenes.
  • For the two simplifications of the general model we give algorithms that infer the model parameters in closed form and iteratively.
  • By simulating the simplified models, we find that they can capture the temporal dynamics of the scene. As for controlling the vantage point, we find that the number of models used in the interpolation restricts the set of possible trajectories of a virtual motion, while in the case of planar scenes one has full control.
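Simulating a learned model, as mentioned in the last point, can be sketched as follows. This is a minimal example that assumes a time-invariant linear system driven by white Gaussian noise; the function name `simulate`, the matrices `A` and `C`, and all numeric values are hypothetical and chosen only for illustration. A real model would have one output row per pixel.

```python
import math
import random


def simulate(A, C, x0, n_frames, noise_std=0.1, seed=0):
    """Simulate x[t+1] = A x[t] + v[t], y[t] = C x[t]; return the outputs y."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    n = len(x0)
    x = list(x0)
    frames = []
    for _ in range(n_frames):
        # Output: each "pixel" is a linear combination of the state components.
        frames.append([sum(row[k] * x[k] for k in range(n)) for row in C])
        # State update driven by the stochastic input v[t].
        x = [sum(row[k] * x[k] for k in range(n)) + rng.gauss(0.0, noise_std)
             for row in A]
    return frames


# Toy model (hypothetical values): a damped rotation observed by 3 "pixels".
theta = 0.3
A = [[0.95 * math.cos(theta), -0.95 * math.sin(theta)],
     [0.95 * math.sin(theta), 0.95 * math.cos(theta)]]
C = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
frames = simulate(A, C, x0=[1.0, 0.0], n_frames=200)
```

Because the noise keeps exciting the state, the synthesized sequence can be made arbitrarily long, which is how a synthesis can exceed the length of its training sequence.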

Results

The following examples show how our approach can be used to extrapolate and manipulate video sequences. Given a training sequence, we apply the learning procedure to infer the model parameters. We then simulate and control the model to render a video sequence corresponding to a camera undergoing a virtual motion.
Note that the learning procedure has been applied directly to the raw data, with no preprocessing. Also, for portability, the .avi movies are MPEG compressed (video coder V1), and the quality of the synthesized images is degraded accordingly.

Model interpolation: inverted-fountain

In this first example we sample the light energy field at different positions. At each position we learn a model. A virtual motion then corresponds to an interpolation of these models. Here we sample 6 positions and obtain sequences of 120 color frames each. From the first to the last sequence the camera approximately samples a circular trajectory that pans around the fountain. We then synthesize a virtual motion by concatenating, forward and then backward, the 6 models inferred from the 6 sequences. The resulting movie appears to be made by a camera that is panning smoothly around the inverted-fountain on a circular trajectory, as we would expect. Notice that the dynamics of the scene are fairly well captured by the model. The movie is 240 frames long.
Download a training set sample .avi movie [1.69MB]
Download the synthesis .avi movie [1.01MB]
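The forward and backward concatenation of models described in this example can be sketched as a per-frame blending schedule. This is a hedged illustration, not the interpolation scheme actually used: the helper names are hypothetical, and the plain linear blend of parameter matrices is an assumption made only to keep the example short.

```python
def forward_backward_path(n_models):
    """Visit models 0..n-1 and then return, e.g. [0, 1, 2, 1, 0] for n = 3."""
    forward = list(range(n_models))
    return forward + forward[-2::-1]


def blend_schedule(path, n_frames):
    """For each frame return (model_a, model_b, weight_of_b) along the path."""
    segments = len(path) - 1  # requires at least two entries in the path
    schedule = []
    for t in range(n_frames):
        pos = segments * t / (n_frames - 1) if n_frames > 1 else 0.0
        k = min(int(pos), segments - 1)  # index of the current path segment
        schedule.append((path[k], path[k + 1], pos - k))
    return schedule


def blend_matrix(Ma, Mb, w):
    """Linear blend (1 - w) * Ma + w * Mb of two equally sized matrices."""
    return [[(1.0 - w) * a + w * b for a, b in zip(ra, rb)]
            for ra, rb in zip(Ma, Mb)]


path = forward_backward_path(6)      # 6 models, as in the example above
schedule = blend_schedule(path, 240) # 240 synthesized frames
```

At frame `t`, the parameters of models `schedule[t][0]` and `schedule[t][1]` would be blended with weight `schedule[t][2]`. Note that a linear blend of two stable system matrices is not guaranteed to be stable, so a practical scheme may need more care.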

Model interpolation: waterfall

This is another example of model interpolation. Here the data set consists of 21 sequences of 120 color frames each, which sample the light energy field almost uniformly in a portion of 3D space. Using a distance defined on the space of models, we compute the connectivity graph between models. We then use the graph to automatically extract 6 models along the shortest path that goes through 3 manually selected key-models. From the first key-model to the second the camera moves closer to the scene; from the second to the third the camera pans around the scene; from the third key-model the virtual motion simply goes back to the second. The resulting movie appears to be made by a camera that moves forward and then pans around the scene back and forth. The movie is 200 frames long.
Download a training set sample .avi movie [2.06MB]
Download the synthesis .avi movie [1.60MB]
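Extracting models along a shortest path through key-models can be sketched with Dijkstra's algorithm on the connectivity graph. This is a minimal sketch under stated assumptions: the distance between models is not specified here, so the edge weights in the toy graph are placeholders, and `shortest_path` and `route_through` are hypothetical helper names.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra on a dict-of-dicts graph {node: {neighbor: distance}}."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        raise ValueError("target is not reachable from source")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def route_through(graph, keys):
    """Chain shortest paths between consecutive key-models."""
    route = [keys[0]]
    for a, b in zip(keys, keys[1:]):
        route += shortest_path(graph, a, b)[1:]
    return route


# Toy connectivity graph; the weights stand in for model-to-model distances.
graph = {
    0: {1: 1.0, 2: 4.0},
    1: {0: 1.0, 2: 1.0, 3: 5.0},
    2: {0: 4.0, 1: 1.0, 3: 1.0},
    3: {1: 5.0, 2: 1.0},
}
route = route_through(graph, [0, 2, 3, 2])  # third key-model returns to the second
```

The resulting `route` lists the models to interpolate in order, including any intermediate models the graph inserts between the manually chosen key-models.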

Planar approximation: fountain-corner

In this example we use the simplification of the general model for scenes that are well approximated by a plane. From a sequence of 170 color images taken by a moving camera we learn the motion of the vantage point and the model of the dynamic texture. The synthesized camera motion is as follows: the camera first zooms in, then translates to the left, turns to the left and then to the right, and finally zooms out. In this case the model allows full editing control over the vantage point. The synthesized sequence is 200 frames long.
Download the training set .avi movie [485KB]
Download the synthesis .avi movie [184KB]
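For a planar scene, two views are related by a 3x3 homography, so a virtual camera motion can be sketched as a sequence of homographies applied to the synthesized texture. The example below is an illustration under that assumption only: it covers zoom and in-plane translation, which are special (affine) cases, whereas turning the camera would require a general projective homography. All function names are hypothetical.

```python
def matmul3(P, Q):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def zoom(s, cx, cy):
    """Scale by s about the image point (cx, cy)."""
    return [[s, 0.0, cx * (1.0 - s)],
            [0.0, s, cy * (1.0 - s)],
            [0.0, 0.0, 1.0]]


def translate(tx, ty):
    """Shift image coordinates by (tx, ty)."""
    return [[1.0, 0.0, tx],
            [0.0, 1.0, ty],
            [0.0, 0.0, 1.0]]


def apply_h(H, x, y):
    """Map pixel (x, y) through H using homogeneous coordinates."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


# Zoom in by a factor of 2 about (50, 50), then shift 10 pixels in x.
H = matmul3(translate(10.0, 0.0), zoom(2.0, 50.0, 50.0))
```

Rendering a frame would then amount to resampling the synthesized texture through the homography chosen for that frame, with the homography varying over time to follow the desired camera path.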

Planar approximation: waterfall-2

This is another example where the planar approximation can be applied. From a training sequence of 130 frames we synthesize 200 frames while controlling the vantage point. The synthesized camera motion is as follows: the camera first zooms in, then translates to the left, down, right, and up, and finally rotates to the left. Notice that the dynamics of the scene are very well captured by the model.
Download the training set .avi movie [1.83MB]
Download the synthesis .avi movie [529KB]

Related publications

  • Doretto, G. and Soatto, S.
    Dynamic shape and appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2006–2019, IEEE Computer Society, Los Alamitos, CA, USA, 2006.
  • Doretto, G. and Soatto, S.
    Towards plenoptic dynamic textures. In Proceedings of the 3rd International Workshop on Texture Analysis and Synthesis, pp. 25–30, Nice, France, October 2003.
  • Doretto, G. and Soatto, S.
    Editable dynamic textures. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 137–142, Madison, Wisconsin, USA, June 2003.
  • Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S.
    Dynamic textures. International Journal of Computer Vision, 51(2):91–109, 2003.
