Dynamic Texture Editing
Plenoptic modeling for viewpoint editing
Description
In this project we consider the following problem: we are given images
of a dynamic scene that exhibits some sort of temporal regularity,
taken from one or more moving cameras, and we want to extract a model
that explains the measured data. Therefore, we are interested in
inferring models of the temporal statistical properties of visual
scenes that can be used to generate synthetic sequences of images,
where both the temporal statistics and the motion of the vantage point
can be edited.
Images of objects with complex shape, motion and material properties
are commonplace in our visual world: think of a silk gown, a burning
flame, a waterfall, etc. The complexity of these physical phenomena is
far greater than that of the images they generate, and it is well
known that the inverse problem of visual perception is intrinsically
ill-posed. This means that, unless additional prior assumptions are
available, one has to give up the goal of inferring “the” model
of the physical world, and settle for a much poorer representation,
one that can explain the measured data. Therefore, we let the task
guide our assumptions on the representation, so that they lead to a
well-posed inference problem.
We assume that, associated with a dynamic scene that exhibits
temporal regularity, there is a light field energy that can be
modeled as a stationary process. In this way the images (or filtered
versions of the images) can be modeled as the output of a
time-varying dynamical model. The model, together with a stochastic
input, represents the dynamic variability of the image sequence. In
addition, we explicitly model the vantage point, so that during
synthesis we can change it arbitrarily and render sequences from a
camera undergoing a virtual motion. The resulting algorithms could be
useful for video editing, where the motion of the vantage point can be
controlled interactively, as well as for stabilized synthesis of video
sequences.
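For orientation, the time-invariant special case of such a model can be written in the linear form used in the "Dynamic textures" paper listed under Related publications. The symbols below follow that standard notation and are not defined on this page; the time-varying model would let the matrices depend on time and on the vantage point, in a way this page does not specify.

```latex
\begin{aligned}
x_{t+1} &= A\,x_t + v_t, & v_t &\sim \mathcal{N}(0, Q),\\
y_t     &= C\,x_t + w_t, & w_t &\sim \mathcal{N}(0, R),
\end{aligned}
```

Here $y_t$ is the vectorized image (or a filtered version of it) at time $t$, $x_t$ is a low-dimensional hidden state, $v_t$ is the stochastic input, and $w_t$ is measurement noise.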
The main contributions of our approach are:
- We pose the problem of building an Image Based Rendering framework to jointly model temporal dynamics and vantage point of complex dynamic scenes that exhibit temporal regularity.
- We propose a general model for vantage point and temporal dynamics that is based on a time-variant dynamic system, and derive two simplified versions of it: one that leads to the interpolation of time-invariant dynamic systems, and another that deals with the simpler case of planar scenes.
- For both simplifications of the general model we give algorithms that infer the model parameters, in closed form and iteratively.
- By simulating the simplified models, we found that they can capture the temporal dynamics of the scene. As for controlling the vantage point, we found that the number of models used in the interpolation restricts the set of possible trajectories of a virtual motion, while in the case of planar scenes one can have full control.
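To make "closed-form inference" and "simulating the model" concrete, here is a minimal sketch of the well-known SVD-based procedure for a single time-invariant dynamic texture. It is an illustration under assumptions, not the algorithm used in this project: the function names, the (pixels, frames) data layout and the state dimension `n` are hypothetical choices, and the vantage point is not modeled at all.

```python
import numpy as np

def learn_dynamic_texture(Y, n):
    """Closed-form fit of x[t+1] = A x[t] + v[t], y[t] = mean + C x[t].

    Y is a (pixels, frames) matrix with one vectorized frame per column
    (an assumed layout); n is the chosen state dimension.
    """
    mean = Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    C = U[:, :n]                        # orthonormal appearance basis
    X = s[:n, None] * Vt[:n, :]         # state trajectory, shape (n, frames)
    # Least-squares estimate of the state transition matrix.
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    resid = X[:, 1:] - A @ X[:, :-1]
    Q = resid @ resid.T / (X.shape[1] - 1)   # driving-noise covariance
    return mean, C, A, Q, X[:, 0]

def synthesize(mean, C, A, Q, x0, num_frames, rng):
    """Simulate the learned model; returns a (pixels, num_frames) matrix."""
    n = A.shape[0]
    # Factor Q = B B^T; the eigen-decomposition tolerates a singular Q.
    w, V = np.linalg.eigh(Q)
    B = V * np.sqrt(np.clip(w, 0.0, None))
    x = x0.copy()
    frames = []
    for _ in range(num_frames):
        frames.append(mean[:, 0] + C @ x)
        x = A @ x + B @ rng.standard_normal(n)
    return np.stack(frames, axis=1)
```

A call such as `synthesize(*learn_dynamic_texture(Y, 20), 200, np.random.default_rng(0))` would generate 200 frames from a training matrix `Y`; the value 20 is arbitrary. Because the model is generative, the synthesized sequence can be longer than the training sequence.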
Results
The following examples demonstrate the power of our approach to
extrapolate and manipulate new video sequences. Given a training
sequence we apply the learning procedure to learn model parameters. We
then simulate and control the model to render a video sequence
corresponding to a camera undergoing a virtual motion.
Note that the learning procedure has been applied directly to the raw
data, with no preprocessing. Also, for portability, the .avi movies
are MPEG compressed (video coder V1), and the quality of the
synthesized images is degraded accordingly.
Model interpolation: inverted-fountain
In this first example we sample the light energy field at different
positions. At each position we learn a model. A virtual motion will
then correspond to an interpolation of these models. Here we sample 6
positions and obtain sequences of 120 color frames. From the first to
the last sequence the camera is approximately sampling a circular
trajectory that pans around the fountain. We then synthesize a virtual
motion by concatenating the 6 models inferred from the 6 sequences,
first forward and then backward. The resulting movie appears to be
made by a camera that is panning smoothly around the inverted-fountain
on a circular trajectory, as expected. Notice that the dynamics of the
scene are fairly well captured by the model. The movie is 240 frames
long.
Download a training set sample .avi movie [1.69MB]
Download the synthesis .avi movie [1.01MB]
Model interpolation: waterfall
This is another example where we apply model interpolation. Here the
data set consists of 21 sequences of 120 color frames, which almost
uniformly sample the light energy field in a portion of the 3D
space. By using a distance defined in the space of models, we compute
the connectivity graph between models. We then use the graph to
automatically extract 6 models along the shortest path that goes
through 3 manually selected key-models. From the first key-model to
the second one the camera is moving closer to the scene; from the
second to the third key-model the camera is panning around the scene;
from the third key-model the virtual motion simply goes back to the
second one. The resulting movie appears to be made by a camera that is
moving forward and then panning around the scene back and forth. The
movie is 200 frames long.
Download a training set sample .avi movie [2.06MB]
Download the synthesis .avi movie [1.60MB]
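The graph step in this example can be sketched as follows. This is a hedged illustration only: the page does not say which distance between models is used or how the connectivity graph is built, so the sketch assumes a precomputed pairwise distance matrix `D` (hypothetical), links each model to its `k` nearest neighbours, and chains Dijkstra shortest paths between consecutive key-models. All function names are made up for the example.

```python
import heapq
import numpy as np

def knn_graph(D, k):
    """Symmetric k-nearest-neighbour graph from a pairwise distance matrix."""
    m = D.shape[0]
    adj = {i: {} for i in range(m)}
    for i in range(m):
        neighbours = [int(j) for j in np.argsort(D[i]) if j != i][:k]
        for j in neighbours:
            adj[i][j] = float(D[i, j])
            adj[j][i] = float(D[i, j])
    return adj

def shortest_path(adj, src, dst):
    """Dijkstra's algorithm; returns the list of nodes from src to dst."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        raise ValueError("models are not connected in the graph")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]

def path_through_keys(adj, keys):
    """Concatenate shortest paths between consecutive key-models."""
    path = [keys[0]]
    for a, b in zip(keys[:-1], keys[1:]):
        path += shortest_path(adj, a, b)[1:]
    return path
```

The returned list of model indices gives the order in which models would be visited by the virtual camera; interpolating between consecutive models along that path is a separate step not shown here.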
Planar approximation: fountain-corner
In this example we use the simplification of the general model for
scenes well approximated by a plane. From a sequence of 170 color
images taken by a moving camera we learn the motion of the vantage
point and the model of the dynamic texture. The synthesized camera
motion is such that the camera is first zooming in, then translating
to the left, turning to the left, then to the right, and finally
zooming out. In this case the model allows full editing control over
the vantage point. The synthesized sequence is 200 frames long.
Download the training set .avi movie [485KB]
Download the synthesis .avi movie [184KB]
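One reason a planar scene permits full control of the vantage point is that two views of a plane are related by a homography, so a virtual camera motion can be rendered by warping. The sketch below computes the textbook plane-induced homography; it is not taken from this project, and the intrinsics `K`, the plane parameters `n`, `d` and the sign convention are assumptions stated in the code.

```python
import numpy as np

def plane_homography(K, R, t, n, d):
    """Homography mapping pixels of a plane from camera 1 to camera 2.

    Assumed convention: plane points in camera-1 coordinates satisfy
    n @ X1 == d, and camera-2 coordinates are X2 = R @ X1 + t.
    """
    H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
    return H / H[2, 2]  # fix the arbitrary scale (assumes H[2, 2] != 0)

def apply_homography(H, pts):
    """Map an (N, 2) array of pixel coordinates through H."""
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]
```

Under this convention, a "zoom in" toward a fronto-parallel plane corresponds to `R` equal to the identity and `t` with a negative z component, and a sideways translation changes only the x component of `t`.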
Planar approximation: waterfall-2
This is another example where the planar approximation can be
applied. From a training sequence of 130 frames we synthesize 200
frames while controlling the vantage point. The synthesized camera
motion is such that the camera is first zooming in, then translating
to the left, down, right, up, and finally rotating to the left. Notice
that the dynamics of the scene are very well captured by the model.
Download the training set .avi movie [1.83MB]
Download the synthesis .avi movie [529KB]
Related publications
- Doretto, G. and Soatto, S. Dynamic shape and appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2006–2019, IEEE Computer Society, Los Alamitos, CA, USA, 2006.
- Doretto, G. and Soatto, S. Towards plenoptic dynamic textures. In Proceedings of the 3rd International Workshop on Texture Analysis and Synthesis, pp. 25–30, Nice, France, October 2003.
- Doretto, G. and Soatto, S. Editable dynamic textures. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 137–142, Madison, Wisconsin, USA, June 2003.
- Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S. Dynamic textures. International Journal of Computer Vision, 51(2):91–109, 2003.