In order to accomplish this objective, we are developing methods for rapidly combining a set of images of a real environment. Video is used as a source of multiple views, in general taken by multiple cameras that are widely distributed around the environment. Output is a set of images to be viewed by a person or used as input to other image understanding algorithms. For both visualization and further processing we are focused on producing photorealistic images of novel views and smooth sequences of views. Thus the main emphasis is on image appearance, not surface reconstruction or model building.
In each of the above tasks the raw sensor data may not be well matched with its intended use. Different tasks require different views of a scene, and so the "optimal" views for a particular task may not have been captured. Also, a sensor may be time-shared for multiple uses in a single mission, e.g., when slewing between multiple targets, and interleaving target tracking with systematic scanning of the environment. For these reasons it is advantageous to synthesize a virtual video tuned to the operator's viewing preferences and task-specific targets and activities, thus enhancing capabilities for monitoring and comprehending areas of interest and assessing objects' dispositions.
The following figure shows the result of view morphing between two input images (left and right) of an object taken from two different viewpoints. The middle image was synthesized by our view morphing algorithm. You can also view a sequence of in-between views (184K MPEG), which creates the impression of a camera moving smoothly in depth through the scene.
Input View 1 | Synthesized View | Input View 2
The example shows that our approach can synthesize new views with photorealistic quality comparable to the original images. It should be noted that, since view morphing assumes scene points are visible in both input images, some features near the edges of the example fade in or out because they appear in only one of the two input images. Also, note that in the example no special processing has been done to account for moving objects. In the original scene there is a vehicle moving on the road, so its position changes between the two input images. The results show the vehicle fading out of one position and into the other.
While the synthesized view in the example above is not dramatically different from views in the original video sequence, this is not a restriction on the method; view morphing can potentially generate views that vary significantly from any view in the original video.
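View morphing works by prewarping the two images so that their image planes are parallel, linearly interpolating corresponding points and colors, and postwarping the result back to the desired view. The sketch below shows only the middle interpolation step, assuming point correspondences are already available; the function name and the sample coordinates and colors are purely illustrative, not taken from our implementation.

```python
import numpy as np

def interpolate_views(pts0, pts1, colors0, colors1, s):
    """Blend matching points and colors from two (already rectified) views.
    pts0, pts1: (N,2) pixel positions of corresponding points; s in [0, 1]."""
    pts_s = (1.0 - s) * pts0 + s * pts1              # linearly interpolate positions
    colors_s = (1.0 - s) * colors0 + s * colors1     # cross-dissolve the colors
    return pts_s, colors_s

# Illustrative correspondences and colors for two matched points.
pts0 = np.array([[100.0, 120.0], [200.0, 140.0]])
pts1 = np.array([[130.0, 118.0], [230.0, 150.0]])
colors0 = np.array([[200.0, 180.0, 160.0], [90.0, 80.0, 70.0]])
colors1 = np.array([[205.0, 178.0, 158.0], [95.0, 82.0, 73.0]])
for s in (0.25, 0.5, 0.75):                          # three in-between views
    pts_s, colors_s = interpolate_views(pts0, pts1, colors0, colors1, s)
```

Dense correspondences and the pre/postwarp transforms are what make the interpolated views physically valid; the snippet only conveys the interpolation at the heart of the morph.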
Two resolutions of Mona Lisa <--> Mona Lisa reflection:
The following examples show some early results of this approach.
Input View 1 | Synthesized View | Input View 2

Input View 1 | Synthesized View | Input View 2

Input View 1 | Synthesized Movie | Input View 2
We are developing a new approach called Voxel Coloring that reconstructs the "color" (radiance) at surface points in an unknown scene. Initially, we assume a static scene containing Lambertian surfaces under fixed illumination so the radiance from a scene point can be described simply by a scalar value, which we call color.
Coping with large visibility changes between images means solving the correspondence problem between images that are very different in appearance, which is a very difficult problem. Rather than use traditional methods such as stereo, we use a scene-based approach. That is, we represent the environment as a discretized set of voxels, and use an algorithm that traverses these voxels and colors those that are part of a surface in the scene. The advantage of this approach is that simple voxel projection determines the candidate corresponding pixels in each image. A difficulty is that a given image pixel may not correspond to a particular projecting voxel if a closer voxel occludes it. For example, in the figure below the red voxel in the scene projects into the first and third images but not the second, because the blue voxel occludes it.
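To make the projection step concrete, the snippet below projects a single voxel center through a 3x4 camera projection matrix to find the candidate pixel it corresponds to. The matrix and coordinates are hypothetical values chosen only for illustration, not calibration data from our system.

```python
import numpy as np

# Hypothetical 3x4 projection matrix (intrinsics times [R|t]) for one calibrated camera.
P = np.array([[800.0, 0.0, 320.0, 0.0],
              [0.0, 800.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 2.0]])

voxel_center = np.array([0.1, -0.2, 1.5])     # a voxel center in world coordinates
u = P @ np.append(voxel_center, 1.0)          # homogeneous image coordinates
col, row = u[0] / u[2], u[1] / u[2]           # candidate pixel this voxel projects to
print(round(row), round(col))                 # e.g. 57 160 in a 640x480 image
```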
To solve this visibility problem we introduce a novel geometric constraint on the input camera positions that enables a single visibility ordering of the voxels to hold for every input viewpoint. This ordinal visibility constraint is satisfied whenever no scene point is contained within the convex hull of the input camera centers. Below are shown two simple camera configurations that satisfy this constraint. The left configuration shows a downward-facing camera that is moved 360 degrees around an object. The right configuration shows a rig of outward-facing cameras that are distributed around a sphere.
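One simple way to verify the ordinal visibility constraint for a proposed camera configuration is to test whether any scene point falls inside the convex hull of the camera centers. The sketch below uses scipy's Delaunay triangulation for the containment test; this is an implementation convenience for illustration, not part of the original algorithm.

```python
import numpy as np
from scipy.spatial import Delaunay

def satisfies_ordinal_visibility(camera_centers, scene_points):
    """True if no scene point lies inside the convex hull of the camera centers."""
    hull = Delaunay(camera_centers)               # triangulate the camera hull
    return not np.any(hull.find_simplex(scene_points) >= 0)

# Outward-facing rig: cameras at the corners of a small cube around the origin.
cameras = np.array([[x, y, z] for x in (-0.5, 0.5)
                              for y in (-0.5, 0.5)
                              for z in (-0.5, 0.5)])
# Scene points lie outside the rig, e.g. on the walls of the room.
scene = np.array([[3.0, 0.0, 0.0], [0.0, -2.5, 1.0], [2.0, 2.0, 0.5]])
print(satisfies_ordinal_visibility(cameras, scene))   # True: no point inside the camera hull
```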
Scene reconstruction is complicated by the fact that a set of images can be consistent with more than one rigid scene. Determining a scene's spatial occupancy is therefore an ill-posed problem because a voxel contained in one consistent scene may not be contained in another. On the other hand, a voxel may be part of two different consistent scenes, but have different colors in each. To cope with this problem we say a voxel V is color invariant with respect to a set of images if, for every pair of voxelizations S and T that contain V and that are consistent with the images, we have color(V, S) = color(V, T). Using this invariant, we define a voxel coloring of a set of images to be the maximally consistent coloring.
We can now define the complete voxel coloring algorithm as:
    S = {}                                      /* initial set of colored voxels is empty */
    for i = 1 to r do                           /* traverse each of r layers */
        foreach V in the ith layer of voxels do
            project V into all images where V is visible
            if sufficient correlation of the pixel colors then
                add V to S
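The following Python sketch shows one plausible realization of this sweep. It assumes the voxels have already been grouped into layers ordered near-to-far (as guaranteed by the ordinal visibility constraint) and that each camera is described by a 3x4 projection matrix. For simplicity it projects only the voxel center to a single pixel, whereas the real algorithm uses the voxel's full image footprint, and the consistency test (thresholded standard deviation of the pixel colors) is just one reasonable choice.

```python
import numpy as np

def voxel_coloring(layers, images, projections, tau=15.0):
    """layers: list of (N,3) arrays of voxel centers, ordered near-to-far;
    images: list of (H,W,3) float arrays; projections: list of 3x4 matrices."""
    occluded = [np.zeros(im.shape[:2], dtype=bool) for im in images]
    colored = []                                      # (voxel center, mean color) pairs
    for layer in layers:                              # sweep the layers in visibility order
        for V in layer:
            samples, hits = [], []                    # unoccluded colors seen by V, and where
            for k, (im, P) in enumerate(zip(images, projections)):
                u = P @ np.append(V, 1.0)             # project the voxel center
                if u[2] <= 0:                         # voxel is behind this camera
                    continue
                c, r = int(round(u[0] / u[2])), int(round(u[1] / u[2]))
                if 0 <= r < im.shape[0] and 0 <= c < im.shape[1] and not occluded[k][r, c]:
                    samples.append(im[r, c])
                    hits.append((k, r, c))
            if len(samples) >= 2:                     # need at least two views to test consistency
                samples = np.asarray(samples)
                if samples.std(axis=0).mean() < tau:  # pixel colors sufficiently correlated
                    colored.append((V, samples.mean(axis=0)))
                    for k, r, c in hits:              # mark these pixels as accounted for
                        occluded[k][r, c] = True
    return colored
```

Because layers are visited near-to-far, a pixel claimed by an earlier (closer) voxel is marked occluded and cannot be claimed again, which is how the visibility problem is resolved during the sweep.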
The original Voxel Coloring algorithm was designed for static scenes, so producing output quickly was not a priority. To use Voxel Coloring with dynamic scenes, several key elements of the original approach have been changed. Algorithmically, Voxel Coloring has been recast to take advantage of spatial coherence and temporal coherence. Experimentally, the amount of input data has been reduced significantly. Together, these changes allow us to contemplate implementing Voxel Coloring in real-time on conventional workstations.
The most significant change has been the development of a coarse-to-fine/multiresolution approach to Voxel Coloring which speeds up performance dramatically.
Octree methods are common in coarse-to-fine processing of volumes. We use a similar technique for Voxel Coloring. By decomposing large voxels into smaller voxels and then coloring the subdivided set, computation can be focused on significant portions of the scene.
Because Voxel Coloring depends on statistical methods, a direct decomposition does not work correctly. Voxels that appear empty at one resolution may contain subvoxels that should be colored at a higher resolution. This problem is illustrated in the figure below:
To compensate for the abundance of false negatives, the algorithm performs a nearest neighbor search to augment the set of voxels already colored. Such a search relies on a high degree of spatial coherence. In particular, it is assumed that all small details (high spatial frequency) occur close to the details which can be detected at coarse resolution. The figure below illustrates the augmentation process through one iteration.
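A rough sketch of how such a coarse-to-fine pass might be organized is shown below. It assumes a `color_voxels` routine like the layer sweep above that keeps only the photo-consistent candidates; the octree subdivision and the one-step neighbor augmentation are shown in a simplified form, and all names and the exact order of the augmentation are illustrative rather than a description of our implementation.

```python
import numpy as np
from itertools import product

def subdivide(center, size):
    """Split a voxel (center: length-3 array, size: edge length) into its eight octree children."""
    offsets = np.array(list(product((-0.25, 0.25), repeat=3))) * size
    return [(center + o, size / 2.0) for o in offsets]

def face_neighbors(center, size):
    """The six same-resolution neighbors used to augment the colored set."""
    steps = size * np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                             [0, -1, 0], [0, 0, 1], [0, 0, -1]])
    return [(center + s, size) for s in steps]

def coarse_to_fine(initial_voxels, color_voxels, levels):
    """initial_voxels: list of (center, size) pairs at the coarsest resolution;
    color_voxels: routine that keeps only the photo-consistent candidates."""
    colored = color_voxels(initial_voxels)
    for _ in range(levels):
        augmented = list(colored)                            # keep what is already colored...
        for center, size in colored:
            augmented.extend(face_neighbors(center, size))   # ...plus its nearby voxels
        candidates = []
        for center, size in augmented:                       # subdivide and re-test finer voxels
            candidates.extend(subdivide(center, size))
        colored = color_voxels(candidates)                   # duplicates not removed for brevity
    return colored
```

The neighbor augmentation is what recovers small details that were rejected as false negatives at the coarser level, relying on the spatial-coherence assumption described above.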
Depending on the final resolution of the scene, speedups range from 2 to 40 times over the original algorithm; the higher the final resolution, the greater the speedup. The input for the scene being colored consisted of eight images placed radially around the figure, each at 640x480 resolution, and the images were segmented manually.
To investigate dynamic scene processing, a staging area was built for data capture. Four cameras were used to collect the data. Each camera was mounted at a corner of the area, and was calibrated using the planar implementation of Tsai's algorithm. The walls of the staging area were then covered with blue matte paper to facilitate automatic segmentation. Scene reconstructions were then created serially off-line and redisplayed as a 3D movie. Finally, a version of Voxel Coloring which takes advantage of temporal coherence was developed. The whole process is illustrated in the mpegs below. Dynamic Voxel Coloring is described below.
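The blue matte backdrop makes segmentation a simple chroma threshold. The sketch below is a generic example of this kind of blue-screen masking, not our actual segmentation code, and the threshold values are arbitrary.

```python
import numpy as np

def blue_screen_mask(image):
    """Foreground mask for an RGB image (HxWx3, uint8) shot against blue matte paper.
    A pixel is treated as background when blue clearly dominates the other channels."""
    img = image.astype(np.float32)
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    background = (b > 90) & (b > r + 30) & (b > g + 30)   # arbitrary example thresholds
    return ~background                                    # True where the subject is
```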
The input sequence was made up of 16 frames taken from four cameras. The cameras were placed near the ceiling in the four corners of the room. The background was covered with blue matte paper to facilitate segmentation. The resolution of the input images was 320x240. This mpeg shows the stream of input from each camera in turn.
Voxel Coloring makes use of occlusion information which is built up during scene traversal. This mpeg shows how a particular image is projected onto each layer in turn. Note that the occluded pixels in the image are colored black as the mpeg progresses. Occluded pixels have been accounted for in scene space and cannot contribute further to the reconstruction.
Static colorings were then created from each of the 16 sets of input frames. This mpeg shows the static coloring with a scene resolution of 256x256x256 voxels. Notice that the reconstruction is hollow. Also note that the photorealism is limited by the resolution of the input images (for example, the head was roughly 15 pixels high).
This mpeg displays the colorings in sequence. The motion of the subject over time adds a great deal of realism to the visualization. (smaller movie)
The final mpeg shows the voxel coloring algorithm running on a sequence of frames while displaying the output interactively. Each 3D frame is reconstructed in about half a second on an SGI O2 R5000. To achieve this real-time performance, the resolution of the scene has been reduced to 64x64x64 voxels.
Last modified: 2 November 1999