USGS orthophoto. The United States Geological Survey (USGS) produces several digital mapping products that can be used to create an initial site model. These include: 1) Digital Orthophoto Quarter Quad (DOQQ) - a nadir (down-looking) image of the site as it would appear under orthographic projection, so that scene features appear in their correct horizontal positions. 2) Digital Elevation Model (DEM) - an image whose pixel values denote scene elevations at the corresponding horizontal positions. Each grid cell of a standard USGS DEM encompasses a 30-meter square area. 3) Digital Raster Graphic (DRG) - a digital version of the popular USGS topographic maps. 4) Digital Line Graph (DLG) - vector representations of public roadways and other cartographic features. Many of these products can be ordered directly from the USGS EROS Data Center web site, located at URL http://edcwww.cr.usgs.gov. The ability to bootstrap a VSAM site model from existing USGS or National Imagery and Mapping Agency (NIMA) mapping products demonstrates that rapid deployment of VSAM systems to monitor trouble spots around the globe is a feasible goal.
Custom DEM. The Robotics Institute autonomous helicopter group mounted a high-precision laser range finder on a remote-control Yamaha helicopter to create a high-resolution (half-meter grid spacing) DEM of the Bushy Run site. Raw laser returns were collected with respect to known helicopter position and orientation (using on-board altimetry data) to form a cloud of points representing returns from surfaces in the scene. These points were converted into a DEM by projecting them into LVCS horizontal-coordinate bins and computing the mean and standard deviation of the height values in each bin. For more information, see Collins, Tsin, Miller and Lipton, 1998.
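The binning step described above can be sketched as follows. This is a minimal illustration, not the group's actual implementation; the grid parameters and function name are hypothetical, though the half-meter cell size matches the Bushy Run DEM.

```python
import numpy as np

def points_to_dem(points, cell_size=0.5, origin=(0.0, 0.0), shape=(100, 100)):
    """Bin a cloud of (x, y, z) range returns into a DEM grid.

    Each horizontal cell accumulates the mean and standard deviation
    of the z (height) values falling inside it; empty cells stay NaN.
    Grid extent and origin here are illustrative assumptions.
    """
    nx, ny = shape
    mean = np.full(shape, np.nan)
    std = np.full(shape, np.nan)
    # Map each point's horizontal coordinates to integer cell indices.
    ix = ((points[:, 0] - origin[0]) / cell_size).astype(int)
    iy = ((points[:, 1] - origin[1]) / cell_size).astype(int)
    ok = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    pts, ix, iy = points[ok], ix[ok], iy[ok]
    # Compute per-cell height statistics for every occupied cell.
    for i, j in set(zip(ix, iy)):
        z = pts[(ix == i) & (iy == j), 2]
        mean[i, j] = z.mean()
        std[i, j] = z.std()
    return mean, std
```

The standard-deviation grid is useful as a per-cell confidence measure: cells whose returns span a large vertical range (e.g. tree canopies or building edges) stand out from smooth terrain.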
Mosaics. The VSAM IFD team demonstrated coarse registration of a mosaic with a USGS orthophoto, using a projective warp to determine an approximate mapping from mosaic pixels to geographic coordinates. This technology could plausibly lead to automated methods for updating existing orthophoto information using fresh imagery from a recent fly-through. For example, seasonal variations such as fresh snowfall (as in the case of VSAM Demo I) could be integrated into the orthophoto.
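A projective warp of this kind is a 3x3 homography. The sketch below shows the standard direct linear transform (DLT) for estimating such a warp from tie points and applying it to a mosaic pixel; it is a generic illustration under assumed data, not the IFD team's registration code.

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate a 3x3 projective warp H from >= 4 tie-point pairs
    (u, v) -> (x, y) via the direct linear transform (DLT)."""
    A = []
    for (u, v), (x, y) in zip(src, dst):
        A.append([u, v, 1, 0, 0, 0, -x * u, -x * v, -x])
        A.append([0, 0, 0, u, v, 1, -y * u, -y * v, -y])
    # H is the null vector of A, i.e. the last right singular vector.
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    return Vt[-1].reshape(3, 3)

def warp_point(H, u, v):
    """Map a mosaic pixel (u, v) through H to approximate
    geographic coordinates, dividing out the projective scale."""
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w
```

In practice the tie points would come from matched features between the mosaic and the orthophoto, and a robust estimator would be used to reject mismatches.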
VRML models. Using close-range aerial imagery and video from a ground-based handheld camcorder, a VRML model of one of the Bushy Run buildings and its surrounding terrain was created. The model was built using the factorization method developed by Tomasi and Kanade at CMU.
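The core affine step of the factorization method can be sketched in a few lines: a registered measurement matrix of tracked feature coordinates is split by SVD into camera motion and 3D shape. This sketch omits the metric-upgrade step (resolving the affine ambiguity) that the full Tomasi-Kanade method performs.

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization of a registered measurement matrix W
    (2F x P: feature image coordinates over F frames, with each
    frame's centroid subtracted) into motion M (2F x 3) and shape
    S (3 x P), up to an affine ambiguity."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep the three dominant singular values; split sqrt(s)
    # evenly between the motion and shape factors.
    root = np.sqrt(s[:3])
    M = U[:, :3] * root
    S = root[:, None] * Vt[:3]
    return M, S
```

Under noise-free affine imaging, W has rank 3 exactly; with real tracks, truncating to rank 3 gives the least-squares-best motion and shape factors.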
Compact Terrain Data Base (CTDB). Currently, the VSAM testbed system uses a Compact Terrain Data Base (CTDB) model of the campus as its primary site model representation. The CTDB is designed to represent large expanses of terrain within the context of advanced distributed simulation, and has been optimized to efficiently answer geometric queries, such as finding the elevation at a point, in real time. Terrain can be represented either as a grid of elevations or as a Triangulated Irregular Network (TIN), and hybrid databases containing both representations are allowed. The CTDB also represents relevant cartographic features on top of the terrain skin, including buildings, roads, bodies of water, and tree canopies. An important benefit of using CTDB as a site model representation for VSAM processing is that it allows us to easily interface with the synthetic environment simulation and visualization tools provided by ModSAF and ModStealth.
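An elevation-at-a-point query on the gridded representation amounts to bilinear interpolation between the four surrounding grid posts. The sketch below is written in the spirit of such a query, not taken from the CTDB implementation; the grid layout and function name are assumptions.

```python
import numpy as np

def elevation_at(grid, cell, x, y):
    """Bilinearly interpolated elevation at horizontal point (x, y).

    `grid[i, j]` is assumed to hold the elevation at coordinates
    (i * cell, j * cell); (x, y) must fall inside the grid.
    """
    i, j = int(x // cell), int(y // cell)
    fx, fy = x / cell - i, y / cell - j      # fractional cell offsets
    z00, z10 = grid[i, j], grid[i + 1, j]
    z01, z11 = grid[i, j + 1], grid[i + 1, j + 1]
    return (z00 * (1 - fx) * (1 - fy) + z10 * fx * (1 - fy)
            + z01 * (1 - fx) * fy + z11 * fx * fy)
```

A TIN query is analogous but locates the triangle containing (x, y) and interpolates across its plane.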
Spherical Representations.
Everything that can be seen from a stationary camera can be represented
on the surface of a viewing sphere. This is true even if the
camera is allowed to pan and tilt about the focal point, and to zoom
in and out -- the image at any given (pan,tilt,zoom) setting is essentially
a discrete sample of the bundle of light rays impinging on the camera's
focal point. We have used this idea to create spherical mosaics and to
design spherical lookup tables containing the 3D locations and surface
material types of the points of intersection of camera viewing rays with
the scene.
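Each entry in such a spherical lookup table is indexed by a viewing direction, which a pan/tilt setting determines directly. The conversion can be sketched as below; the axis convention is an assumption for illustration, not taken from the report.

```python
import numpy as np

def ray_direction(pan, tilt):
    """Unit viewing ray on the sphere for a pan/tilt camera.

    Assumed convention: pan is azimuth about the vertical (z) axis,
    tilt is elevation above the horizontal plane, and the optical
    axis at (pan, tilt) = (0, 0) points along +x. Angles in radians.
    """
    cp, sp = np.cos(pan), np.sin(pan)
    ct, st = np.cos(tilt), np.sin(tilt)
    return np.array([cp * ct, sp * ct, st])
```

Quantizing (pan, tilt) onto a fixed angular grid then gives the index into the spherical table of 3D intersection points and material types.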
We have developed methods for fitting a projection model, consisting of intrinsic (lens) and extrinsic (pose) parameters, to a camera with active pan, tilt and zoom control. Intrinsic parameters are calibrated by fitting parametric models to the optic flow induced by rotating and zooming the camera. These calibration procedures are fully automatic and do not require precise knowledge of 3D scene structure. Extrinsic parameters are calculated by sighting a sparse set of measured landmarks in the scene. Actively rotating the camera to measure landmarks over a virtual hemispherical field of view leads to a well-conditioned exterior orientation estimation problem. Details of the calibration procedures are presented in Collins and Tsin, 1999.
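The intuition behind flow-based intrinsic calibration can be shown with a toy first-order case: for a small pure pan of dtheta radians, a point near the image center shifts horizontally by roughly f * dtheta pixels, so the focal length falls out of the ratio. The actual procedure fits full parametric models to the whole flow field; this one-liner is only an illustration.

```python
def focal_from_pan(du, dtheta):
    """Toy first-order focal-length estimate (in pixels) from the
    horizontal image displacement du (pixels) of a feature near the
    image center, induced by a small pure pan dtheta (radians),
    using the approximation du ~= f * dtheta."""
    return du / dtheta
```

For example, a feature that shifts 5 pixels under a 0.005-radian pan implies a focal length of about 1000 pixels.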
In regions where multiple sensor
viewpoints overlap, object locations can be determined by stereo triangulation.
However, regions of the scene that can be simultaneously viewed by multiple
sensors are likely to be a small percentage of the total area in real outdoor
surveillance applications, where it is desirable to maximize coverage of
a large area using finite sensor resources. Determining object
locations from a single sensor requires domain constraints, in this case
the assumption that the object is in contact with the terrain. This
contact location is estimated by passing a viewing ray through the bottom
of the object in the image and intersecting it with a model representing
the terrain. Sequences of location estimates over time are then assembled
into consistent object trajectories. For more details, see Collins,
Tsin, Miller and Lipton, 1998.
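The ray-terrain intersection described above can be sketched by marching along the viewing ray until it first drops below the terrain surface. This is a simplified illustration under an assumed gridded terrain with nearest-cell lookup; the testbed queries the CTDB for elevations.

```python
import numpy as np

def geolocate(cam_pos, ray, grid, cell, step=0.1, max_range=500.0):
    """Estimate an object's ground-contact point by intersecting a
    viewing ray (cast through the bottom of the object in the image)
    with a terrain model.

    `ray` is a unit direction from the camera; `grid[i, j]` is an
    assumed elevation sample at (i * cell, j * cell). Returns the
    first point where the ray passes below the terrain, or None.
    """
    def terrain_z(x, y):
        # Nearest-cell elevation lookup, for brevity.
        return grid[int(x // cell), int(y // cell)]

    t = 0.0
    while t < max_range:
        p = cam_pos + t * ray
        if p[2] <= terrain_z(p[0], p[1]):
            return p            # first crossing = estimated contact point
        t += step
    return None
```

A production version would interpolate the exact crossing between steps and handle rays that leave the modeled area; sequences of these per-frame estimates are then linked into object trajectories.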
We have evaluated geolocation accuracy for two cameras (PRB and Wean) on the CMU campus, using a Leica laser-tracking theodolite to generate ground truth. The experiment was run by having a person carry the theodolite prism for two loops around the PRB parking lot, while the system logged time-stamped horizontal (X,Y) positions measured by the theodolite. The system also simultaneously tracked the person using the PRB and Wean cameras, while logging time-stamped geolocation estimates from each camera. Standard deviations of geolocation estimates from each camera are roughly on the order of 0.6 meters along the axis of maximum spread, and roughly 0.25 meters along the axis of minimum spread. The axis of maximum error for each camera is oriented along the direction vector from the camera to the object being observed. For more details, see Collins et al., 2000.