Once the sensor is pointing in the
right direction at the right zoom factor, all moving objects extracted
are compared to the specific object of interest to see if they match.
The ability to re-acquire a specific object is a key requirement for
multi-camera cooperative surveillance. Viewpoint-specific appearance
criteria are of little use here, since the new view of the object may
differ significantly from the previous one. Recognition features are
therefore needed that are independent of viewpoint. In our work
we use two such criteria: the object's 3D scene trajectory as determined
from geolocation, and a normalized color histogram of the object's image
region.
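To illustrate the second criterion, here is a minimal sketch in Python of a size-invariant normalized color histogram and a histogram-intersection match test. The bin count, the RGB color space, and the match threshold are illustrative assumptions; the source states only that the histogram is normalized over the object's image region.

```python
import numpy as np

def normalized_color_histogram(region, bins=8):
    """Normalized RGB histogram of an object's image region.

    region: H x W x 3 uint8 array of pixels belonging to the object.
    Normalizing to sum to 1 makes the feature independent of the
    object's size in the image, and hence of zoom and distance.
    """
    pixels = region.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return float(np.minimum(h1, h2).sum())

def matches_target(candidate_region, target_hist, threshold=0.7):
    """Re-acquisition test: does a newly extracted moving object
    match the stored histogram of the object of interest?"""
    h = normalized_color_histogram(candidate_region)
    return histogram_intersection(h, target_hist) >= threshold
```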
Slaving is a relatively simple exercise if both cameras are calibrated with respect to a local 3D terrain model. We have shown that a tracked object's 3D geolocation can be determined to reasonable accuracy (roughly 1 meter of error for a person 50 meters away) by intersecting backprojected viewing rays with the terrain. After estimating the 3D location of a person from the first camera's viewpoint, it is an easy matter to transform the location into a pan-tilt command to control the second camera. The figure below shows an example of camera slaving. A person has been detected automatically in the wide-angle view shown in the left image, and a second camera has been tasked to move slightly ahead of the person's estimated 3D trajectory, as shown in the right image.
Example of multi-camera slaving -- tracking a person.
Example of multi-camera slaving -- tracking a vehicle.
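As an illustration of the two computations involved, the sketch below marches a backprojected viewing ray across a terrain height function until it meets the surface, then converts the resulting 3D point into pan/tilt angles for the second camera. The ray-marching scheme, the step size, the terrain_height interface, and all names are illustrative assumptions, not the system's actual implementation.

```python
import numpy as np

def geolocate(camera_pos, ray_dir, terrain_height, step=0.5, max_range=500.0):
    """Estimate an object's 3D geolocation by intersecting the
    backprojected viewing ray with the terrain model.

    camera_pos:     3D camera position (x, y, z) in scene coordinates.
    ray_dir:        unit vector along the viewing ray through the
                    object's image location.
    terrain_height: function (x, y) -> z sampling the 3D site model.
    Marches along the ray and returns the first point at or below
    the terrain surface, or None if nothing is hit within range.
    """
    for t in np.arange(0.0, max_range, step):
        p = camera_pos + t * ray_dir
        if p[2] <= terrain_height(p[0], p[1]):
            return p
    return None

def pan_tilt_command(slave_pos, target):
    """Pan/tilt angles (radians) pointing the slave camera at a 3D
    target point. Assumes pan is measured in the ground (x, y) plane
    and tilt from horizontal; a real pan-tilt head would also fold
    in its own mounting orientation."""
    d = np.asarray(target) - np.asarray(slave_pos)
    pan = np.arctan2(d[1], d[0])
    tilt = np.arctan2(d[2], np.hypot(d[0], d[1]))
    return pan, tilt
```

To move slightly ahead of the person, as in the figure, the command would be computed from a point extrapolated along the estimated 3D trajectory rather than from the current location.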
For cameras located far apart geographically, this approach clearly demands very good camera calibration and an accurate 3D site model. We have therefore also developed a sensor slaving method for closely located cameras that requires only image-based computations: no geolocation computation and no extrinsic camera calibration. Intrinsic parameters are needed only by the slave camera, which must determine the pan/tilt angles that point it towards each pixel in its image. The basic idea is to form a mosaic by warping the master camera view into the pixel coordinate system of the slave camera view. Image trajectories of objects detected in the master view can then be transformed into trajectories overlaid on the slave view, and the slave camera computes the pan/tilt angles necessary to keep the object within its zoomed field of view.
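A minimal sketch of this image-based scheme, assuming the master-to-slave warp used to build the mosaic is a single 3x3 homography H (a reasonable model when the cameras are close together) and that the slave's intrinsics are a pinhole focal length and principal point. H, the intrinsic values, and all names are placeholders that would come from a hypothetical offline registration and calibration step.

```python
import numpy as np

def master_to_slave(pt, H):
    """Warp a pixel trajectory point from the master view into the
    slave view's pixel coordinate system via the mosaic homography."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return np.array([x / w, y / w])

def pixel_to_pan_tilt(pixel, fx, fy, cx, cy):
    """Pan/tilt angles (radians) that center the slave camera on a
    given pixel, using only the slave's intrinsic parameters
    (focal lengths fx, fy and principal point cx, cy)."""
    pan = np.arctan2(pixel[0] - cx, fx)
    tilt = np.arctan2(cy - pixel[1], fy)  # image y axis points down
    return pan, tilt

def slave_on_trajectory(trajectory, H, fx, fy, cx, cy):
    """Convert a master-view image trajectory into the sequence of
    pan/tilt commands that keep the object in the slave's view."""
    return [pixel_to_pan_tilt(master_to_slave(p, H), fx, fy, cx, cy)
            for p in trajectory]
```

Note that computing pan and tilt independently about the optical center is a small-angle approximation; for pixels far from the image center, the full pan-tilt rotation geometry would be needed.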