Natural Landmark Based Navigation

Visual Tracking

Updating the feature template


Current Work:

We are currently working on the problem of robustly tracking a natural feature in a stream of images. See the old page for a quick overview of the approach.

Template Update:

The current task is to determine how to update the template during matching and searching. The problem is that as the camera moves relative to the feature being tracked, the appearance of the feature changes. To track the feature robustly, then, we need to update our internal representation of it.

Original Instance

A naive approach might be to keep the original template first acquired from the camera image in Frame 0. The template used in Frame i, then, is given by

T(i) = T(0).

This is not a good solution because we know that the appearance of the feature will change. The template ends up being matched against an erroneous part of the image and is lost. Figure 1 shows tracking using the original instance of the feature.
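For concreteness, a minimal sketch of the matching step might look like the following. It assumes a simple exhaustive normalized-correlation search over a square window; the function names and the search strategy are illustrative only, not the actual implementation. Under the Original Instance rule, the template passed in is always T(0).

    import numpy as np

    def normalized(patch):
        """Zero-mean, unit-norm copy of a patch, for normalized correlation."""
        p = patch.astype(float) - patch.mean()
        n = np.linalg.norm(p)
        return p / n if n > 0 else p

    def find_best_match(frame, template, center, radius):
        """Search a (2*radius+1)^2 window of candidate locations around `center`
        for the patch that correlates best with `template`.
        Returns the best location, the patch M(i) found there, and its score."""
        h, w = template.shape
        t = normalized(template)
        best_score, best_loc, best_patch = -np.inf, center, None
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = center[0] + dy, center[1] + dx
                if y < 0 or x < 0 or y + h > frame.shape[0] or x + w > frame.shape[1]:
                    continue                      # candidate window falls off the image
                patch = frame[y:y + h, x:x + w]
                score = float(np.sum(normalized(patch) * t))
                if score > best_score:
                    best_score, best_loc, best_patch = score, (y, x), patch
        return best_loc, best_patch, best_score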

NOTE: In Figure 1 (and the other animated figures in this document), the red box bounds the space over which the search is executed for the best match M(i) to the template T(i). The yellow box bounds M(i).

Figure 1: Tracking the original instance of the feature.

Newest Match

Another approach might be to throw out the old template at every step, replacing it with the latest instance of the feature. That way, we can expect that the internal representation of the feature looks most like what it will look like in the next frame. The update rule looks like

T(i+1) = M(i),

where M(i) is the best match to T(i) in Frame i. Figure 2 shows tracking using the latest instance of the feature.

Figure 2: Tracking using the latest instance of the feature.

This approach works better than the first, but problems arise due to subpixel shifts of the feature on the image plane. If the feature is acquired at location (x,y) in Frame i, then in Frame i+1 it may appear at (x+u,y+v), where u and v are not necessarily integers. These subpixel inaccuracies accumulate over a series of frames into a drift of more than a few pixels. The fractional parts of u and v are uncorrelated from frame to frame, and matching accounts only for the integer parts, so the effect is that the feature takes a random walk through template space. It would be advantageous to keep some information about what the feature used to look like, rather than throwing it away.
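To see why this behaves like a random walk, here is a small, purely illustrative simulation (it uses synthetic numbers, not the actual image data): each frame contributes an independent subpixel error, and the accumulated drift grows roughly like the square root of the number of frames.

    import numpy as np

    # Purely illustrative: each frame the stored match is off by an independent
    # subpixel error in [-0.5, 0.5); re-acquiring the template every frame lets
    # these errors accumulate, so the RMS drift grows roughly like sqrt(n).
    rng = np.random.default_rng(0)
    n_frames, n_trials = 100, 1000
    errors = rng.uniform(-0.5, 0.5, size=(n_trials, n_frames))
    drift = np.cumsum(errors, axis=1)            # accumulated drift per trial
    rms = np.sqrt((drift ** 2).mean(axis=0))     # RMS drift vs. frame number
    print(f"RMS drift after {n_frames} frames: {rms[-1]:.2f} pixels")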

Affine Transformation

A clever approach has been taken by several researchers (Amidi, Hager) in which T(i+1) is given by applying an affine transformation A(i) to the originally acquired template, as in

T(i+1) = A(i) * T(0).

The template used in the i+1 frame should be as similar as possible to the best match in Frame i, M(i). Therefore, we want A(i) such that

M(i) - A(i) * T(0) = 0

The affine transformation is assumed to change only slightly from frame to frame, so it is built iteratively, updated after the matching step in each frame, that is

A(i) = A(i-1) + E(i),

where E(i) is the error between the current affine mapping and the estimated correct affine mapping which produces M(i). This works quite well for 2-D objects and orthographic projection (as in tracking the ground from an autonomous helicopter), but for nearby, 3-D, irregularly shaped objects and a perspective camera (as we have here) the approach does not work well. A natural feature does not in general look like an affine transformation of the original appearance of that feature.
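In outline, such an update could be sketched as below. This is only a sketch of the form of the rule, not the estimation scheme used by those researchers; the warp convention and the correction terms E_A, E_t are assumptions for illustration.

    import numpy as np
    from scipy.ndimage import affine_transform

    def warp_template(T0, A, t):
        """Warp the original template T0 by the 2x2 matrix A and offset t.
        affine_transform maps each output pixel o to the input location A @ o + t."""
        return affine_transform(T0.astype(float), A, offset=t, order=1, mode="nearest")

    def update_affine(A, t, E_A, E_t):
        """A(i) = A(i-1) + E(i): add the small correction estimated from the
        residual between M(i) and the currently warped template."""
        return A + E_A, t + E_t

    # T(i+1) = A(i) * T(0), starting from the identity transformation:
    # A, t = np.eye(2), np.zeros(2)
    # A, t = update_affine(A, t, E_A, E_t)
    # T_next = warp_template(T0, A, t)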

Temporal Filtering

The approach I am currently investigating is the use of a temporal filter to update the template at each step. The idea is to use a fixed-coefficient IIR filter on each pixel of the template. The template T(i+1) is built by taking a weighted average of the old template T(i) and the best match M(i), as in

T(i+1) = a * T(i) + (1-a) * M(i)

This way, the update can take into account any distortions, including affine transformations. For a=1, the update rule is equivalent to

T(i+1) = T(0),

which was the first approach discussed. For a=0, the update rule is equivalent to

T(i+1) = M(i),

which was the second approach discussed. Empirically, for values of a near but less than unity (say a=0.9), tracking can work quite well. The incorporation of new information allows distortions of the feature to be accounted for, while the retention of old information reduces the effects of drift. Figure 3 shows tracking using the weighted average update rule with a=0.9.
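A sketch of this update rule is below; it is a one-line, per-pixel IIR filter. The commented usage lines reuse the hypothetical find_best_match sketch from earlier, and the search radius shown is arbitrary.

    import numpy as np

    def update_template(T_i, M_i, a=0.9):
        """Fixed-coefficient IIR update: T(i+1) = a*T(i) + (1-a)*M(i).
        a = 1 keeps the original template forever; a = 0 keeps only the newest match."""
        return a * T_i.astype(float) + (1.0 - a) * M_i.astype(float)

    # One step of the tracking loop, reusing the earlier matching sketch:
    # loc, M_i, score = find_best_match(frame, T_i, loc, radius=16)
    # T_i = update_template(T_i, M_i, a=0.9)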

Figure 3: Tracking using a weighted average template update rule.

Evaluating Performance:

How well does the temporal averaging work? Three figures of merit are proposed to evaluate the performance: Correlation, Feature Retention and Drift.

Correlation:
One figure of merit is the value of the normalized correlation between the template and the best match,

< T'(i), M'(i) >,

where T'(i) and M'(i) are T(i) and M(i), respectively, after normalization. Figure 4 shows the normalized correlation between the template and the best match for 100 frames using the Original Template rule (green), Newest Match rule (blue), and the Filtered Template rule (red). The Original Template curve stops at Frame 35 because the feature was lost around Frame 31 and by Frame 35, the search space was off of the image plane. The Newest Match correlation is slightly higher than the Filtered Template correlation on average, which is to be expected.
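As a sketch, this figure of merit could be computed for a single frame as follows (the function name is illustrative):

    import numpy as np

    def normalized_correlation(T, M):
        """Normalized correlation < T'(i), M'(i) >: each patch is shifted to
        zero mean and scaled to unit norm before taking the inner product,
        so the score lies in [-1, 1]."""
        Tn = T.astype(float) - T.mean()
        Mn = M.astype(float) - M.mean()
        return float(np.sum(Tn * Mn) / (np.linalg.norm(Tn) * np.linalg.norm(Mn)))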

This experiment was repeated for several features in this video sequence and other sequences and similar results were obtained.

Figure 4: Comparison of correlation between the template T(i) and the best match M(i) over 100 frames.

Feature Retention:
As Figure 1 shows, the feature is only successfully tracked for about 31 frames before the template is matched against some region of the search space other than the feature we originally wanted. This is not an effect of drift: the feature is tracked reliably for 31 frames and then suddenly jumps far away from the original feature. Nor is this number of frames a magic number. Depending on the feature and the camera motion, it may be possible to track a feature for a very long time without updating the template: consider tracking a perfect sphere, with no occlusions, where the distance between the camera and the sphere does not change. It will always appear the same! Then again, it may also be possible to lose the feature much faster: consider tracking a very irregularly shaped 3-D object from a quickly moving train.

For the Newest Match rule and for the Temporal Filter rule, the feature is retained well throughout the sequence.

Drift:
Figure 5 shows, for two update rules, the match in Frame 99 that is identified as the feature originally acquired in Frame 0. The best match for the Temporal Filter update appears much closer to the original feature than for the Newest Match update rule.

Figure 5: Original feature in Frame 0 (Top center). Tracking results in Frame 99 using Newest Match rule (Bottom left) and Temporal Filter rule (Bottom right).

Notice in Figure 5 that the original feature selected contains the upper left corner of a rock and a section of a piece of wood adjacent to it. The Temporal Filter rule yields a best match M(99) which contains the same corner of the same rock and the adjacent part of the piece of wood. The Newest Match rule yields an accumulated drift of about 10 pixels away from that original feature. This experiment was repeated for several features in this sequence and others and yielded similar results.

Conclusions

Temporal averaging appears to work well for updating the appearance of a feature of interest. Preliminary results indicate that this method has better performance than Original Template or Newest Match rules, based on correlation, feature retention, and drift.

The temporal averaging chosen here, using a fixed-coefficient filter, has advantages over other proposed methods beyond the performance as measured by the three figures of merit discussed here.

Future Plans:

  1. Determine useful figures of merit for tracking, other than the correlation, retention, and drift.
  2. Come up with a more effective way of measuring figures of merit, whether they are those discussed here (correlation, retention, drift) or new ones.
  3. Determine a way to find the best value of a to use for tracking. Hopefully this will be a more-or-less straightforward evaluation of figures of merit for different values of a.
  4. Derive position estimates from tracking measurements. Use the position estimates derived for tracked features to produce an estimate of the rover position.

The individual frames of each video sequence can be found here:
This page is maintained by Matthew Deans, a robograd in the Carnegie Mellon University School of Computer Science.
Comments? Questions? Mail me at deano@ri.cmu.edu
Last Modified March 5, 1997