Certain properties of objects are visible: for example, position, orientation, shape, size, colour, texture, and so forth. Relational variants of these properties are also visible, as are changes in such properties and relations over time. In contrast, force-dynamic properties and relations are not visible. One cannot see the fact that the door knob is attached to, and supported by, the door. One must infer that fact using physical knowledge of the world. Such knowledge includes the fact that unsupported objects fall and that attachment is one way of offering support. Using physical knowledge to infer force-dynamic properties and relations was first discussed by [49, 50, 51]. This later became known as the perceiver framework advanced by [29]. The perceiver framework states that perception involves four levels. First, one must specify the observables: the properties and relations that can be discerned by direct observation. Second, one must specify an ontology: the properties and relations that must be inferred from the observables. Descriptions of the observables in terms of such properties and relations are called interpretations. There may be multiple interpretations of a given observation. Third, one must specify a theory: a way of differentiating consistent interpretations from inconsistent ones. The consistent interpretations are the models of the observation. There may be multiple models of a given observation. Finally, one must specify a preference relation: a way of ordering the models. The most-preferred models of the observations are the percepts. One can instantiate the perceiver framework for different observables, ontologies, theories, and preference relations. [49, 50, 51, 52, 53, 55] instantiated this framework for a kinematic theory applied to simulated video. [34, 35] and [33] instantiated this framework for a dynamics theory applied to real video. [57] instantiated this framework for a kinematic theory applied to real video.
This paper uses the latter approach.
The input to the model-reconstruction process consists of a sequence of
scenes, each scene being a set of convex polygons.
Each polygon is represented as a sequence of points corresponding to a
clockwise traversal of the polygon's vertices.
The tracker guarantees that each scene contains the same number of polygons
and that they are ordered so that the polygon at a given position in each
scene corresponds to the same object.
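The input format described above can be sketched as plain Python types with a validity check. This is only an illustrative sketch, not the tracker's actual interface: the names `Polygon`, `Scene`, and `validate_scene_sequence` are hypothetical, and the clockwise test assumes a coordinate system whose y axis points up (in image coordinates with y pointing down, the sign flips).

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]   # vertices listed in clockwise order
Scene = List[Polygon]   # position i holds the polygon for object i


def signed_area(poly: Polygon) -> float:
    """Shoelace formula; negative for clockwise order when y points up."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_clockwise(poly: Polygon) -> bool:
    # Assumption: y-up coordinates.
    return signed_area(poly) < 0


def is_convex(poly: Polygon) -> bool:
    """All turns go the same way (assumes a simple, non-self-intersecting polygon)."""
    n = len(poly)
    if n < 3:
        return False
    signs = set()
    for i in range(n):
        (ax, ay), (bx, by), (cx, cy) = poly[i], poly[(i + 1) % n], poly[(i + 2) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:  # collinear vertices are tolerated
            signs.add(cross > 0)
    return len(signs) == 1


def validate_scene_sequence(scenes: List[Scene]) -> bool:
    """Check that every scene has the same polygon count and well-formed polygons."""
    if not scenes:
        return True
    count = len(scenes[0])
    return all(
        len(scene) == count and all(is_convex(p) and is_clockwise(p) for p in scene)
        for scene in scenes
    )
```

The check cannot confirm that position i really tracks the same object across scenes; that correspondence is a guarantee supplied by the tracker, not something recoverable from the polygons alone.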
The output of the model-reconstruction process consists of a sequence of
interpretations, one interpretation per scene.
The interpretations are formulated out of the following primitive properties
of, and relations between, the objects in each scene.
An interpretation I is a 4-tuple:
$\langle \textsc{grounded}, \textsc{rigid}, \textsc{revolute}, \textsc{samelayer} \rangle$.
Throughout this paper, interpretations will be depicted graphically, overlaid
on scene images, for ease of comprehension.
Figure 7 gives a sample interpretation depicted
graphically.
A ground symbol attached to a polygon indicates that it is
grounded.
A solid circle indicates that two polygons are rigidly attached at the center
of the circle.
A hollow circle indicates that two polygons are attached by a revolute joint
at the center of the circle.
The same-layer relation is indicated by giving a layer index, a
small nonnegative integer, to each polygon.
Polygons with the same layer index are on the same layer, while those with
different layer indices are on different layers.
Figure: The graphical method for depicting interpretations that is used in
this paper.
A ground symbol indicates that a polygon is
grounded.
A solid circle indicates a rigid joint.
A hollow circle indicates a revolute joint.
Two polygons with the same layer index are on the same layer.
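One possible in-memory form of such an interpretation is sketched below. It is a sketch under stated assumptions rather than the representation used by the system: the class name `Interpretation`, the helper `joint`, and the choice to store joints as (polygon, polygon, centre) triples and layers as a tuple of indices are all hypothetical. The example scene (a wall, a door, and a door knob) and its coordinates are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Point = Tuple[float, float]
Joint = Tuple[int, int, Point]  # (polygon index, polygon index, joint centre)


def joint(p: int, q: int, at: Point) -> Joint:
    """Build a joint with the polygon indices in canonical (sorted) order."""
    return (min(p, q), max(p, q), at)


@dataclass(frozen=True)
class Interpretation:
    grounded: FrozenSet[int]    # indices of grounded polygons
    rigid: FrozenSet[Joint]     # rigid attachments (solid circles)
    revolute: FrozenSet[Joint]  # revolute joints (hollow circles)
    layer: Tuple[int, ...]      # layer[i] is the layer index of polygon i

    def is_grounded(self, p: int) -> bool:
        return p in self.grounded

    def same_layer(self, p: int, q: int) -> bool:
        return self.layer[p] == self.layer[q]

    def attached(self, p: int, q: int) -> bool:
        """True if any rigid or revolute joint connects polygons p and q."""
        pair = (min(p, q), max(p, q))
        return any((a, b) == pair for a, b, _ in self.rigid | self.revolute)


# Hypothetical scene: polygon 0 is a wall, 1 is a door, 2 is a door knob.
example = Interpretation(
    grounded=frozenset({0}),
    rigid=frozenset({joint(2, 1, (5.0, 3.0))}),     # knob fixed to door
    revolute=frozenset({joint(0, 1, (1.0, 3.0))}),  # door hinged on wall
    layer=(0, 0, 0),
)
```

With this layout, the output of model reconstruction would simply be a list of `Interpretation` values, one per scene, indexed in step with the scene sequence.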
Model reconstruction can be viewed as a generate-and-test process.
Initially, all possible interpretations are generated for each scene.
Then, inadmissible and unstable interpretations are filtered out.
Admissibility and stability can be collectively viewed as a consistency
requirement.
The stable admissible interpretations are thus models of a scene.
The nature of the theory guarantees that there will always be at least one
model for each scene, namely the model where all objects are grounded.
There may, however, be multiple models for a given scene.
Therefore, a preference relation is applied through a sequence of
circumscription processes [38] to select the minimal, or
preferred, models for each scene.
While there will always be at least one minimal model for each scene, there
may be several, since the preference relation may not induce a total order.
If there are multiple minimal models for a given scene, one is chosen
arbitrarily as the most-preferred model for that scene.
The precise details of the admissibility criteria, the stability checking
algorithm, the preference relations, and the circumscription process are
beyond the scope of this paper.
They are discussed in [57].
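The overall generate-and-test shape can nevertheless be sketched independently of those details. The following is a toy illustration only: candidates are reduced to sets of grounded polygon indices, `toy_is_consistent` is an invented stand-in for the admissibility and stability checks, and preferring a proper subset of grounded polygons is an assumed stand-in for the actual preference relations and circumscription process of [57].

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def minimal_models(
    candidates: Iterable[T],
    is_consistent: Callable[[T], bool],
    strictly_preferred: Callable[[T, T], bool],
) -> List[T]:
    """Filter candidates to models, then keep those no other model beats."""
    models = [c for c in candidates if is_consistent(c)]
    return [
        m for m in models
        if not any(strictly_preferred(other, m) for other in models)
    ]


def all_grounded_sets(n: int) -> Iterator[FrozenSet[int]]:
    """Generate every subset of n polygon indices as a candidate grounded set."""
    for k in range(n + 1):
        for combo in combinations(range(n), k):
            yield frozenset(combo)


def toy_is_consistent(grounded: FrozenSet[int]) -> bool:
    # Invented rule: the scene holds together if polygon 0 or polygon 1 is grounded.
    return 0 in grounded or 1 in grounded


def fewer_grounded(a: FrozenSet[int], b: FrozenSet[int]) -> bool:
    # Assumed preference: a is strictly preferred to b if a is a proper subset of b.
    return a < b


minimal = minimal_models(all_grounded_sets(3), toy_is_consistent, fewer_grounded)
chosen = minimal[0]  # with several minimal models, one is picked arbitrarily
```

In this toy setting the all-grounded candidate is consistent, so at least one model exists, yet it is not minimal. The two minimal models, one grounding only polygon 0 and one grounding only polygon 1, are incomparable under the subset ordering, which is the situation that arises when the preference relation does not induce a total order.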
What is important, for the purpose of this paper, is that, given a scene
sequence, model reconstruction produces a sequence of interpretations, one for
each scene, and that these interpretations are 4-tuples containing the
predicates $\textsc{grounded}$, $\textsc{rigid}$, $\textsc{revolute}$, and
$\textsc{samelayer}$.
Figure 4 shows sample interpretation sequences
produced by the model-reconstruction component on the scene sequences from
Figure 3.