
Model Reconstruction

Certain properties of objects are visible, for example position, orientation, shape, size, colour, texture, and so forth. Relational variants of these properties are also visible, as are changes in such properties and relations over time. In contrast, force-dynamic properties and relations are not visible. One cannot see the fact that the door knob is attached to, and supported by, the door. One must infer that fact using physical knowledge of the world. Such knowledge includes the fact that unsupported objects fall and that attachment is one way of offering support.

Using physical knowledge to infer force-dynamic properties and relations was first discussed by [49, 50, 51]. This approach later became known as the perceiver framework, advanced by [29]. The perceiver framework states that perception involves four levels. First, one must specify the observables: the properties and relations that can be discerned by direct observation. Second, one must specify an ontology: the properties and relations that must be inferred from the observables. Descriptions of the observables in terms of such properties and relations are called interpretations. There may be multiple interpretations of a given observation. Third, one must specify a theory: a way of differentiating consistent interpretations from inconsistent ones. The consistent interpretations are the models of the observation. There may be multiple models of a given observation. Finally, one must specify a preference relation: a way of ordering the models. The most-preferred models of the observations are the percepts.

One can instantiate the perceiver framework for different observables, ontologies, theories, and preference relations. [49, 50, 51, 52, 53, 55] instantiated this framework for a kinematic theory applied to simulated video. [34, 35] and [33] instantiated this framework for a dynamics theory applied to real video. [57] instantiated this framework for a kinematic theory applied to real video. This paper uses this latter approach.
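The four levels of the perceiver framework can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function `percepts`, the three candidate facts about the door and the knob, the toy theory, and the subset-based preference are hypothetical and are not the theory or preference relation used in [57]. The observable is taken to be that the knob is not resting on anything.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

Interp = FrozenSet[str]  # an interpretation: a set of inferred facts

def percepts(interpretations: Iterable[Interp],
             is_consistent: Callable[[Interp], bool],
             strictly_preferred: Callable[[Interp, Interp], bool]) -> List[Interp]:
    """Filter interpretations down to models, then keep the most-preferred models."""
    models = [i for i in interpretations if is_consistent(i)]
    return [m for m in models
            if not any(strictly_preferred(other, m) for other in models)]

# Ontology (hypothetical): facts that cannot be seen and must be inferred.
FACTS = ["door grounded", "knob grounded", "knob attached to door"]

def all_interpretations() -> List[Interp]:
    """Every subset of the candidate facts is one interpretation."""
    return [frozenset(c)
            for k in range(len(FACTS) + 1)
            for c in combinations(FACTS, k)]

def toy_theory(i: Interp) -> bool:
    """Toy consistency check: unsupported objects fall, so each must be supported."""
    door_ok = "door grounded" in i
    knob_ok = "knob grounded" in i or (door_ok and "knob attached to door" in i)
    return door_ok and knob_ok

def fewer_assumptions(a: Interp, b: Interp) -> bool:
    """Toy preference: a is preferred to b when a's facts are a strict subset of b's."""
    return a < b

result = percepts(all_interpretations(), toy_theory, fewer_assumptions)
```

Under these toy assumptions, three of the eight interpretations are models, and two of them are most preferred: one grounds both objects, the other grounds the door and attaches the knob to it. A preference that is only a partial order can therefore leave several percepts.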

The input to the model-reconstruction process consists of a sequence of scenes, each scene being a set of convex polygons. Each polygon is represented as a sequence of points corresponding to a clockwise traversal of the polygon's vertices. The tracker guarantees that each scene contains the same number of polygons and that they are ordered so that the polygon at a given position in each scene corresponds to the same object. The output of the model-reconstruction process consists of a sequence of interpretations, one interpretation per scene. The interpretations are formulated out of the following primitive properties of, and relations between, the objects in each scene.
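The input format described above can be sketched as a small data structure with validity checks. The type aliases and function names are assumptions made for illustration, as is the coordinate convention: the y-axis is taken to point up, so a clockwise traversal has negative signed area (in image coordinates with y pointing down, the sign flips). The convexity test assumes a simple, non-self-intersecting polygon.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in traversal order
Scene = List[Polygon]  # polygon i denotes the same object in every scene

def signed_area(poly: Polygon) -> float:
    """Shoelace formula; negative for clockwise order when the y-axis points up."""
    edges = zip(poly, poly[1:] + poly[:1])
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in edges)

def is_clockwise(poly: Polygon) -> bool:
    return signed_area(poly) < 0

def is_convex(poly: Polygon) -> bool:
    """All non-zero turns along the boundary must have the same direction."""
    n = len(poly)
    turns = set()
    for i in range(n):
        (ax, ay), (bx, by), (cx, cy) = poly[i], poly[(i + 1) % n], poly[(i + 2) % n]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross != 0:
            turns.add(cross > 0)
    return n >= 3 and len(turns) == 1

def is_valid_sequence(scenes: List[Scene]) -> bool:
    """Same polygon count in every scene; every polygon convex and clockwise."""
    same_count = len({len(scene) for scene in scenes}) <= 1
    return same_count and all(is_convex(p) and is_clockwise(p)
                              for scene in scenes for p in scene)

square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]  # clockwise, y-axis up
shifted = [(x + 0.1, y) for (x, y) in square]               # same object, next scene
notched = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0), (2.0, 2.0), (2.0, 0.0)]  # not convex
```

Here `is_valid_sequence([[square], [shifted]])` holds, while a sequence containing `notched`, a counterclockwise polygon, or scenes with differing polygon counts is rejected.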

grounded(p): Polygon p is grounded. It is constrained to occupy a fixed position and orientation by an unseen mechanism that is not associated with any visible object and thus cannot move either translationally or rotationally.
rigid(p, q, r): Polygons p and q are attached by a rigid joint at point r. Both the relative position and orientation of p and q are constrained.
revolute(p, q, r): Polygons p and q are attached by a revolute joint at point r. The relative position of p and q is constrained but the relative orientation is not.
samelayer(p, q): Polygons p and q are on the same layer. Layers are a qualitative representation of depth, or distance from the observer. This representation is impoverished. There is no notion of 'in-front-of' or 'behind' and there is no notion of adjacency in depth. The only representable notion is whether two objects are on the same or different layers. The same-layer relation is constrained to be an equivalence relation, i.e. it must be reflexive, symmetric, and transitive. Furthermore, two objects on the same layer must obey the substantiality constraint, the constraint that they not interpenetrate [59, 8, 6, 7, 60, 61].
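One way to test the substantiality constraint for two convex polygons on the same layer is the separating axis theorem: two convex polygons fail to overlap exactly when their projections onto some edge normal are disjoint. The sketch below is an assumption made for illustration and is not claimed to be the test used in [57]; the function names and the tolerance `eps` are hypothetical. Polygons that merely touch are treated as not interpenetrating, so contact is allowed.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]

def edge_normals(poly: Polygon) -> List[Point]:
    """One (unnormalised) normal vector per edge of the polygon."""
    normals = []
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        normals.append((-(y2 - y1), x2 - x1))
    return normals

def project(poly: Polygon, axis: Point) -> Tuple[float, float]:
    """Interval covered by the polygon when projected onto the axis."""
    dots = [x * axis[0] + y * axis[1] for (x, y) in poly]
    return min(dots), max(dots)

def interpenetrate(p: Polygon, q: Polygon, eps: float = 1e-9) -> bool:
    """True when convex polygons p and q overlap in more than a boundary contact."""
    for axis in edge_normals(p) + edge_normals(q):
        pmin, pmax = project(p, axis)
        qmin, qmax = project(q, axis)
        if pmax <= qmin + eps or qmax <= pmin + eps:
            return False  # a separating axis exists (touching counts as separate)
    return True

unit = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
overlapping = [(0.5, 0.5), (0.5, 1.5), (1.5, 1.5), (1.5, 0.5)]
touching = [(1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 0.0)]
```

With these shapes, `interpenetrate(unit, overlapping)` holds, so the two could not share a layer, whereas `unit` and `touching` share only an edge and could.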

An interpretation I is a 4-tuple: ⟨grounded, rigid, revolute, samelayer⟩. Throughout this paper, interpretations will be depicted graphically, overlaid on scene images, for ease of comprehension. Figure 7 gives a sample interpretation depicted graphically. The grounding symbol attached to a polygon indicates that it is grounded. A solid circle indicates that two polygons are rigidly attached at the center of the circle. A hollow circle indicates that two polygons are attached by a revolute joint at the center of the circle. The same-layer relation is indicated by giving a layer index, a small nonnegative integer, to each polygon. Polygons with the same layer index are on the same layer, while those with different layer indices are on different layers.
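An interpretation of this kind can be sketched as a small record. The class and field names below are assumptions made for illustration. Following the graphical convention, the same-layer relation is stored as one layer index per polygon rather than as an explicit set of pairs; a relation derived from equal indices is an equivalence relation by construction, which the helper `is_equivalence` checks by brute force.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, Tuple

Point = Tuple[float, float]
Joint = Tuple[int, int, Point]  # (polygon p, polygon q, joint location r)

@dataclass(frozen=True)
class Interpretation:
    grounded: FrozenSet[int]   # indices of grounded polygons
    rigid: FrozenSet[Joint]    # rigid joints
    revolute: FrozenSet[Joint] # revolute joints
    layer: Tuple[int, ...]     # layer index of each polygon

    def same_layer(self, p: int, q: int) -> bool:
        return self.layer[p] == self.layer[q]

def is_equivalence(interp: Interpretation) -> bool:
    """Brute-force check that same_layer is reflexive, symmetric, and transitive."""
    ids = range(len(interp.layer))
    reflexive = all(interp.same_layer(p, p) for p in ids)
    symmetric = all(interp.same_layer(p, q) == interp.same_layer(q, p)
                    for p, q in product(ids, repeat=2))
    transitive = all(interp.same_layer(p, r)
                     for p, q, r in product(ids, repeat=3)
                     if interp.same_layer(p, q) and interp.same_layer(q, r))
    return reflexive and symmetric and transitive

# Polygon 0 is grounded, polygon 1 hinges on it, polygon 2 lies on another layer.
example = Interpretation(
    grounded=frozenset({0}),
    rigid=frozenset(),
    revolute=frozenset({(0, 1, (2.0, 3.0))}),
    layer=(0, 0, 1),
)
```

Storing indices instead of pairs keeps the record compact, but the particular integers carry no meaning beyond equality, in keeping with the absence of any 'in-front-of' ordering between layers.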

Figure: The graphical method for depicting interpretations that is used in this paper. The grounding symbol indicates that a polygon is grounded. A solid circle indicates a rigid joint. A hollow circle indicates a revolute joint. Two polygons with the same layer index are on the same layer.

Model reconstruction can be viewed as a generate-and-test process. Initially, all possible interpretations are generated for each scene. Then, inadmissible and unstable interpretations are filtered out. Admissibility and stability can be collectively viewed as a consistency requirement. The stable admissible interpretations are thus models of a scene. The nature of the theory guarantees that there will always be at least one model for each scene, namely the model where all objects are grounded. There may, however, be multiple models for a given scene. Therefore, a preference relation is applied through a sequence of circumscription processes [38] to select the minimal, or preferred, models for each scene. While there will always be at least one minimal model for each scene, there may be several, since the preference relation may not induce a total order. If there are multiple minimal models for a given scene, one is chosen arbitrarily as the most-preferred model for that scene.

The precise details of the admissibility criteria, the stability-checking algorithm, the preference relations, and the circumscription process are beyond the scope of this paper. They are discussed in [57]. What is important, for the purpose of this paper, is that, given a scene sequence, model reconstruction produces a sequence of interpretations, one for each scene, and that these interpretations are 4-tuples containing the predicates grounded, rigid, revolute, and samelayer. Figure 4 shows sample interpretation sequences produced by the model-reconstruction component on the scene sequences from Figure 3.
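The generate-and-test process, followed by staged minimisation, can be sketched as below. This is only loosely in the spirit of prioritised circumscription and is not the procedure of [57]: the simplified two-part model (grounded objects, joints), the ordering of the stages, the toy stability rule, and all names are assumptions made for illustration.

```python
from itertools import chain, combinations
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

Model = Tuple[FrozenSet[str], FrozenSet[str]]  # (grounded objects, joints present)
Scene = Dict[str, bool]                        # toy observables for one scene

def subsets(items: Iterable[str]) -> List[FrozenSet[str]]:
    pool = list(items)
    combos = chain.from_iterable(combinations(pool, k) for k in range(len(pool) + 1))
    return [frozenset(c) for c in combos]

def minimise(models: List[Model], part: int) -> List[Model]:
    """Keep models whose chosen component is not a strict superset of another's."""
    return [m for m in models if not any(o[part] < m[part] for o in models)]

def reconstruct_scene(scene: Scene, objects: List[str], joints: List[str],
                      is_consistent: Callable[[Scene, Model], bool]) -> Model:
    candidates = [(g, j) for g in subsets(objects) for j in subsets(joints)]  # generate
    models = [m for m in candidates if is_consistent(scene, m)]               # test
    preferred = minimise(minimise(models, 0), 1)  # stage 1: grounding, stage 2: joints
    # Arbitrary but deterministic choice among the remaining minimal models.
    return min(preferred, key=lambda m: (sorted(m[0]), sorted(m[1])))

OBJECTS = ["box", "table"]
JOINTS = ["box-table"]

def toy_consistent(scene: Scene, model: Model) -> bool:
    """Toy stability: the table needs grounding; the box needs some support."""
    grounded, joints = model
    table_ok = "table" in grounded
    box_ok = "box" in grounded or (
        table_ok and (scene["box_rests_on_table"] or "box-table" in joints))
    return table_ok and box_ok

scenes = [{"box_rests_on_table": True}, {"box_rests_on_table": False}]
sequence = [reconstruct_scene(s, OBJECTS, JOINTS, toy_consistent) for s in scenes]
```

In this toy run, the first scene yields a model with only the table grounded and no joint, since resting on the table already explains the box's stability. In the second scene the box no longer rests on the table, so the preferred model adds the joint rather than grounding the box. The all-grounded interpretation is a model in both scenes, which is why the list of preferred models is never empty.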


Jeffrey Mark Siskind
Wed Aug 1 19:08:09 EDT 2001