Model reconstruction determines the truth values of the force-dynamic relations on a frame-by-frame basis in the input movie. Intervals of constant truth value for a given force-dynamic relation are taken to be primitive event occurrences. LEONARD uses event logic to infer compound event occurrences from primitive event occurrences. For example, for the image sequence in Figure 1(a), model reconstruction determines that the green block supports the red block from Frame 0 to Frame 13 and that the hand is attached to the red block from Frame 13 to Frame 20. This will be denoted as and , i.e. that the primitive event types and occurred during the intervals [0,13) and [13,20) respectively. The compound event type might be defined as
i.e. followed by . (In the above, I use `;' informally as a sequence operator. The precise definition of `;' will be given momentarily.) The task of the event-logic inference procedure is to infer , i.e. that the compound event type occurred during the interval [0,20).
Event logic provides a calculus for forming compound event types as expressions over primitive event types. The syntax and semantics of event logic will be described momentarily. Event-logic expressions denote event types, not event occurrences. As such, they do not have truth values. Rather, they are predicates that describe the truth conditions that must hold of an interval for an event to occur. In contrast, an event-occurrence formula does have a truth value. If is an event-logic expression that denotes a primitive or compound event type, and is an interval, then is an atomic event-occurrence formula that is true if and only if the truth conditions for the event type hold of the interval .
denotes coincidental occurrence, the fact that an occurrence of started at the beginning of and finished at the end of . would not hold if an occurrence of did not precisely coincide with , but instead overlapped, partially or totally, with . Event types have internal temporal structure that render this distinction important. In the case of primitive event types, that structure is simple. Each primitive event type is derived from a predicate. If is a predicate, then denotes the primitive event type derived from . A primitive event type holds of an interval if the corresponding predicate holds of every instant in that interval. This means that and might have different truth values. For example, if is true of every instant in [0,2) and false of every other instant, then is true while is false. Event logic takes coincidental occurrence to be a primitive notion. As will be demonstrated below, overlapping occurrence is a derived notion that can be expressed in terms of coincidental occurrence using compound event-logic expressions.
Two auxiliary notions are needed to define the syntax and semantics of event logic. First, there are thirteen possible relations between two intervals. Following [4], I denote these relations as =, <, >, , , , , , , , , , and and refer to them collectively as Allen relations throughout the paper. The names , , , , and are mnemonics for meet, overlap, start, finish, and during respectively. The inverse relations, such as , whose names end in are the same as the corresponding relations, such as , whose names do not end in except that the arguments are reversed. Figure 8 depicts all thirteen Allen relations graphically. Second, I define the span of two intervals and , denoted , as the smallest super-interval of both and .
Figure 8: The Allen relations, the thirteen possible relations between two
intervals.
The syntax of event logic is defined as follows. We are given finite disjoint sets of constant symbols along with a finite set of primitive event-type symbols, each of a specified arity. Constant symbols, such as and , denote objects in the input movie while primitive event-type symbols, such as , denote parameterized primitive event types. An atomic event-logic expression is a primitive event-type symbol of arity n applied to a sequence of n constants. For example, . An event-logic expression is either an atomic event-logic expression or one of the compound event-logic expressions , , , or , where and are event-logic expressions and .
Informally, the semantics of compound event-logic expressions is defined as follows:
Formally, the truth of an atomic event-occurrence formula is defined relative to a model. Let I be the set of all intervals and O be the set of all objects in the movie. A model M is a map from primitive event-type symbols of arity n to subsets of . (M can be viewed as either a model or as a movie.) M thus maps primitive event-type symbols to relations that take an interval as their first argument, in addition to the remaining object parameters. The semantics of event logic is formally defined by specifying an entailment relation as follows:
Figure 9 shows the primitive event types currently used by LEONARD. The definitions in Figure 9 are formulated in terms of the predicates , , , and that are produced by model reconstruction as described in Section 2. An object is supported if it is not grounded. Two objects contact if their polygons touch and they are on the same layer. Two polygons p and q touch, denoted , if they intersect and their intersection has zero area. Two objects are attached if there is a rigid or revolute joint that joins them. Determining whether an object x supports an object y is a little more complex and requires counterfactual reasoning. Let be the set of polygons in a scene, be the most-preferred model of the scene as produced by model reconstruction, and be true if the scene is stable under an interpretation I. Stability analysis, i.e. the predicate, is a subroutine used by the model-reconstruction component. An object x supports an object y if y is not grounded in the most-preferred model I and a variant of with x removed is not stable under a variant of I where all objects except for y and those rigidly attached, directly or indirectly, to y are grounded. In Figure 9, denotes the variant of with x removed, denotes the fact that x is rigidly attached to y, denotes the reflexive transitive closure of the relation, denotes the set of objects that are rigidly attached, directly or indirectly, to y, and denotes a variant of I where all objects except for y and those rigidly attached, directly or indirectly, to y are grounded.
Figure 9: Definition of the primitive event types used by LEONARD.
Figure 10 shows the compound event-type definitions currently used by LEONARD. denotes an event type where x picks y up off of z. It is specified as a sequence of three intervals, where x is not attached to and does not support y in the first interval but is attached to and does support y in the third interval. Additionally, z supports y in the first interval but does not support y in the third interval. Furthermore, several conditions must hold in both the first and third intervals: x must be unsupported, y must not support either x or z, x and z must not support each other, and y must not be attached to z. During the second interval, intermediate between the first and third intervals, either x is attached to y or y is attached to z. Additionally, several conditions must hold throughout the entire event: x, y, and z must be distinct and y must be supported. denotes an event type where x puts y down on z. It is specified in a fashion that is similar to but where the three subevents occur in reverse order. denotes an event type where w puts x down on y which is resting on z. It is specified as , where z supports but is not attached to y and z is distinct from w, x, and y. denotes an event type where w picks x up off of y which is resting on z. It is specified as , where z supports but is not attached to y and z is distinct from w, x, and y. denotes an event type where w picks x up off of y and puts it down on z which is distinct from y. denotes an event type where w first puts y down on z then sometime later stacks x on top of y. Finally, denotes an event type where w first unstacks x from on top of y (which is resting on z) and then sometime later picks y up off of z. Figure 1 shows sample movies depicting occurrences of the event types and . Figures 11 through 15 show sample movies depicting occurrences of the event types , , , , and respectively.
Figure 10: The lexicon of compound event types used by LEONARD.
Figure 11: An image sequence depicting a stack event.
Figure 12: An image sequence depicting an unstack event.
Figure 13: An image sequence depicting a move event.
Figure 14: An image sequence depicting an assemble event.
Figure 15: An image sequence depicting a disassemble event.
Nominally, all atomic event-logic expressions are primitive event types. However, we allow giving a name to a compound event-logic expression and using this name in another event-logic expression as short hand for the named expression with appropriate parameter substitution. This is simply a macro-expansion process and, as such, no recursion is allowed. This feature is used in Figure 10 to define , , and in terms of ; , , and in terms of ; in terms of , which is itself defined in terms of ; and in terms of , which is itself defined in terms of .
The overall goal of the event-classification component is to infer all occurrences of a given set of compound event types from a given set of primitive event occurrences. The model-reconstruction component combined with the primitive event-type definitions given in Figure 9 produces a set of primitive event occurrences for a given scene sequence. Figure 10 lists parameterized compound event types. These are instantiated for all tuples of objects in the scene sequence to yield ground compound event-logic expressions. The event-classification component infers all occurrences of these compound event types that follow from the set of primitive event occurrences. Let us define to be . The model-reconstruction component combined with the primitive event-type definitions given in Figure 9 produces M. Instantiating the parameterized compound event types from Figure 10 for all object tuples yields a set of event-logic expressions. The event-classification component computes for every in this set.
In principle, could by implemented as a straightforward application of the formal semantics for event logic as specified above. There is a difficulty in doing so, however. The primitive event types have the property that they are liquid. Liquid events have the following two properties. First, if they are true during an interval , then they are also true during any subinterval of . Second, if they are true during two overlapping intervals and , then they are also true during and any subinterval of . For example, if an object is supported during [1,10], then it also is supported during [2,5], [3,8], and all other subintervals of [1,10]. Similarly, if an object is supported during [1,5] and [4,10], then it also is supported during [1,10] and all of its subintervals. [48] introduced the notion of liquidity and [68], [22], [69], and [30] have observed that many event types have this property. Because the primitive event types are liquid, they will hold over an infinite number of subintervals. This renders the formal semantics inappropriate for a computational implementation. Even if one limits oneself to intervals with integral endpoints, the primitive event types will hold over quadratically many subintervals of the scene sequence. Furthermore, a straightforward computational implementation of the formal semantics would be inefficient, because it requires quantifying over subintervals to implement and quantifying over pairs of subintervals to implement . The central result of this paper is a novel representation, called spanning intervals, that allows an efficient representation of the infinite sets of subintervals over which liquid event types hold along with an efficient inference procedure that operates on that representation. This representation, and the inference procedure that implements , are presented in the next section.