The expected contributions of our work under this program are based in part on the following innovative claims and insights:
In describing our approach we divide the technologies to be developed into those concerned with the human form and the actions executed by a single individual (or a small number of individuals), and those focused on the general problem of detecting or recognizing actions in video imagery.
Presupposing the existence of 3-dimensional information, Aaron Bobick has focused on methods of interpreting action. For whole-body action the most relevant work is the use of phase-space constraints to detect actions. The basic idea is to learn during training that certain relationships between body parts are in effect only during a particular action. Recognizing an action then reduces to detecting which constraints hold over a given time interval. We have used this technology to successfully recognize nine fundamental ballet movements from 3D data. The important point is that the set of phase-space constraints, as opposed to a particular space-time trajectory, is the representation of the action.
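As a rough illustration of the idea (not the actual algorithm), the Python sketch below treats each constraint as a simple per-feature range learned from training sequences of body-part measurements; recognition then reduces to testing which action's constraint set holds over a given interval. All names are hypothetical, and the real phase-space constraints capture richer relationships between body parts than independent ranges.

```python
import numpy as np

def learn_constraints(training_sequences, tol=3.0):
    """Learn a crude constraint set for one action: for each body-part
    measurement (e.g. a joint angle or velocity), record the range of
    values observed while the action is being performed."""
    stacked = np.vstack(training_sequences)      # (total_frames, n_features)
    lo = stacked.min(axis=0) - tol
    hi = stacked.max(axis=0) + tol
    return lo, hi                                # per-feature bounds

def constraints_hold(sequence, constraints, min_fraction=0.9):
    """An action is detected over an interval if its constraints hold for
    (nearly) every frame of that interval."""
    lo, hi = constraints
    inside = np.all((sequence >= lo) & (sequence <= hi), axis=1)
    return inside.mean() >= min_fraction

def recognize(sequence, models):
    """Return the actions whose constraint sets are satisfied over the
    given time interval; the constraints, not a trajectory, represent
    each action."""
    return [name for name, c in models.items() if constraints_hold(sequence, c)]
```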
We note that recently we have integrated these areas in constructing a system that automatically tracks and recognizes the gestures of Tai Chi.
We intend to extend the tracking and modeling to handle outdoor situations with significantly less controlled environments. Our goal is to use multiple cameras and dynamic constraints (e.g. temporal smoothness and kinematic plausibility) to enhance the robustness of the algorithms. We also plan to incorporate statistical decision making into the phase-space constraint approach, including the HMMs to be discussed below.
This is the case even though there are no discernible features in each individual frame. This simple demonstration indicates that geometric modeling is not necessary to recognize action. And given the difficulties present in computing the 3D structure, it might not even be desirable.
Recently, we have begun to develop appearance-based methods of recognizing action. We leave the discussion of statistically- and HMM-based techniques for the next section. Here we want to emphasize some recent work that focuses on motion patterns varying over time as an index for action recognition. The basic idea is to separate where motion is happening in the image (i.e. the shape of the motion field) from how the motion is moving (i.e. the movement of the motion field). By using simple motion differencing operations we create a mask image; statistical moments describing that shape are used as an index into stored models of action. From the model we retrieve how the motion varies over time, and then test for agreement. The procedure is fast and entirely appearance-based.
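The sketch below (Python with OpenCV) illustrates the "where" half of this idea under the assumption that a binary motion mask has already been produced by motion differencing; the stored model dictionary and nearest-neighbor matching are hypothetical placeholders for the actual indexing scheme.

```python
import cv2
import numpy as np

def shape_index(motion_mask):
    """Statistical moments of the binary motion mask ('where' the motion
    is happening); Hu moments give a compact, translation- and
    scale-invariant index."""
    m = cv2.moments(motion_mask, binaryImage=True)
    return cv2.HuMoments(m).flatten()

def nearest_action(index, stored_models):
    """Retrieve the stored action whose moment index is closest; the
    retrieved model's description of 'how' the motion varies over time
    would then be tested for agreement with the observed sequence."""
    names = list(stored_models)
    dists = [np.linalg.norm(index - stored_models[n]) for n in names]
    return names[int(np.argmin(dists))]
```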
The natural progression from this work was to explore the use of Hidden Markov Models for describing gesture. One innovative technique we introduced allowed for different features to be measured for each state: the basic idea is that no single representation may be valid for an entire action and the ``right'' features to measure may be different at different phases of a gesture.
Our most recent work has once again moved away from HMMs and back to explicit (or visible) states. The main idea is that for gesture the temporal characteristics of the desired action are often known, and one would like to devise a parsing mechanism capable of segmenting such gestures from incoming video. We have demonstrated an approach which allows us to parse natural gestures generated by someone telling a story. The system is able to identify important or meaningful gestures based upon their temporal structure. Under this project we will develop more fully the idea of temporal structure modeling. Our principal observation is that the temporal nature of an action is often better specified than its spatial configuration. One potential application is a coding system where the most meaningful gestures are granted a greater share of the video bandwidth.
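A minimal sketch of the kind of temporal-structure parsing involved, assuming a tracked hand position and hypothetical speed thresholds: candidate gestures are segmented by alternations between rest and movement, rather than by the spatial form of the movement itself.

```python
import numpy as np

def segment_phases(hand_positions, rest_speed=2.0, fps=30):
    """Label each frame 'rest' or 'moving' from hand speed, then collapse
    runs of identical labels into phases. A candidate gesture is a moving
    run bracketed by rests, i.e. segmentation driven purely by temporal
    structure."""
    speed = np.linalg.norm(np.diff(hand_positions, axis=0), axis=1) * fps
    moving = speed > rest_speed
    phases, start = [], 0
    for i in range(1, len(moving)):
        if moving[i] != moving[i - 1]:
            phases.append(("moving" if moving[i - 1] else "rest", start, i))
            start = i
    phases.append(("moving" if moving[-1] else "rest", start, len(moving)))
    return phases
```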
As a final note we mention our work on recognizing American Sign Language. Although this may not be considered natural gesture, it is a grammar-controlled action, much like the assembly of a device or the unloading of a particular type of object: Part A must be raised before Part B can be extracted. We intend to explore the use of HMMs in the understanding of such temporally structured gestures.
Our recent work on approximate world modeling is designed to address this problem. The basic idea is to use some potentially inaccurate but universally applicable general-purpose vision routines to try to establish an approximate model of the world. This model, in turn, is used to establish the context and select the best vision routine to perform a given task. A fundamental innovation of this work is that approximate models can be augmented by extra-visual, contextual information. For example, a linguistic description might be available indicating an approximate position for an object. Because we assume a potentially inaccurate world model, that information can be incorporated directly.
Under this project we intend to extend our initial work on action recognition within this framework. We are already using a simple inference system that draws implications about visual features that might be present during a given action in a given context. Our goal is to extend this work by deriving a ``Past-Now-Future'' calculus (based on Allen's temporal interval algebra) that would allow the system to reason about sequences of events that constitute an activity.
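The sketch below is a toy, brute-force version of such propagation, assuming only "A before B" ordering constraints and the three Past-Now-Future states; the actual calculus will handle the full set of Allen-style interval relations, and all names here are illustrative.

```python
from itertools import product

def consistent(a, b):
    """'A happens before B': A can never be less advanced in time than B,
    so B cannot be 'now' or 'past' while A is still 'future'.
    (Simultaneous 'now' states are allowed here as a simplification.)"""
    order = {"past": 0, "now": 1, "future": 2}
    return order[a] <= order[b]

def propagate(possible, before_constraints):
    """possible: {event: set of PNF states}. Remove any state that cannot
    participate in a globally consistent assignment with respect to the
    'before' constraints; a tiny stand-in for PNF propagation."""
    events = list(possible)
    keep = {e: set() for e in events}
    for combo in product(*(possible[e] for e in events)):
        assign = dict(zip(events, combo))
        if all(consistent(assign[a], assign[b]) for a, b in before_constraints):
            for e in events:
                keep[e].add(assign[e])
    return keep

# Example: if 'grasp' is known to be happening now and must precede 'lift',
# then 'lift' cannot already be in the past.
states = propagate({"grasp": {"now"}, "lift": {"past", "now", "future"}},
                   [("grasp", "lift")])
```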
A fundamental problem to be addressed in any surveillance or monitoring scenario is that of information filtering: deciding whether a scene contains an activity or behavior worth analyzing. Our approach to detecting such situations is to generate a statistical description of typical behavior, such that typical scenarios evaluate as highly probable under that distribution. With tuning, the same description will assign much less typical situations a relatively lower probability.
We have begun to generate the tracking and trajectory mechanisms necessary to construct the statistical description of activity. Also, since we are interested in interactions, we will employ the coupled HMMs we developed earlier to represent the statistical coupling between two people who are aware of each other's presence. The goal is that the coupled HMMs will be sensitive enough to the pair's behavior that we can use them to detect "atypical interactions."
Figure: single frame from pedestrian scene; segmented people; tracked trajectories for behavior system.
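As a simple stand-in for the eventual coupled-HMM scoring (it is not that model), the sketch below fits a single Gaussian over crude per-trajectory features and flags trajectories whose likelihood falls below a tuned threshold; the feature choices and names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def trajectory_features(track):
    """Crude per-trajectory summary: mean velocity, mean speed, and spatial
    extent. (A placeholder for whatever the tracker actually reports.)"""
    v = np.diff(track, axis=0)
    return np.hstack([v.mean(axis=0),
                      np.linalg.norm(v, axis=1).mean(),
                      track.max(axis=0) - track.min(axis=0)])

def fit_typical(tracks):
    """Fit a Gaussian description of 'typical' behavior from many tracks."""
    X = np.array([trajectory_features(t) for t in tracks])
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])   # regularize for stability
    return multivariate_normal(mean=X.mean(axis=0), cov=cov)

def is_atypical(track, model, threshold):
    """Flag a trajectory whose likelihood under the typical-behavior model
    falls below a tuned threshold."""
    return model.logpdf(trajectory_features(track)) < threshold
```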
Even apparently simple driving actions can be broken down into a long chain of simpler sub-actions. A lane change, for instance, may consist of the following steps: (1) a preparatory centering of the car in the current lane, (2) looking around to make sure the adjacent lane is clear, (3) steering to initiate the lane change, (4) the change itself, (5) steering to terminate the lane change, and (6) a final recentering of the car in the new lane. Under this project we will statistically characterize the sequence of steps within each action, and use the first few preparatory steps to identify which action is being initiated. Initial pilot studies indicate that drivers' patterns are quite predictable: it is possible both to know almost instantly when a driver is going to turn in a particular direction, and to know whether the control inputs are following normal statistical patterns.
Some initial results:
Figure: single frame from vehicle scene; tracked trajectories for behavior system.
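A sketch of the intended early classification, assuming the preparatory sub-steps have already been detected and labeled and that each maneuver has a learned step-transition model; the model structure (`initial`, `transitions`) is hypothetical, not the system's actual representation.

```python
import numpy as np

def score_prefix(observed_steps, action_model):
    """Log-probability of the first few observed sub-steps under one
    maneuver's step-transition model."""
    log_p = np.log(action_model["initial"][observed_steps[0]])
    for prev, curr in zip(observed_steps, observed_steps[1:]):
        log_p += np.log(action_model["transitions"][prev][curr])
    return log_p

def predict_action(observed_steps, models):
    """The maneuver whose model best explains the preparatory steps is the
    predicted action, often well before the action is completed."""
    return max(models, key=lambda name: score_prefix(observed_steps, models[name]))
```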
Object detection in video is a natural extension of the problem of finding objects in single images. The need for robust, configurable systems that can locate objects in images is rapidly increasing due to the explosion in the amount of visual data available; searching and indexing this data is currently expensive.
Example-based learning techniques have proven to be successful in a wide variety of areas, from data mining to face detection in cluttered scenes. Through the use of examples, these algorithms avoid the need to explicitly model the object being searched for; rather, the model of the object is implicit in the patterns automatically learned by the algorithm. Thus, we feel this class of algorithms is well suited to the task we wish to investigate, in that their use will make the object detection system applicable to a wider variety of domains.
We have already developed a pedestrian detection system for static imagery. The basic approach is to construct a statistical basis set of the appearance of pedestrians in a scene and then to estimate the probability that a given region of an image is a view of a person.
Figure: statistical training architecture; people found in static imagery.
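A minimal sketch of the statistical basis-set idea, using PCA over normalized training windows and reconstruction residual as a stand-in for the probability estimate; the names and scoring rule are illustrative rather than the deployed system.

```python
import numpy as np

def build_basis(training_patches, k=20):
    """Build a statistical 'basis set' of pedestrian appearance.
    training_patches: (n_examples, n_pixels) array of normalized windows."""
    mean = training_patches.mean(axis=0)
    _, _, vt = np.linalg.svd(training_patches - mean, full_matrices=False)
    return mean, vt[:k]                      # mean image and top-k basis vectors

def person_score(window, mean, basis):
    """Score a candidate window by how well the basis reconstructs it; a
    small residual suggests the window resembles the training class."""
    x = window - mean
    coeffs = basis @ x
    residual = x - basis.T @ coeffs
    return -np.linalg.norm(residual)
```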
Clearly, one of the most important steps in detecting objects in video is to localize where motion is occurring in a frame. The simplest technique is to use change detection, or the differencing of consecutive frames of video, to see where motion has occurred. Another, more sophisticated, approach to analyzing motion in video is represented by the class of optical flow algorithms. These algorithms automatically compute pixel-wise correspondences between grey-level images. The information provided by the optical flow algorithms is more detailed than simple change detection and will be useful in subsequent steps of detecting objects. By integrating detection information across multiple frames, the system should exhibit a higher signal-to-noise ratio than by using isolated frames.
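Both operations are standard; a brief sketch using OpenCV follows, where the threshold and the Farneback parameters are typical defaults rather than tuned values.

```python
import cv2

def motion_regions(prev_gray, curr_gray, thresh=25):
    """Change detection: difference consecutive frames and threshold to
    localize where motion has occurred."""
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask

def dense_flow(prev_gray, curr_gray):
    """Optical flow: pixel-wise correspondences between grey-level frames,
    giving the direction and magnitude of motion rather than just its
    presence."""
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```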
To actually determine what is and what is not a person, a classifier will be trained to recognize various measurements of people, including dynamic information not available in static images:
All these measurements will be combined into a feature vector that will be used to train a classifier to differentiate the Person from the Non-Person classes. To classify objects in this system, we will be using the support vector machine (SVM) classification technique developed by Vapnik (1995). The support vector algorithm uses structural risk minimization to find the hyperplane that optimally separates two classes of objects; this is equivalent to minimizing a bound on the generalization error.
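A sketch of this classification step using a standard SVM implementation (scikit-learn here); the feature files, labels, and parameters are placeholders for whichever measurements are ultimately chosen.

```python
import numpy as np
from sklearn.svm import SVC

# X: feature vectors combining the static and dynamic measurements described
# above; y: 1 for Person, 0 for Non-Person. Both files are hypothetical.
X = np.load("person_features.npy")
y = np.load("person_labels.npy")

# A linear-kernel SVM finds the maximally separating hyperplane between the
# two classes, minimizing a bound on the generalization error.
classifier = SVC(kernel="linear", C=1.0)
classifier.fit(X, y)

def is_person(feature_vector):
    """Classify one candidate feature vector as Person / Non-Person."""
    return bool(classifier.predict(feature_vector.reshape(1, -1))[0])
```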
A significant challenge to this approach is that the core pattern detection technique we develop will have to be applicable to finding objects from a wide variety of dissimilar classes. Also, people, unlike faces, are non-rigid objects, and therefore the space of patterns may be huge; the hope is that the support vector machinery will be able to recover this decision surface. The lack of a sufficient number of examples for this difficult, high-dimensional classification problem needs to be addressed as well. Furthermore, video imagery tends to be noisier than single still images, which will undoubtedly complicate processing; we hope that certain preprocessing steps will lessen the negative impact of this noise.
A fundamental problem in understanding activities in video is simply keeping track of the individual entities. Much work in tracking only applies to situations with static backgrounds and non-occluding objects; unfortunately, such situations arise only occasionally. Typically, tracking requires consideration of complicated environments with difficult visual scenarios.
To address this problem we have developed a closed-world tracking technique that exploits local context to track objects. The fundamental idea is that one is not tracking an object against an unwanted, unknown distractor. Rather, there are no distractors: all objects and background must be tracked. The advantage is that in this situation it is possible to design custom trackers specifically suited to resolve a current ambiguity. We have initially tested this work in the football domain with reasonable success. But to apply this technique to other domains we need to transform it from a fragile experimental system into a robust, portable one. Under this project we would continue our development of the closed-world tracking technology, including developing a stand-alone system that could be easily included in other systems.
Wilson, A. and Bobick, A., "Nonlinear Parametric Hidden Markov Models," to appear in Proc. ICCV, Bombay, India, 1998.
Pinhanez, C. and Bobick, A., "Human Action Detection Using PNF Propagation of Temporal Constraints," MIT Media Lab PerCom TR 423, 1997.
Wren, C. and Pentland, A., "Dynamic Modeling of Human Motion," MIT Media Lab PerCom TR 415, 1997.
Vetter, T. and Poggio, T., "Linear Object Classes and Image Synthesis from a Single Example Image," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 7, 733-742, July 1997.
Schoelkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V., "Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers," Special Issue of IEEE Transactions on Signal Processing, in press.
Jones, M. and Poggio, T., "Multidimensional Morphable Models: A Framework for Representing and Matching Object Classes," International Journal of Computer Vision, submitted 1997.
Brand, M., Oliver, N., and Pentland, A., "Coupled HMMs for Complex Action Recognition," Proc. IEEE CVPR, San Juan, Puerto Rico, 1997.
Bobick, A., "Movement, Activity, and Action: The Role of Knowledge in the Perception of Motion," Philosophical Transactions of the Royal Society, 1997.
Davis, J. W. and Bobick, A. F., "The Representation and Recognition of Action Using Temporal Templates," Proc. IEEE CVPR (and MIT Media Lab TR #402), San Juan, Puerto Rico, June 1997.
Wilson, A., Bobick, A., and Cassell, J., "Recovering the Temporal Structure of Natural Gesture," Proc. CVPR, San Juan, Puerto Rico, 1997.
Bobick, A. F. and Davis, J. W., "Real-time Recognition of Activity Using Temporal Templates," Workshop on Applications of Computer Vision (and MIT Media Lab TR #386), 1996.
Bobick, A. F. and Davis, J. W., "An Appearance-based Representation of Action," International Conference on Pattern Recognition (and MIT Media Lab PerCom TR #369), 1996.
Sinha, P. and Poggio, T., "Role of Learning in Three-dimensional Form Perception," Nature, Vol. 384, No. 6608, 460-463, 1996.
Beymer, D. and Poggio, T., "Image Representation for Visual Learning," Science, 272, 1905-1909, 1996.
Sung, K.-K. and Poggio, T., "Example-based Learning for View-based Human Face Detection," Proceedings of the Image Understanding Workshop, November 13-16, 1994, Morgan Kaufmann, San Mateo, CA, 843-850.
Poggio, T. and Beymer, D., "Learning and Vision," in Early Visual Learning, S. Nayar and T. Poggio (eds.), Oxford University Press, 43-66, 1996.
This is the site of a DARPA-sponsored contractor. The views and conclusions contained within this website are those of the web authors and should not be interpreted as the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the United States Government.