Yan Ke, Rahul Sukthankar, and Martial Hebert
The amount of digital video data has grown exponentially in recent years due to the increasing affordability of digital consumer video cameras, large-scale deployment of video surveillance systems, ease of digital content creation, and availability of high-speed networks and high-capacity storage devices. Unfortunately, the technology for searching, indexing, and retrieving video content has failed to keep pace, and manual organization and annotation of this content are becoming infeasible. The goal of this project is to develop algorithms and tools to enable near real-time processing of large-scale (100 TB) video collections. We focus on the early stages of processing, namely the extraction of spatio-temporal features corresponding to interesting events.
We argue that video should be treated as a three-dimensional volume, and thus the fundamental processing unit should be 3D blocks consisting of many frames, rather than individual frames processed one at a time. Only recently have researchers begun to process blocks of video frames simultaneously [2, 6, 7]. Just as researchers have decomposed images into their constituent shapes and used 2D shape descriptors for analysis [1, 4], video can be thought of as groups of 3D volumes. There are several advantages to jointly analyzing a video's space and time dimensions. First, spatial and temporal consistency can be easily maintained. Second, instead of analyzing pixels over many frames, higher-level algorithms can focus on large, sparse regions for improved efficiency. Finally, the appearance and motion of objects in the scene can be jointly modeled, which can potentially lead to better recognition results.
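As a minimal illustration of this volumetric view (a sketch in Python with NumPy, using synthetic frame data rather than anything from our experiments), consecutive frames can simply be stacked along a time axis to form a single 3D array:

```python
import numpy as np

def frames_to_volume(frames):
    """Stack a list of 2D grayscale frames (H x W) into one 3D
    spatio-temporal volume of shape (T, H, W)."""
    return np.stack(frames, axis=0)

# Example: five synthetic 4x4 frames become one 5x4x4 volume.
frames = [np.full((4, 4), t, dtype=np.uint8) for t in range(5)]
volume = frames_to_volume(frames)
print(volume.shape)  # (5, 4, 4)
```

Once frames are in this form, spatio-temporal regions are just sub-arrays of the volume, which is what makes joint analysis of space and time straightforward.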
Here, we show a video segment decomposed into 3D volumes.
[Figure: Frames 1-5 of the sequence and the extracted 3D volumes: shirt, pants, legs, ball, and the composite decomposition.]
This shows how we transform action recognition in video into a 3D shape matching problem. Notice the distinctive shapes that appear as people move.
[Figure: sequence start and end frames with their 3D volumetric representations, for a handwave sequence from Schuldt's dataset and a ballet sequence.]
By representing actions as spatio-temporal events, we have extended the Viola-Jones AdaBoost framework [5] to recognize actions [3]. By learning a set of 3D box features on the integral volume, we are able to detect events such as "sit down", "hand wave", and "grab cup". Below we show our detector recognizing the "grab cup" event. Note that our detector is localized in both space and time.
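The integral volume is the 3D analogue of the integral image used by Viola and Jones: after a cumulative sum along each axis, the sum of any axis-aligned box can be read off with eight lookups, independent of the box size. A minimal sketch (function names and the final difference feature are illustrative, not taken from the paper):

```python
import numpy as np

def integral_volume(vol):
    """3D analogue of the integral image: after zero-padding, entry
    [t, y, x] holds the sum of vol[:t, :y, :x]."""
    ii = vol.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(ii, ((1, 0), (1, 0), (1, 0)))

def box_sum(ii, t0, t1, y0, y1, x0, x1):
    """Sum of vol[t0:t1, y0:y1, x0:x1] via 3D inclusion-exclusion:
    eight corner lookups, regardless of the box size."""
    return (ii[t1, y1, x1] - ii[t0, y1, x1] - ii[t1, y0, x1] - ii[t1, y1, x0]
            + ii[t0, y0, x1] + ii[t0, y1, x0] + ii[t1, y0, x0] - ii[t0, y0, x0])

# A volumetric box feature in the Viola-Jones spirit: the difference
# between two temporally adjacent boxes (this particular feature is
# illustrative).
vol = np.arange(24, dtype=float).reshape(2, 3, 4)
ii = integral_volume(vol)
feature = box_sum(ii, 0, 1, 0, 3, 0, 4) - box_sum(ii, 1, 2, 0, 3, 0, 4)
print(feature)  # -144.0
```

Because each feature evaluation costs a constant number of lookups, a boosted cascade of many such features remains cheap to run over a video.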
We show some preliminary results from our current experiments on a ballet sequence. We manually select a template that consists of a ballet dancer holding a particular pose. "Template Regions" shows an over-segmentation of the volume using mean shift. Using the template volume, we match against other parts of the video.
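The mean shift idea itself can be sketched in a few lines: each point moves to the mean of the data points within a bandwidth of its current position, and points that converge to the same mode belong to the same (over-)segment. This is a toy version with illustrative parameters and 1D features, not the implementation we use on video volumes:

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, iters=30):
    """Move each point toward the mean of the original points within
    `bandwidth` of its current position; after convergence, points
    sharing a mode form one segment."""
    shifted = points.astype(float).copy()
    for _ in range(iters):
        for i, p in enumerate(shifted):
            near = points[np.linalg.norm(points - p, axis=1) < bandwidth]
            shifted[i] = near.mean(axis=0)
    return shifted

# Two well-separated 1D clusters collapse onto two modes.
pts = np.array([[0.0], [0.2], [5.0], [5.2]])
modes = mean_shift(pts, bandwidth=1.0)
```

For video, the feature vector per pixel would also include position and time, so the resulting segments are coherent spatio-temporal regions rather than per-frame blobs.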
Below we show key frames taken every 0.5 seconds in the video. The first frame shows the template, and the matched poses are labeled. Note that because the segmentation can be computed once and saved, we do not need to perform segmentation online. Further, given the segmentation there is no need to run a sliding window through the video, so matching and retrieval can be done very quickly.
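Schematically, retrieval then reduces to comparing the template's descriptor against one precomputed descriptor per segmented region, rather than scanning every window position. The descriptors and threshold below are hypothetical toy values chosen only to show the shape of the computation:

```python
import numpy as np

def match_regions(template_desc, region_descs, thresh=0.5):
    """Return indices of regions whose descriptor lies within
    Euclidean distance `thresh` of the template descriptor.
    Matching touches one entry per segmented region instead of
    sliding a window over every pixel of every frame."""
    dists = np.linalg.norm(region_descs - template_desc, axis=1)
    return np.nonzero(dists < thresh)[0]

# Toy example: a 2D template descriptor against three region descriptors.
template = np.array([0.0, 0.0])
regions = np.array([[0.0, 0.1], [5.0, 5.0], [0.2, 0.0]])
hits = match_regions(template, regions)
print(hits)  # [0 2]
```

With the segmentation and descriptors precomputed offline, the per-query cost grows with the number of regions, which is orders of magnitude smaller than the number of candidate sliding-window positions.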