Volumetric Descriptors for Efficient Video Analysis

NSF Grant IIS-0534962

Yan Ke, Rahul Sukthankar, and Martial Hebert

Motivation

The amount of digital video data has grown exponentially in recent years due to the increasing affordability of digital consumer video cameras, large-scale deployment of video surveillance systems, ease of digital content creation, and availability of high-speed networks and high-capacity storage devices. Unfortunately, the technology for searching, indexing, and retrieving video content has failed to keep pace, and manual organization and annotation of this content is becoming infeasible. The goal of this project is to develop algorithms and tools to enable near real-time processing of large-scale (100 TB) video collections. We focus on the early stages of processing, namely the extraction of spatio-temporal features corresponding to interesting events.


Approach

We argue that video should be treated as a three-dimensional volume, and thus that the fundamental processing unit should be a 3D block spanning many frames rather than a single frame. Only recently have researchers begun to process blocks of video frames jointly [2, 6, 7]. Just as researchers have decomposed images into their constituent shapes and used 2D shape descriptors for analysis [1, 4], video can be thought of as a collection of 3D volumes. There are several advantages to jointly analyzing a video’s space and time dimensions. First, spatial and temporal consistency can be easily maintained. Second, instead of analyzing pixels over many frames, higher-level algorithms can focus on large, sparse regions for improved efficiency. Finally, the appearance and motion of objects in the scene can be jointly modeled, which can potentially lead to better recognition results.
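To make this representation concrete, the following is a minimal sketch, in Python with NumPy, of treating video as a single 3D array and slicing out a spatio-temporal block; the function names are purely illustrative and not from the project's code.

    import numpy as np

    def frames_to_volume(frames):
        # Stack equally sized grayscale frames into one spatio-temporal
        # volume of shape (T, H, W); time becomes just another axis.
        return np.stack(frames, axis=0)

    def extract_block(volume, t0, t1, y0, y1, x0, x1):
        # Slice out the 3D sub-block spanning frames t0..t1 and the
        # spatial rectangle (y0..y1, x0..x1).
        return volume[t0:t1, y0:y1, x0:x1]

    # Example: 30 frames of 240x320 video become one (30, 240, 320) volume,
    # and a 10-frame region of interest is just a contiguous sub-array.
    frames = [np.zeros((240, 320), dtype=np.uint8) for _ in range(30)]
    volume = frames_to_volume(frames)
    block = extract_block(volume, 5, 15, 60, 120, 100, 200)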


Here, we show a video segment decomposed into 3D volumes.

[Figure: Frames 1-5 of the segment, with the extracted 3D volumes labeled Shirt, Pants, Legs, Ball, and Composite.]


This shows how we transform action recognition in video into a 3D shape matching problem.  Notice the distinctive shapes that appear as people move.

[Figure: sequence start frame, sequence end frame, and 3D volumetric representation for a handwave sequence from Schuldt's dataset and for a ballet sequence.]

Results

By representing actions as spatio-temporal events, we have extended the Viola-Jones AdaBoost framework [5] to recognize actions [3].  By learning a set of 3D box features on the integral volume, we are able to detect events such as "sit down", "hand wave", and "grab cup".  Below we show our detector recognizing the "grab-cup" event.  Note that our detector localizes the event in both space and time.
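The efficiency of these box features rests on the integral volume, the 3D analogue of the integral image in [5]: after one pass of cumulative sums, any axis-aligned 3D box can be summed with eight lookups. Below is a minimal sketch of the idea in Python with NumPy; the helper names and the example feature are ours and are not the exact features learned in [3].

    import numpy as np

    def integral_volume(video):
        # iv[t, y, x] holds the sum of all voxels video[:t+1, :y+1, :x+1].
        # NumPy promotes small integer dtypes, so overflow is not a concern.
        return video.cumsum(axis=0).cumsum(axis=1).cumsum(axis=2)

    def box_sum(iv, t0, t1, y0, y1, x0, x1):
        # Sum of video[t0:t1, y0:y1, x0:x1] via eight lookups
        # (3D inclusion-exclusion); bounds are half-open as in slicing.
        def v(t, y, x):
            if t <= 0 or y <= 0 or x <= 0:
                return 0
            return iv[t - 1, y - 1, x - 1]
        return (v(t1, y1, x1) - v(t0, y1, x1) - v(t1, y0, x1) - v(t1, y1, x0)
                + v(t0, y0, x1) + v(t0, y1, x0) + v(t1, y0, x0) - v(t0, y0, x0))

    def temporal_contrast_feature(iv, t0, t1, y0, y1, x0, x1):
        # One illustrative 3D box feature: the difference between the first
        # and second temporal halves of a box, which responds to motion.
        tm = (t0 + t1) // 2
        return (box_sum(iv, t0, tm, y0, y1, x0, x1)
                - box_sum(iv, tm, t1, y0, y1, x0, x1))

Each feature evaluation costs a constant number of lookups regardless of box size, which is what makes boosting over a large pool of candidate features tractable.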


We show some preliminary results from our current experiments on a ballet sequence.  We manually select a template that consists of a ballet dancer holding a particular pose.  "Template Regions" shows an over-segmentation of the volume using mean shift.  Using the template volume, we then match against other parts of the video.

Below we show key frames taken every 0.5 seconds in the video.  The first frame shows the template, and the matched poses are labeled.  Note that because the segmentation can be computed once and saved, it does not need to be performed online.  Further, given the segmentation there is no need to run a sliding window through the video, so matching and retrieval can be done very quickly.
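As a rough sketch of why this is fast, under our own simplifications (binary voxel masks aligned in a common coordinate frame, with Jaccard overlap standing in for the actual match score), matching reduces to scoring each precomputed region against the template rather than sweeping a window over the whole volume:

    import numpy as np

    def overlap_score(template_mask, region_mask):
        # Jaccard overlap between two binary 3D masks of equal shape.
        inter = np.logical_and(template_mask, region_mask).sum()
        union = np.logical_or(template_mask, region_mask).sum()
        return inter / union if union > 0 else 0.0

    def match_template(template_mask, segments, threshold=0.5):
        # The segments come from the offline over-segmentation, so online
        # matching only needs to score each region mask once.
        return [i for i, seg in enumerate(segments)
                if overlap_score(template_mask, seg) >= threshold]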

References

  1. S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
  2. D. DeMenthon and D. Doermann. Video retrieval of near-duplicates using k-nearest neighbor retrieval of spatio-temporal descriptors. Multimedia Tools and Applications, 2005.
  3. Y. Ke, R. Sukthankar, and M. Hebert. Efficient visual event detection using volumetric features. In Proceedings of the International Conference on Computer Vision, 2005.
  4. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
  5. P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of Computer Vision and Pattern Recognition, 2001.
  6. J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In Proceedings of the European Conference on Computer Vision, 2004.
  7. Y. Wexler, E. Shechtman, and M. Irani. Space-time behavior based correlation. In Proceedings of Computer Vision and Pattern Recognition, 2005.