Description
Extracting a rich representation of an environment from visual sensor readings can benefit many tasks in robotics, e.g., path planning, mapping, and object manipulation. While important progress has been made, it remains a difficult problem to effectively parse entire scenes, i.e., to recognize semantic objects, man-made structures, and landforms. This process requires not only recognizing individual entities but also understanding the contextual relations among them.
The prevalent approach to encoding such relationships is to use a joint probabilistic or energy-based model, which enables one to naturally write down these interactions. Unfortunately, performing exact inference over these expressive models is often intractable, and in practice we can only approximate the solution. While there is a range of sophisticated approximate inference techniques to choose from, the combination of learning and approximate inference for these expressive models is still poorly understood in theory and limited in practice. Furthermore, using approximate inference on any learned model often leads to suboptimal predictions due to the inherent approximations.
As we ultimately care about predicting the correct labeling of a scene, and not necessarily about learning a joint model of the data, this work proposes instead to view the approximate inference process as a modular procedure that is directly trained to produce correct labelings. Inspired by early hierarchical models in the computer vision literature for scene parsing, the proposed inference procedure is structured to incorporate both feature descriptors and contextual cues computed at multiple resolutions within the scene. We demonstrate that this inference machine framework for parsing scenes via iterated predictions offers the best of both worlds: state-of-the-art classification accuracy and computational efficiency when processing images and/or unorganized 3-D point clouds.
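To make the iterated-prediction idea concrete, here is a minimal sketch of one way such an inference machine can sweep over a segmentation hierarchy, feeding each predictor both region descriptors and contextual cues derived from the current predictions at neighboring levels. All names (`predictors`, `features`, `parents`, the `predict_proba`-style interface) are illustrative assumptions, not the released implementation.

```python
import numpy as np

def contextual_cues(beliefs, parents, level):
    """Simplest possible context: each region inherits its parent's current
    class distribution (a real system would also pool the predictions of
    children and spatial neighbors)."""
    if level == 0:
        return np.zeros_like(beliefs[0])
    return beliefs[level - 1][parents[level]]

def parse_scene(predictors, features, parents, num_classes, num_sweeps=3):
    """Iterated predictions over a multi-level segmentation hierarchy.

    predictors[l] -- trained multi-class predictor for level l
    features[l]   -- (num_regions_l, d) descriptor matrix for level l
    parents[l]    -- index of each level-l region's parent at level l-1
    """
    num_levels = len(features)
    # start from uniform class distributions at every level
    beliefs = [np.full((f.shape[0], num_classes), 1.0 / num_classes)
               for f in features]

    for _ in range(num_sweeps):
        # one coarse-to-fine pass followed by one fine-to-coarse pass
        order = list(range(num_levels)) + list(range(num_levels - 2, -1, -1))
        for level in order:
            context = contextual_cues(beliefs, parents, level)
            inputs = np.hstack([features[level], context])
            beliefs[level] = predictors[level].predict_proba(inputs)

    return beliefs[-1].argmax(axis=1)  # hard labeling at the finest level
```

Training follows the same schedule: each predictor is fit on the features and context produced by the predictors that precede it in the sequence, so the learners see the same kind of imperfect context at training time as at test time.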
Updated Results (As of April 26, 2013)
The performance on the Stanford Background dataset is:
- Overall pixel accuracy: 81.6%
- Average per-class accuracy: 71.8%
- Using simple Felzenszwalb-Huttenlocher (F-H) segmentation to create a 4-level hierarchy
- Iterating up and down the hierarchy, as in the ICRA 2011 paper below
- Using the feature descriptors provided by Ladicky 2011
- Using the vector quantization described by Coates 2011 (a sketch of this encoding follows the timings below)
- Using multi-output regression trees (instead of one tree per class) during boosting
The approximate per-stage computation times (in seconds per image) are:
- Segmentations: 0.095
- Features: 0.462
- Inference: 0.037
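As a rough illustration of the vector-quantization step above, in the spirit of Coates 2011, the sketch below learns a k-means codebook and applies a soft "triangle" activation. The function names and the scikit-learn dependency are assumptions for illustration, not the code used to produce the numbers above.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_dictionary(descriptors, num_centroids=128, seed=0):
    """Learn a k-means codebook from (whitened) local descriptors."""
    km = KMeans(n_clusters=num_centroids, random_state=seed, n_init=10)
    km.fit(descriptors)
    return km.cluster_centers_

def triangle_encode(descriptors, centroids):
    """Soft 'triangle' encoding: activation grows as a descriptor gets
    closer than average to a centroid, and is clipped to zero otherwise."""
    # pairwise distances: (num_descriptors, num_centroids)
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    mean_dist = dists.mean(axis=1, keepdims=True)
    return np.maximum(0.0, mean_dist - dists)
```

Each region's code can then be pooled (e.g., averaged) over the pixels or points it contains before being handed to the boosted regression trees.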
Datasets
- Stanford Background Dataset
- MSRC Object Class Recognition
- Geometric Surface Context
- CMU/VMR Oakland 3-D Scenes
- CMU/VMR Urban Image+Laser Dataset (1.1 GB)
Code
The original naive Matlab implementation of the ECCV 2010 paper: [code]
Presentations
- ECCV 2010 talk: [pptx] [pdf]
- ICRA 2011 talk: [pptx] [pdf]
- CVPR 2011 poster: [pdf]
- ECCV 2012 poster: [pdf]
- ECCV 2014 talk and poster: [link]
References
- Stacked Hierarchical Labeling. D. Munoz, J. A. Bagnell, M. Hebert. ECCV 2010. [pdf] [project page] [bibtex] (See the project page for updated results!)
- Learning Message-Passing Inference Machines for Structured Prediction. S. Ross, D. Munoz, M. Hebert, J. A. Bagnell. CVPR 2011. [pdf] [project page] [bibtex]
- 3-D Scene Analysis via Sequenced Predictions over Points and Regions. X. Xiong, D. Munoz, J. A. Bagnell, M. Hebert. ICRA 2011. [pdf] [project page] [bibtex]
- Co-inference for Multi-modal Scene Analysis. D. Munoz, J. A. Bagnell, M. Hebert. ECCV 2012. [pdf] [project page] [bibtex]
- Pose Machines: Articulated Pose Estimation via Inference Machines. V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, Y. Sheikh. ECCV 2014. [pdf] [project page] [bibtex]
- Inference Machines: Parsing Scenes via Iterated Predictions. D. Munoz. PhD Thesis, Carnegie Mellon University, 2013. [pdf] [bibtex]
Funding
- QinetiQ North America Robotics Fellowship
- ONR MURI grant N00014-09-1-1052, Reasoning in Reduced Information Spaces
- Collaborative Technology Alliance Program, Cooperative Agreement W911NF-10-2-0016