Description

Extracting a rich representation of an environment from visual sensor readings can benefit many tasks in robotics, e.g., path planning, mapping, and object manipulation. While important progress has been made, it remains a difficult problem to effectively parse entire scenes, i.e., to recognize semantic objects, man-made structures, and landforms. This process requires not only recognizing individual entities but also understanding the contextual relations among them.

The prevalent approach to encoding such relationships is to use a joint probabilistic or energy-based model, which lets one write down these interactions naturally. Unfortunately, exact inference over these expressive models is often intractable, so we can only approximate the solutions. While there is a set of sophisticated approximate inference techniques to choose from, the combination of learning and approximate inference for these expressive models is still poorly understood in theory and limited in practice. Furthermore, using approximate inference on any learned model often leads to suboptimal predictions due to the inherent approximations.
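For concreteness, one common form of such a model is a pairwise energy over a graph of scene elements. The symbols below (nodes V, edges \mathcal{E}, unary terms \phi_i, pairwise terms \psi_{ij}) are illustrative assumptions, not notation taken from the publications listed on this page:

```latex
E(\mathbf{y} \mid \mathbf{x})
  = \sum_{i \in V} \phi_i(y_i, \mathbf{x})
  + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j, \mathbf{x}),
\qquad
p(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\bigl(-E(\mathbf{y} \mid \mathbf{x})\bigr)}{Z(\mathbf{x})},
\qquad
\mathbf{y}^{*} = \arg\min_{\mathbf{y}} E(\mathbf{y} \mid \mathbf{x}).
```

Here \mathbf{x} denotes the sensor data and \mathbf{y} the labeling of all elements. With more than two labels and arbitrary pairwise terms, computing \mathbf{y}^{*} exactly is NP-hard in general, which is why approximate inference is used in practice.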

As we ultimately care about predicting the correct labeling of a scene, and not necessarily learning a joint model of the data, this work proposes to instead view the approximate inference process as a modular procedure that is directly trained in order to produce a correct labeling of the scene. Inspired by early hierarchical models in the computer vision literature for scene parsing, the proposed inference procedure is structured to incorporate both feature descriptors and contextual cues computed at multiple resolutions within the scene. We demonstrate that this inference machine framework for parsing scenes via iterated predictions offers the best of both worlds: state-of-the-art classification accuracy and computational efficiency when processing images and/or unorganized 3-D point clouds.
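The idea of iterated predictions can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the implementation from the publications: the scene is a 1-D chain of nodes rather than a hierarchy of regions at multiple resolutions, each stage is a simple nearest-centroid classifier, the contextual cue is the average of the neighbors' previous beliefs, and every stage is trained on its own training-set predictions (a held-out, stacking-style scheme would normally be preferred to limit overfitting). All names (`CentroidPredictor`, `context_features`, `train_inference_machine`, `run_inference_machine`) are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CentroidPredictor:
    """Toy soft classifier: softmax over negative squared distances to class centroids."""

    def fit(self, X, y, num_classes):
        dim = len(X[0])
        sums = [[0.0] * dim for _ in range(num_classes)]
        counts = [0] * num_classes
        for x, label in zip(X, y):
            counts[label] += 1
            for j, v in enumerate(x):
                sums[label][j] += v
        self.centroids = [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]
        return self

    def predict_proba(self, x):
        return softmax([-sum((a - b) ** 2 for a, b in zip(x, c)) for c in self.centroids])


def context_features(beliefs, i):
    """Contextual cue for node i: mean previous-stage belief of its chain neighbors."""
    k = len(beliefs[0])
    nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(beliefs)]
    if not nbrs:
        return [1.0 / k] * k
    return [sum(beliefs[j][c] for j in nbrs) / len(nbrs) for c in range(k)]


def augment(features, beliefs):
    """Concatenate each node's own descriptor with its contextual cue."""
    return [list(f) + context_features(beliefs, i) for i, f in enumerate(features)]


def train_inference_machine(features, labels, num_classes, num_stages=3):
    """Train a sequence of predictors; each stage sees the previous stage's beliefs."""
    beliefs = [[1.0 / num_classes] * num_classes for _ in features]
    stages = []
    for _ in range(num_stages):
        X = augment(features, beliefs)
        stage = CentroidPredictor().fit(X, labels, num_classes)
        beliefs = [stage.predict_proba(x) for x in X]
        stages.append(stage)
    return stages


def run_inference_machine(stages, features, num_classes):
    """Inference is a fixed sequence of predictions: one pass per trained stage."""
    beliefs = [[1.0 / num_classes] * num_classes for _ in features]
    for stage in stages:
        X = augment(features, beliefs)
        beliefs = [stage.predict_proba(x) for x in X]
    return [max(range(num_classes), key=b.__getitem__) for b in beliefs]


# Synthetic chain: the first 20 nodes belong to class 0, the last 20 to class 1.
labels = [0] * 20 + [1] * 20
features = [[0.1 * (i % 3)] if y == 0 else [1.0 - 0.1 * (i % 3)]
            for i, y in enumerate(labels)]

stages = train_inference_machine(features, labels, num_classes=2)
predicted = run_inference_machine(stages, features, num_classes=2)
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print("training accuracy:", accuracy)
```

Note that inference here costs a fixed number of classifier evaluations per node, with no optimization over a joint model, which is the source of the efficiency argument made above.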


Updated Results (As of April 26, 2013)

The performance on the Stanford Background dataset is:

Image classifications are available [here].

Differences from the ECCV 2010 publication are:

Computation time breakdown per image (seconds):

Videos






Datasets


Code

The original naive MATLAB implementation of the ECCV 2010 paper: [code]
Creative Commons License


Presentations


References

ECCV 2010 Stacked Hierarchical Labeling
D. Munoz, J. A. Bagnell, M. Hebert
ECCV 2010 Oral Presentation
[pdf] [project page] [bibtex]
See the project page for updated results!

CVPR 2011 Learning Message-Passing Inference Machines for Structured Prediction
S. Ross, D. Munoz, M. Hebert, J. A. Bagnell
CVPR 2011
[pdf] [project page] [bibtex]

ICRA 2011 3-D Scene Analysis via Sequenced Predictions over Points and Regions
X. Xiong, D. Munoz, J. A. Bagnell, M. Hebert
ICRA 2011 Best Vision Paper Award Finalist
[pdf] [project page] [bibtex]

ECCV 2012 Co-inference for Multi-modal Scene Analysis
D. Munoz, J. A. Bagnell, M. Hebert
ECCV 2012
[pdf] [project page] [bibtex]

ECCV 2014 Pose Machines: Articulated Pose Estimation via Inference Machines
V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, Y. Sheikh
ECCV 2014 Oral Presentation
[pdf] [project page] [bibtex]


Inference Machines: Parsing Scenes via Iterated Predictions
D. Munoz
PhD Thesis, Carnegie Mellon University, 2013
[pdf] [bibtex]

Funding