Your Course Project
Your class project is an opportunity for you to explore an interesting graphical-models problem of your choice in the context of a real-world data set. All projects must have an implementation component, though theoretical aspects may also be explored. You should also evaluate your approach, preferably on real-world data. Below you will find some project ideas, but the best option is usually to combine graphical models with problems in your own research area. Your class project must be about new work you have done this semester; you cannot use results you have developed in previous semesters. If you are uncertain about this requirement, please email the instructors.
Projects can be done by you as an individual, or in teams of two students. Each project will also be assigned a 708 instructor as a project consultant/mentor. They will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 30% of your final class grade, and will have two final deliverables:
- a writeup in the format of a NIPS paper (8 pages maximum in NIPS format, including references; this page limit is strict), due Dec 3rd by 3pm by email to the instructors list, worth 60% of the project grade, and
- a poster presenting your work for a special class poster session on Dec 1st, 3-6pm in the NSH Atrium, worth 20% of the project grade.
In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by Nov 6th (by 5pm, start of recitation) (either by email or submitted to Michelle), worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.
Project Proposal
You must turn in a brief project proposal (1 page maximum) by Oct 8th in class. Read the list of potential project ideas below. You are encouraged to use one of these ideas. If you prefer to do a different project and are proposing your own data set, you must have access to this data already and present a clear proposal for what you would do with it.
Project proposal format: Proposals should be one page maximum. Include the following information:
- Project title
- Project idea. This should be approximately two paragraphs.
- Data set you will use.
- Software you will need to write.
- Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.
- Teammate: will you have a teammate? If so, whom? Maximum team size is two students.
- Nov 3 milestone: What will you complete by Nov 3? Experimental results of some kind are expected here.
Project suggestions:
Ideally, you will want to pick a problem in a domain of your interest, e.g., computer vision, natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using graphical models. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.
You can also find some project ideas below.
Topics
For each of the topics we provide some suggested readings. If you're interested in the problem, these are the references to start with. Do not consider these references exhaustive; you will be expected to review the literature in greater depth for your project. While you are not forced to choose one of these topics, it is strongly advised that you talk to the instructor if you want to deviate significantly from the topics below.
Topic A: Structure Learning
This area refers to finding the qualitative (graph) structure of a set of variables in either a directed or undirected graphical model. Potential projects include
- Comparing structure learning algorithms for Bayesian networks (e.g., hill-climbing, PDAGs, optimal reinsertion) in terms of quality of density estimation, sensitivity to the size of the data set, classification performance, etc. (a minimal hill-climbing sketch appears after this list)
- Structure search given a fixed ordering -- if you are given a total ordering of the variables x1...xn where the parents of xi are a subset of x1...xi-1, structure learning becomes simpler than search over the space of directed acyclic graphs (K&F 17.4.2)
- Learning the structure of an undirected graphical model (Abbeel et al. 2006, Parise and Welling 2006, Lee et al. NIPS 2006, Wainwright et al. NIPS 2006)
- Learning compact representations for conditional probability distributions -- in discrete Bayesian networks, a node with many parents has a large CPD. It is possible that given a particular assignment to a few of the parents, the rest of the parents do not matter (context-specific independence), which can lead to a compact representation of a CPD (K&F 17.6 and 5.3).
- Bayesian model averaging -- instead of finding the single best structure for a Bayesian network, compute a posterior distribution over structures (K&F 17.5)
- Optimal structure learning -- the naive algorithms are super-exponential in the number of variables, but both the optimal MAP (Singh & Moore 2005) and optimal BMA (Koivisto & Sood 2004) structures can be computed in exponential time at the cost of exponential memory.
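To make the search space concrete, here is a minimal sketch of score-based structure learning: greedy hill-climbing by edge additions with a BIC score on fully observed discrete data. The function names, the addition-only search, and the BIC penalty are illustrative choices, not part of the course materials; a real comparison would also consider edge deletions, reversals, and random restarts.

```python
import numpy as np

def bic_family_score(data, child, parents, arities):
    """BIC contribution of one node given a candidate parent set
    (fully observed discrete data, maximum-likelihood counts)."""
    n = data.shape[0]
    r = arities[child]
    q = int(np.prod([arities[p] for p in parents])) if parents else 1
    cfg = np.zeros(n, dtype=int)          # index of each row's joint parent configuration
    for p in parents:
        cfg = cfg * arities[p] + data[:, p]
    counts = np.zeros((q, r))
    for c, v in zip(cfg, data[:, child]):
        counts[c, v] += 1
    loglik = 0.0
    for j in range(q):
        nj = counts[j].sum()
        for k in range(r):
            if counts[j, k] > 0:
                loglik += counts[j, k] * np.log(counts[j, k] / nj)
    return loglik - 0.5 * np.log(n) * q * (r - 1)   # penalize q*(r-1) free parameters

def creates_cycle(parents, child, new_parent):
    """True if adding new_parent -> child would create a directed cycle,
    i.e. if child is already an ancestor of new_parent."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(data, arities, max_parents=3):
    """Greedy edge-addition hill-climbing over DAGs; data is an (n_samples, n_vars)
    integer array, arities[i] is the number of states of variable i."""
    n_vars = data.shape[1]
    parents = {i: set() for i in range(n_vars)}
    score = {i: bic_family_score(data, i, [], arities) for i in range(n_vars)}
    while True:
        best_gain, best_move = 0.0, None
        for child in range(n_vars):
            if len(parents[child]) >= max_parents:
                continue
            for cand in range(n_vars):
                if cand == child or cand in parents[child] or creates_cycle(parents, child, cand):
                    continue
                s = bic_family_score(data, child, sorted(parents[child] | {cand}), arities)
                if s - score[child] > best_gain:
                    best_gain, best_move = s - score[child], (cand, child, s)
        if best_move is None:
            return parents                # local optimum: no single edge addition helps
        cand, child, s = best_move
        parents[child].add(cand)
        score[child] = s
```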
Koller & Friedman Chapter 17
Pieter Abbeel, Daphne Koller and Andrew Y. Ng.
Learning Factor Graphs in Polynomial Time & Sample Complexity.
Journal of Machine Learning Research, 7(Aug):1743--1788, 2006.
http://ai.stanford.edu/~pabbeel//pubs/abbeel06a.pdf
High dimensional graphical model selection using L1-regularized logistic regression. Martin Wainwright, Pradeep Ravikumar, John Lafferty. NIPS 2006
http://www.cs.cmu.edu/~pradeepr/papers/graphl1nips06.pdf
S.-I. Lee, V. Ganapathi, and D. Koller (2007). "Efficient Structure Learning of Markov Networks using L1-Regularization." Advances in Neural Information Processing Systems (NIPS 2006).
http://ai.stanford.edu/~koller/Papers/Lee+al:NIPS06.pdf
Sridevi Parise and Max Welling (2006) Structure Learning in Markov Random Fields, NIPS 2006
http://www.ics.uci.edu/~welling/publications/papers/StructLearnMRF-submit.pdf
D. Margaritis. Distribution-Free Learning of Bayesian Network Structure in Continuous Domains. Proceedings of The Twentieth National Conference on Artificial Intelligence (AAAI), Pittsburgh, PA, July 2005.
http://www.cs.iastate.edu/~dmarg/Papers/Margaritis-AAAI05.pdf
Yuhong Guo and Russ Greiner (2005), "Discriminative Model Selection for Belief Net Structures". In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05).
http://www.cs.ualberta.ca/~yuhong/research/papers/bnmodelgg.pdf
Ajit Singh and Andrew Moore (2005), Finding Optimal Bayesian Networks by Dynamic Programming. Tech Report CMU-CALD-05-106
http://reports-archive.adm.cs.cmu.edu/anon/cald/CMU-CALD-05-106.pdf
Mikko Koivisto and Kismat Sood (2004), Exact Bayesian Structure Discovery in Bayesian Networks. JMLR 5.
http://jmlr.csail.mit.edu/papers/volume5/koivisto04a/koivisto04a.pdf
Topic B: Inference
The most common use of a probabilistic graphical model is computing queries, the conditional distribution of a set of variables given an assignment to a set of evidence variables. In general, this problem is NP-hard, which has led to a number of algorithms (both exact and approximate). Potential topics include
- Comparing approximate inference algorithms in terms of accuracy, computational complexity, and sensitivity to parameters. Exact algorithms include junction trees and bucket elimination; a toy variable-elimination example appears after this list. On larger networks one typically resorts to algorithms that produce approximate solutions, such as sampling (Monte Carlo methods), variational inference, and generalized belief propagation.
- Adaptive Generalized Belief Propagation (Welling 2004) & Expectation Propagation (K&F 10) -- compare these methods to each other and to Gibbs sampling.
- Convex procedures -- methods that perform approximate inference by convex relaxation (Wainwright 2002 and Mudigonda et al. 2007)
- Linear programming methods for approximating the MAP assignment (Wainwright et al. 2005b, Yanover et al. 2006, Sontag et al. 2008)
- Recursive conditioning -- an any-space inference algorithm that recursively decomposes an inference query on a general Bayesian network into inferences on smaller subnetworks (Darwiche 2001).
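As a tiny illustration of the exact baseline, here is sum-product variable elimination on a three-node chain A -> B -> C; the CPT numbers are made up for illustration.

```python
import numpy as np

p_a = np.array([0.6, 0.4])                # P(A)
p_b_given_a = np.array([[0.7, 0.3],       # P(B | A=0)
                        [0.2, 0.8]])      # P(B | A=1)
p_c_given_b = np.array([[0.9, 0.1],       # P(C | B=0)
                        [0.5, 0.5]])      # P(C | B=1)

psi_b = p_a @ p_b_given_a                 # eliminate A: psi(B) = sum_a P(a) P(B|a)
p_c = psi_b @ p_c_given_b                 # eliminate B: P(C)  = sum_b psi(b) P(C|b)
print("P(C) by elimination:", p_c)

# Sanity check against brute-force enumeration of the full joint.
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_b[None, :, :]
print("P(C) by enumeration:", joint.sum(axis=(0, 1)))
```

On trees this idea is exact and cheap; the project ideas above are about what to do when large treewidth makes exact elimination intractable.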
Koller & Friedman Chapters 8-12
Adnan Darwiche
Recursive Conditioning
In Artificial Intelligence Journal. Vol 125, No 1-2, pages 5-41. 2001.
http://reasoning.cs.ucla.edu/fetch.php?id=18&type=ps
T. Jaakkola.
Tutorial on variational approximation methods.
In Advanced mean field methods: theory and practice. MIT Press, 2000.
http://people.csail.mit.edu/tommi/papers/Jaa-var-tutorial.ps
An Introduction to Variational Methods for Graphical Models M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. In M. I. Jordan (Ed.), Learning in Graphical Models, Cambridge: MIT Press, 1999.
http://www.cs.berkeley.edu/~jordan/papers/variational-intro.pdf
Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Generalized Belief Propagation", Advances in Neural Information Processing Systems (NIPS), Vol 13, pps 689-695, December 2000
http://www.merl.com/reports/docs/TR2000-26.pdf
Yedidia, J.S.; Freeman, W.T.; Weiss, Y., "Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms", IEEE Transactions on Information Theory, ISSN; 0018-9448, Vol. 51, Issue 7, pp. 2282-2312, July 2005
http://www.merl.com/reports/docs/TR2004-040.pdf
M. J. Wainwright, T. Jaakkola and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. on Information Theory, vol. 51, page 2313--2335, July 2005
http://www.eecs.berkeley.edu/~wainwrig/Papers/WaiJaaWil05_Upper.pdf
M. J. Wainwright, "Stochastic Processes on Graphs: Geometric and Variational Approaches", Ph.D. Thesis, Department of EECS, Massachusetts Institute of Technology, 2002.
http://www.eecs.berkeley.edu/~wainwrig/Papers/Final2_Phd_May30.pdf
Pawan Mudigonda, Vladimir Kolmogorov, and Philip Torr. An Analysis of Convex Relaxations for MAP Estimation.
http://www.robots.ox.ac.uk/~pawan/kumar07c.pdf
M. J. Wainwright, T. S. Jaakkola and A. S. Willsky,
MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming
approaches. IEEE Transactions on Information Theory, Vol. 51(11), pages 3697--3717. November 2005.
http://people.csail.mit.edu/tommi/papers/WaiJaaWil_TRMAP_arxiv.pdf
Linear Programming Relaxations and Belief Propagation - an Empirical Study
Chen Yanover, Talya Meltzer, Yair Weiss
JMLR Special Issue on Machine Learning and Large Scale Optimization, Sep 2006
http://www.jmlr.org/papers/volume7/yanover06a/yanover06a.pdf
D. Sontag, T. Meltzer, A. Globerson, Y. Weiss, T. Jaakkola. "Tightening LP Relaxations for MAP using Message Passing". Uncertainty in Artificial Intelligence UAI 2008
http://people.csail.mit.edu/dsontag/papers/sontag_uai08.pdf
Max Welling. On the Choice of Regions for Generalized Belief Propagation
UAI 2004
http://www.ics.uci.edu/~welling/publications/papers/ClusterChoice.pdf
Max Welling, Tom Minka and Yee Whye Teh (2005) Structured Region Graphs: Morphing EP into GBP. UAI 2005
http://www.ics.uci.edu/~welling/publications/papers/full.pdf
Topic C: Temporal Models
There are many applications where we want to explicitly model time (control, forecasting, online learning). Hidden Markov Models are one of the simplest discrete-time models, but there are many others: Kalman filters for continuous state spaces, factorial Hidden Markov Models for problems with many hidden variables that allow efficient variational inference, and dynamic Bayesian networks, which allow arbitrarily complex relationships between hidden and observed variables. A toy forward-filtering sketch for an HMM appears after the project list. Projects include
- Comparing the performance of factorial Hidden Markov Models (Ghahramani & Jordan 1997) to dynamic Bayesian networks (K&F 7.2 and 13).
- Experimental evaluation of approximate inference algorithms for DBNs, such as Boyen-Koller, particle filtering, and thin junction trees (Paskin 2003). Kevin Murphy's thesis provides a good overview of inference in DBNs.
- Comparing Kalman filters against more general DBN models.
- Application of temporal models in vision (Nefian et al. 2002) and bioinformatics (Shi et al. 2007).
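For reference, a minimal sketch of exact filtering in a discrete-observation HMM (the forward recursion). The two-state parameters below are made up for illustration and the function name is not from any of the cited references.

```python
import numpy as np

def hmm_filter(pi, A, B, obs):
    """Forward (filtering) recursion for a discrete-observation HMM.
    pi[k] = P(z_1=k), A[i, j] = P(z_t=j | z_{t-1}=i), B[k, m] = P(x_t=m | z_t=k).
    Returns the filtered posteriors P(z_t | x_{1:t}) and the log-likelihood of obs."""
    filtered, loglik = [], 0.0
    predict = pi                                  # P(z_1)
    for x in obs:
        alpha = predict * B[:, x]                 # P(z_t, x_t | x_{1:t-1})
        norm = alpha.sum()                        # P(x_t | x_{1:t-1})
        loglik += np.log(norm)
        posterior = alpha / norm                  # P(z_t | x_{1:t})
        filtered.append(posterior)
        predict = posterior @ A                   # P(z_{t+1} | x_{1:t})
    return np.array(filtered), loglik

# Toy 2-state example with made-up parameters.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
posteriors, ll = hmm_filter(pi, A, B, obs=[0, 0, 1, 1])
print(posteriors, ll)
```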
K&F Chapters 7.2 and 13
Ghahramani, Z. and Jordan, M.I. (1997). Factorial Hidden Markov Models. Machine Learning 29: 245-273
http://www.gatsby.ucl.ac.uk/~zoubin/papers/fhmmML.ps.gz
Kevin Murphy's PhD Thesis.
http://www.cs.ubc.ca/~murphyk/Thesis/thesis.pdf
Kevin Murphy's book chapter on DBNs:
http://www.cs.ubc.ca/~murphyk/Papers/dbnchapter.pdf
Xavier Boyen and Daphne Koller, Tractable Inference for Complex Stochastic Processes, in Uncertainty in Artificial Intelligence UAI '98, 1998.
http://ai.stanford.edu/~xb/uai98/index.html
Xavier Boyen and Daphne Koller, Exploiting the Architecture of Dynamic Systems, in National Conference on Artificial Intelligence AAAI '99, 1999.
http://ai.stanford.edu/~xb/aaai99/index.html
Mark A. Paskin (2003). Thin Junction Tree Filters for Simultaneous Localization and Mapping. In G. Gottlob and T. Walsh eds., Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence ( IJCAI-03), pp. 1157–1164. San Francisco, CA: Morgan Kaufmann.
http://ai.stanford.edu/~paskin/pubs/Paskin2003a.pdf
Y. Shi, F. Guo, W. Wu and E. P. Xing, GIMscan: A New Statistical Method for Analyzing Whole-Genome Array CGH Data, The Eleventh Annual International Conference on Research in Computational Molecular Biology (RECOMB 2007).
http://www.springerlink.com/content/3j5316255510j15v/fulltext.pdf
A. Nefian, L. Liang, X. Pi, X. Liu and K. Murphy. Dynamic Bayesian Networks for Audio-Visual Speech Recognition. EURASIP Journal of Applied Signal Processing, 11:1-15, 2002
http://www.cs.ubc.ca/~murphyk/Papers/avsr_journal.pdf
Topic D: Hierarchical Bayes Topic Models
Statistical topic models have recently gained much popularity for managing large collections of text documents. These models make the fundamental assumption that a document is a mixture of topics, where the mixture proportions are document-specific and signify how important each topic is to the document. Each topic is in turn a multinomial distribution over a given vocabulary, which dictates how important each word is for that topic. The document-specific mixture proportions provide a low-dimensional representation of the document in topic space. This representation captures the latent semantics of the collection and can then be used for tasks like classification and clustering, or merely as a tool to browse an otherwise unstructured collection. The most famous such model is Latent Dirichlet Allocation (LDA; Blei et al. 2003). LDA has been the basis for many extensions in text, vision, bioinformatics, and social networks. These extensions incorporate more dependency structure into the generative process (e.g., modeling author-topic dependencies) or implement more sophisticated ways of representing inter-topic relationships. A sketch of the LDA generative process appears below.
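For concreteness, here is a minimal sketch of the LDA generative process just described, which is also a convenient way to produce the simulated corpora mentioned in the first project idea below. The function name and the parameter values are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lda_corpus(n_docs, doc_len, n_topics, vocab_size, alpha=0.1, beta=0.01):
    """Draw a synthetic corpus from the LDA generative model.
    alpha: Dirichlet prior on per-document topic proportions theta_d.
    beta:  Dirichlet prior on per-topic word distributions phi_k."""
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)   # topic-word distributions
    docs, thetas = [], []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, alpha))             # doc-topic proportions
        z = rng.choice(n_topics, size=doc_len, p=theta)             # topic assignment per token
        words = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), phi

docs, thetas, phi = sample_lda_corpus(n_docs=100, doc_len=50, n_topics=5, vocab_size=200)
```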
Potential projects include
- Compare approximate inference algorithms for LDA, including variational inference (Blei et al. 2003), collapsed Gibbs sampling (Griffiths et al. 2004), and collapsed variational inference (Teh et al. 2006). You should compare them over simulated data by varying the corpus generation parameters --- number of topics, size of vocabulary, document length, etc. --- in addition to comparison over several real-world datasets. Code can be found for variational inference and Gibbs sampling; you will need to implement the collapsed variational inference yourself.
- Implement one of the extensions of LDA listed below, or propose a new latent topic model that suits a data set in your area of interest.
- See the next topic, non-parametric Bayes, for more ideas about implementing hierarchical topic models (Blei et al. 2003) via the nested Chinese restaurant process, and for implementing a non-parametric topic model that automatically learns the number of topics via HDP (Teh et al. 2006). You can find code for the latter, so you should think of a bigger application, like applying HDP to one of LDA's extensions.
D. Blei. Probabilistic Models of Text and Images. PhD thesis, U.C. Berkeley, Division of Computer Science, 2004.
http://www.cs.princeton.edu/~blei/papers/Blei2004.pdf
Inference:
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
Griffiths, T, Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235 2004.
http://www.pnas.org/content/101/suppl.1/5228.full.pdf
Y.W. Teh, D. Newman and M. Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In NIPS 2006.
http://www.gatsby.ucl.ac.uk/~ywteh/research/inference/nips2006.pdf
Extensions:
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. The Author-Topic Model for authors and documents. In UAI 2004.
http://cocosci.berkeley.edu/tom/papers/author_topics_uai.pdf
D. Blei, J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems 21, 2007
http://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf
J. Boyd-Graber, D. Blei, and X. Zhu. A topic model for word sense disambiguation. In Empirical Methods in Natural Language Processing, 2007.
http://www.cs.princeton.edu/~blei/papers/Boyd-GraberBleiZhu2007.pdf
Wei Li and Andrew McCallum. Pachinko Allocation: Scalable Mixture Models of Topic Correlations. Submitted to the Journal of Machine Learning Research, (JMLR), 2008
http://www.cs.umass.edu/~mccallum/papers/pam08jmlrs.pdf
Application in Vision:
L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Comp. Vis. Patt. Recog. 2005.
http://vision.cs.princeton.edu/documents/Fei-FeiPerona2005.pdf
L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification . IEEE Intern. Conf. in Computer Vision (ICCV). 2007
http://vision.cs.princeton.edu/documents/CaoFei-Fei_ICCV2007.pdf
Application in Social Networks:
Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical Report UM-CS-2004-096, 2004.
http://www.cs.umass.edu/~mccallum/papers/art04tr.pdf
E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, Mixed Membership Model for Relational Data. JMLR 2008.
http://jmlr.csail.mit.edu/papers/volume9/airoldi08a/airoldi08a.pdf
Software:
Mark Steyvers and Tom Griffiths
Matlab Topic Modelling Toolbox.
http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm
David Blei
Latent Dirichlet allocation (LDA) in C.
http://www.cs.princeton.edu/~blei/lda-c/index.html
Topic E: Non-parametric Hierarchical Bayes and Dirichlet processes
Clustering is an important problem in machine learning in which the goal is to learn the latent groups (clusters) in the data. While parametric approaches to clustering require specifying the number of clusters, non-parametric approaches, like Dirichlet process mixture models (DPMs), can model a potentially countably infinite number of clusters. The DP provides a distribution over partitions of the data (i.e., clusterings) and can be used as a prior over the number of clusters. Posterior (MAP) inference can then be used to do automatic model selection, or a fully Bayesian approach can be used to integrate over all possible clusterings, weighted by their posterior probability, in future predictions. DPs have been widely used not only in simple clustering settings, but also to model (and learn from data) general structures like trees, grammars, and hierarchies, with interesting applications in information retrieval, natural language processing, vision, and biology. A minimal Chinese restaurant process sketch appears below.
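To make the "distribution over partitions" concrete, here is a minimal sketch of sampling a partition from the Chinese restaurant process, the predictive form of the DP over cluster assignments. The function name and the concentration value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def chinese_restaurant_process(n_customers, alpha):
    """Sample a partition of n_customers points from a CRP with concentration alpha.
    Each new customer joins an existing table with probability proportional to its size,
    or starts a new table with probability proportional to alpha."""
    assignments = [0]                 # first customer sits at table 0
    table_sizes = [1]
    for _ in range(1, n_customers):
        probs = np.array(table_sizes + [alpha], dtype=float)
        probs /= probs.sum()
        table = rng.choice(len(probs), p=probs)
        if table == len(table_sizes):
            table_sizes.append(1)     # open a new table
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments

print(chinese_restaurant_process(20, alpha=1.0))
```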
Potential projects include
- Compare inference algorithms for DPMs, including Gibbs sampling, collapsed Gibbs sampling (Neal 1998), variational inference (Blei 2004), search-based inference (Daume 2007), and sequential Monte Carlo (Mansinghka 2007). Code can be found for some of these; see below.
- Take an extension of LDA (see the previous topic) and use Dirichlet processes to automatically learn the number of topics. For an example, see how HDP was used to learn the number of topics in LDA (Teh 2006).
- Implement the hierarchical topic model using the nested Chinese restaurant process (Blei et al. 2003).
- Implement one of the papers listed below for applying non-parametric Bayes in an area of interest to you, to automatically learn latent structures (clusters, groups, trees, etc.), and compare the performance with a parametric model.
Overview/basics:
Dirichlet process, Chinese restaurant processes and all that. M. I. Jordan. Tutorial presentation at the NIPS Conference, 2005.
http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
Zoubin Ghahramani's UAI tutorial slides
http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
Yee Whye Teh. Dirichlet process, Tutorial and Practical Course. MLSS 2007
http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/mlss2007.pdf
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 2005.
http://www.cs.princeton.edu/~blei/papers/TehJordanBealBlei2004.pdf
Inference:
D. Blei and M. Jordan. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121–144, 2005.
http://www.cs.princeton.edu/~blei/papers/BleiJordan2004.pdf
Neal, R. M. (1998). "Markov chain sampling methods for Dirichlet process mixture models", Technical Report No. 9815, Dept. of Statistics, University of Toronto
http://www.cs.toronto.edu/~radford/ftp/mixmc.pdf
Hal Daume III. Fast search for Dirichlet process mixture models. Conference on AI and Statistics (2007)
http://www.cs.utah.edu/~hal/docs/daume07astar-dp.pdf
Vikash K. Mansinghka, Daniel M. Roy, Ryan Rifkin, Josh Tenenbaum. A-Class: A simple, online, parallelizable algorithm for probabilistic classification
http://www.stat.umn.edu/~aistat/proceedings/data/papers/040.pdf
Ian Porteous, Alex Ihler, Padhraic Smyth and Max Welling (2006). Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick-Breaking Representation. UAI 2006
http://www.ics.uci.edu/~welling/publications/papers/ddp_uai06_v8.pdf
Applications:
Modeling Documents and IR: D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Neural Information Processing Systems (NIPS) 16, 2003.
http://www.cs.princeton.edu/~blei/papers/BleiGriffithsJordanTenenbaum2003.pdf
NLP: P. Liang, S. Petrov, D. Klein, and M. Jordan. The infinite PCFG using Hierarchical Dirichlet processes. In Empirical Methods in Natural Language Processing, 2007
http://www.eecs.berkeley.edu/~pliang/papers/hdppcfg-emnlp2007.pdf
Haghighi, A. and Klein, D. (2007). Unsupervised co-reference resolution in a nonparametric Bayesian model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics
http://www.eecs.berkeley.edu/~aria42/pubs/acl07-hdp-coref.pdf
Vision: Sudderth, E., Torralba, A., Freeman, W., and Willsky, A. (2005). Describing visual scenes using transformed Dirichlet processes. In Advances in Neural Information Processing Systems 18
http://ssg.mit.edu/~esuddert/papers/nips05.pdf
J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros. Unsupervised Discovery of Visual Object Class Hierarchies. CVPR 2008
http://www.di.ens.fr/~russell/papers/Sivic08.pdf
Information Integration: Robert Hall, Charles Sutton, Andrew McCallum. Unsupervised Deduplication using Cross-field Dependencies. KDD 2008
http://www.cs.umass.edu/~mccallum/papers/kdd289-hall.pdf
Software:
Y.W. Teh. Nonparametric Bayesian Mixture Models - release 2.1.
http://www.gatsby.ucl.ac.uk/~ywteh/research/software.html
Hal Daume III . Fast search for Dirichlet process mixture models
http://www.cs.utah.edu/~hal/DPsearch/
Kenichi Kurihara. Variational Dirichlet Process Gaussian Mixture Model
http://sato-www.cs.titech.ac.jp/kurihara/vdpmog.html
Topic F: Relational Models
Almost all of the machine learning and statistics methods you have studied assume that the data are independent or exchangeable. In many cases this is not true. For example, knowing the topic of a web page tells you something about the likely topics of pages linked to it. The independence assumption fails on most graph-structured data sets (relational databases, social networks, web pages).
Potential projects include
- Implementing a restricted case of Probabilistic Relational Models (e.g., no existence uncertainty) and comparing the performance against some baseline non-relational models.
- Implementing Relational Markov Networks and comparing the performance against some baseline non-relational models.
Learning Probabilistic Relational Models, L. Getoor, N. Friedman, D. Koller, A. Pfeffer. Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001
http://www.cs.umd.edu/~getoor/Publications/lprm-ch.ps
Discriminative Probabilistic Models for Relational Data, B. Taskar, P. Abbeel and D. Koller. Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02), Edmonton, Canada, August 2002.
http://www.cs.berkeley.edu/~taskar/pubs/rmn.ps
L. Liao, D. Fox, and H. Kautz. Location-Based Activity Recognition. in Proceedings of the Neural Information Processing Systems (NIPS), 2005.
http://www.cs.washington.edu/homes/liaolin/Research/nips2005.pdf
Razvan Bunescu and Raymond J. Mooney. Statistical Relational Learning for Natural Language Information Extraction. In Introduction to Statistical Relational Learning, Getoor, L. and Taskar, B. (Eds.), pp. 535-552, MIT Press, Cambridge, MA, 2007.
http://www.cs.utexas.edu/users/ml/papers/srl-submitted-05.pdf
Hoifung Poon and Pedro Domingos. Joint Unsupervised Coreference Resolution with Markov Logic. EMNLP 2008.
http://www.cs.washington.edu/homes/hoifung/papers/poon08b.pdf
Topic G: Hybrid Bayesian Networks
Many real systems contain a combination of discrete and continuous variables, which can be modeled as a hybrid BN. Potential projects include
- Compare inference algorithms for hybrid BNs against those that first discretize all the continuous variables and then just use the standard algorithms (variable elimination, junction trees).
K&F Chapter 14
Hybrid Bayesian Networks for Reasoning about Complex Systems, Uri N. Lerner. Ph.D. Thesis, Stanford University, October 2002.
http://ai.stanford.edu/~uri/Papers/thesis.ps.gz
Topic H: Influence Diagrams
A Bayesian network models a part of the world, but not decisions taken by agents nor the effect that these decisions can have upon the world. Influence diagrams extend Bayesian networks with nodes that represent actions an agent can take, the costs and utilities of actions, and most importantly the relationships between them.
In the multiagent setting, finding a Nash equilibrium is hard, but graphical models provide a framework for recursively decomposing the problem (opening up the possibility of a dynamic programming approach). Dynamic programming algorithms like NashProp (Kearns and Ortiz, 2002) are closely related to belief propagation.
Projects include
- Implementing algorithms for selecting a good or optimal strategy in the single-agent case (K&F 22)
- Finding Nash equilibria in multiplayer games (Koller & Milch, 2003)
K&F Chapter 22
D. Koller and B. Milch (2003). "Multi-Agent Influence Diagrams for Representing and Solving Games." Games and Economic Behavior, 45(1), 181-221. Full version of paper in IJCAI '03.
http://ai.stanford.edu/~koller/Papers/Koller+Milch:GEB03.pdf
Nash Propagation for Loopy Graphical Games. M. Kearns and L. Ortiz. Proceedings of NIPS 2002.
http://www.cis.upenn.edu/~mkearns/papers/nashprop.pdf
Multiagent Planning with Factored MDPs;
Carlos Guestrin, Daphne Koller and Ronald Parr;
In Advances in Neural Information Processing Systems (NIPS 2001), pp. 1523 - 1530, Vancouver, Canada, December 2001.
http://www.cs.cmu.edu/~guestrin/Publications/NIPS2001MultiAgents/nips01-multiagents.ps.gz
Planning Under Uncertainty in Complex Structured Environments;
Carlos Guestrin;
Ph.D. Dissertation, Computer Science Department, Stanford University, August 2003.
http://www.cs.cmu.edu/~guestrin/Publications/Thesis/thesis.pdf
Topic I: Max-margin Graphical Models
Typically the parameters of a graphical model are learned by maximum likelihood or maximum a posteriori estimation. An alternative criterion is to maximize the margin between classes, which can be thought of as combining graphical models (to represent structured relationships between inputs and outputs) with kernel methods. An example of a domain where this approach works well is handwriting recognition, where the structure encodes the fact that knowing the previous letter tells you something about what the next letter is likely to be. A standard statement of the underlying objective appears after the project list. Projects include
- Compare max-margin to likelihood-based methods (e.g., character recognition, part-of-speech tagging)
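For reference, one common way to write the max-margin estimation problem is the margin-rescaled structured hinge loss used in max-margin Markov networks and structured SVMs (a generic formulation, not a claim about any particular implementation):

$$ \min_{w}\; \tfrac{1}{2}\lVert w\rVert^2 \;+\; C \sum_i \max_{y}\Big[\Delta(y_i, y) + w^\top f(x_i, y) - w^\top f(x_i, y_i)\Big] $$

Here f(x, y) are features that decompose over the graphical model, Delta(y_i, y) is a label loss such as Hamming distance, and the inner maximization is a loss-augmented MAP inference problem, solvable with the same machinery used for prediction (see Taskar et al. below for the exact formulation used there).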
Max-Margin Markov Networks, B. Taskar, C. Guestrin and D. Koller. Neural Information Processing Systems Conference (NIPS03), Vancouver, Canada, December 2003.
http://www.cs.berkeley.edu/~taskar/pubs/mmmn.ps
Taskar's thesis:
http://www.cs.berkeley.edu/~taskar/pubs/thesis.pdf
Topic J: Active Learning / Value of Information
Active learning refers to algorithms where the learner has some influence on which samples it sees. For example, say you can perform 5 tests on a patient, out of a panel of 60 tests. Given an existing model of patients, which ones do you pick? What about the sequential case, where you consider the result of each test before choosing the next one? A toy greedy test-selection sketch appears after this list. Possible projects include
- Apply active learning to activity modelling or sensor networks (which sensor should you sample from).
- Compare optimization criteria (e.g., experimental design criteria) [CITE]
- Active learning that models parameter uncertainty.
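A toy sketch of the test-selection problem from the paragraph above, under the simplifying assumption that test outcomes are conditionally independent given the disease (a naive Bayes patient model); all names and numbers are illustrative. Because no outcomes are observed between picks, this non-adaptive greedy reduces to ranking tests by individual mutual information with the disease; the sequential case would update the posterior over D after each observed outcome.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def mutual_information(prior_d, test_lik):
    """I(D; T) for a binary test T with P(T=1 | D=d) = test_lik[d]."""
    p_pos = (prior_d * test_lik).sum()
    h_t = entropy(np.array([p_pos, 1 - p_pos]))
    h_t_given_d = sum(pd * entropy(np.array([q, 1 - q])) for pd, q in zip(prior_d, test_lik))
    return h_t - h_t_given_d

def greedy_test_selection(prior_d, lik, budget):
    """Pick `budget` tests, each time taking the remaining test with the largest I(D; T)."""
    remaining, chosen = list(range(lik.shape[0])), []
    for _ in range(budget):
        gains = [mutual_information(prior_d, lik[j]) for j in remaining]
        best = remaining[int(np.argmax(gains))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy example: 3 diseases, 6 candidate binary tests with made-up sensitivities.
rng = np.random.default_rng(0)
prior_d = np.array([0.5, 0.3, 0.2])
lik = rng.uniform(0.05, 0.95, size=(6, 3))   # lik[j, d] = P(test j positive | disease d)
print(greedy_test_selection(prior_d, lik, budget=3))
```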
A. Krause, C. Guestrin. "Near-optimal Nonmyopic Value of Information in Graphical Models". Proc. of Uncertainty in Artificial Intelligence (UAI), 2005
http://www.cs.cmu.edu/~krausea/files/05nearoptimal.pdf
A. Krause, C. Guestrin. "Optimal Value of Information in Graphical Models - Efficient Algorithms and Theoretical Limits". Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2005
http://www.cs.cmu.edu/~krausea/files/05optimal.pdf
Anderson, B. and Moore, A.
Fast Information Value for Graphical Models
In Neural Information Processing Systems, 2005.
http://www.cs.cmu.edu/~brigham/papers/nips2005.pdf
Active Learning: Theory and Applications. Simon Tong. Stanford University 2001.
http://www.robotics.stanford.edu/~stong/papers/tong_thesis.pdf
Topic K: Modeling Text and Images
Images are often annotated with text, such as captions or tags, which can be viewed as an additional source of information when clustering images or building topic models. For example, a green patch might indicate that there is a plant in the image, until one reads the caption "man in a green shirt". A related problem (Carbonetto et al. 2004) is data association, linking words to segmented objects in an image. For example, if the caption contains the words boat and sea, we would like to be able to associate these words with the segment(s) of the image corresponding to the boat and the sea.
References
D. Blei and M. Jordan. Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 127–134.
http://www.cs.princeton.edu/~blei/papers/BleiJordan2003.pdf
Peter Carbonetto, Nando de Freitas and Kobus Barnard.
A Statistical Model for General Contextual Object Recognition. ECCV 2004
http://www.cs.ubc.ca/~nando/papers/mrftranstwo.pdf
http://vis-www.cs.umass.edu/~vidit/peopleLDA/
Topic L: 2D CRFs for Visual Texture Classification
Some possibly useful resources:
- labeled training data
- C++ graphcuts code for approximate inference
- Kevin Murphy's Matlab CRF code
- Carl Rasmussen's Matlab conjugate gradient minimizer (better than using netlab or the Matlab optimization toolbox)
- Intro to CRFs by Hanna Wallach
- Maxent page, includes code
- Steerable pyramid Matlab code, a possibly useful set of image features
- Matlab wavelet toolbox, a possibly useful set of image features
- Paper on CRFs for sign detection, J. Weinman, 2004
- Markov Random Field Modeling in Computer Vision, S. Z. Li, 1995. (I have a hardcopy of the 2001 edition.)
- G. Winkler, "Image Analysis, Random Fields, and MCMC Methods", 2nd edition, 2003.
- Markov random fields and images, P. Perez. CWI Quarterly, 11(4):413-437, 1998. Review article.
2D CRFs for satellite image classification
The goal of this project is to classify pixels in satellite image data into classes like field vs road vs forest, using MRFs/CRFs (see above), or some other technique. Some possibly useful links:
- Fully Bayesian Image Segmentation -- an Engineering Perspective, Morris et al., 1996.
- A binary tree-structured MRF model for multispectral satellite image segmentation, 2003
Topic M: MAP-MRF Inference via Graph Cuts
Recent work has shown that for a particular class of pairwise potentials (loosely related to submodular functions), MAP inference in MRFs can be computed exactly via graph cuts, for which max-flow based polynomial-time algorithms exist. A minimal binary-label min-cut sketch appears after this list. Possible goals for this project include:
- Implement multi-label MAP inference via alpha-expansion or alpha-beta swap
- Compare graph cuts based approximation to Loopy BP, and Generalized BP on standard datasets
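As a minimal illustration of the construction, here is exact MAP for a binary MRF with non-negative attractive (hence submodular) pairwise terms, solved as an s-t minimum cut. The graph construction follows the standard recipe in the papers below, but the function name, the networkx-based solver, and the toy numbers are illustrative assumptions; a serious implementation would use a dedicated max-flow/min-cut code.

```python
import networkx as nx
import numpy as np

def binary_map_via_mincut(unary, pairwise_weight, edges):
    """Exact MAP labeling for a binary MRF with energy
    E(x) = sum_i unary[i, x_i] + sum_{(i,j) in edges} pairwise_weight * [x_i != x_j],
    assuming unary >= 0 and pairwise_weight >= 0 (submodular), via an s-t minimum cut."""
    g = nx.DiGraph()
    n = unary.shape[0]
    for i in range(n):
        # Source side of the cut means x_i = 0, sink side means x_i = 1.
        g.add_edge('s', i, capacity=float(unary[i, 1]))    # cut iff x_i = 1, pays unary[i, 1]
        g.add_edge(i, 't', capacity=float(unary[i, 0]))    # cut iff x_i = 0, pays unary[i, 0]
    for i, j in edges:
        g.add_edge(i, j, capacity=float(pairwise_weight))  # cut iff x_i = 0 and x_j = 1
        g.add_edge(j, i, capacity=float(pairwise_weight))  # cut iff x_j = 0 and x_i = 1
    cut_value, (source_side, _) = nx.minimum_cut(g, 's', 't')
    labels = np.array([0 if i in source_side else 1 for i in range(n)])
    return labels, cut_value                               # cut value equals the minimum energy

# Toy 4-pixel chain with made-up unary costs.
unary = np.array([[0.2, 1.5], [0.3, 1.0], [1.2, 0.1], [1.4, 0.2]])
print(binary_map_via_mincut(unary, pairwise_weight=0.5, edges=[(0, 1), (1, 2), (2, 3)]))
```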
Fast Approximate Energy Minimization via Graph Cuts, Yuri Boykov, Olga Veksler and Ramin Zabih. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(11), November 2001.
What Energy Functions can be Minimized via Graph Cuts?, Vladimir Kolmogorov and Ramin Zabih. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, February 2004.
A Comparative Study of Energy Minimization Methods for Markov Random Fields. Rick Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother.
Datasets
Below are a number of data sets that could be used for your project. If you want to use a data set that is not on the list, it is strongly advised that you talk to either a TA or the instructor before submitting your initial proposal.
Thanks to Dieter Fox, Andreas Krause, Lin Liao, Einat Minkov, Francisco Pereira, Sam Roweis, and Ben Taskar for donating data sets.
Data A: Functional MRI
Functional MRI (fMRI) measures brain activation over time, which allows one to measure changes as an activity is performed (e.g., looking at a picture of a cat vs. looking at a picture of a chair). Tasks using this data are typically of the form "predict cognitive state given fMRI data". fMRI data is both temporal and spatial: each voxel contains a time series, and each voxel is correlated with the voxels near it.
http://multivac.ml.cmu.edu/10708
Data B: Corel Image Data
Images featurized by color histogram, color histogram layout, color moments, and co-occurrence texture. Useful for projects on image segmentation, especially since there is a large benchmark repository available.
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Data C: Twenty Newsgroups
This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html
Data D: Sensor Network Data
Using this 54-node sensor network deployment, we collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th, 2004.
http://www-2.cs.cmu.edu/~guestrin/Research/Data/
This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low. Additional data for an intelligent lighting network, which includes link quality information between pairs of sensors, is available at
http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects/lightsensor.zip
Ideas for projects include
· Learn graphical models representing the correlations between measurements at different nodes
· Develop new distributed algorithms for solving a learning task on this data
References:
http://www-2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf
http://www-2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf
Data E: arXiv Preprints
A collection of preprints in the field of high-energy physics. Includes the raw LaTeX source of each paper (so you can extract either structured sentences or a bag-of-words) along with the graph of citations between papers.
http://www.cs.cornell.edu/projects/kddcup/datasets.html
Data F: TRECVID
A competition for multimedia information retrieval. They keep a fairly large archive of video data sets, along with featurizations of the data.
http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html
Data G: Activity Modelling
Activity modelling is the task of inferring what the user is doing from observations (e.g., motion sensors, microphones). This data set consists of GPS motion data for two subjects, tagged with labels like car, working, athome, shopping.
http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects/gps-labels.zip
An example of a DBN model for this problem is
A. Subramanya, A. Raj, J. Bilmes, and D. Fox.
Recognizing Activities and Spatial Context Using Wearable Sensors
(UAI-2006)
http://www.cs.washington.edu/homes/fox/abstracts/gps-msb-uai-06.abstract.html
Data H: WebKB
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www-2.cs.cmu.edu/~webkb/
Ideas for projects: learning classifiers to predict the type of webpage from the text, using web structure to improve page classification.
Data I: Record Deduplication
The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known by the varied names of deduplication, identity uncertainty, and record linkage.
http://www.cs.utexas.edu/users/ml/riddle/data.html
One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique". Some papers on record deduplication include
www.isi.edu/info-agents/papers/tejada01-is.pdf
http://www.cs.cmu.edu/~pradeepr/papers/kdd03.pdf
Data J: Enron e-mail
Consists of ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.
http://www.cs.cmu.edu/~enron/
http://www.cs.cmu.edu/~einat/datasets.html
Data K: Internet Movie Database
The Internet Movie Database makes their data publicly available, with certain usage restrictions. It contains tables and links relating movies, actors, directors, box office grosses, and much more. Various slices of the data have been used extensively in research on relational models.
http://www.imdb.com/interface
Data L: Netflix
Netflix is running a competition for movie recommendation algorithms. They've released a dataset of 100M ratings from 480K randomly selected users over 17K titles. The data set, and contest details, are available at
http://www.netflixprize.com
A much smaller (but more widely used) movie rating data set is Movielens
http://www.grouplens.org/
Data M: NIPS Corpus
A data set based on papers from a machine learning conference (NIPS volumes 1-12). The data can be viewed as a tripartite graph on authors, papers, and words. Links represent authorship and the words used in a paper. Additionally, papers are tagged with topics and we know which year each paper was written. Potential projects include authorship prediction, document clustering, and topic tracking.
http://www.cs.toronto.edu/~roweis/data.html
Data N: Character recognition (digits)
Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words. (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
Data O: Precipitation Data
This dataset includes 45 years of daily precipitation data from the Northwestern US. Ideas for projects include predicting rain levels, deciding where to place sensors to best predict rainfall, or active learning in fixed sensor networks.
UC Irvine has a repository that could be useful for your project. Many of these data sets have been used extensively in graphical models research.
http://www.ics.uci.edu/~mlearn/MLRepository.html
Sam Roweis also has a link to several datasets (most ready for use in Matlab):
http://www.cs.toronto.edu/~roweis/data.html