Probabilistic Graphical Models

10-708, Fall 2007

School of Computer Science, Carnegie-Mellon University


Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set.  Projects can be done individually or in teams of two to three students.   Each project will also be assigned a 708 instructor as a project consultant/mentor, who will advise you on your ideas, but the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 30% of your final class grade, and will have two final deliverables:

1.      a writeup in the form of a NIPS paper (8 pages maximum in NIPS format, including references), due Dec 3, worth 60% of the project grade, and

2.      a poster presenting your work for a special ML class poster session at the end of the semester, due Nov 30, worth 20% of the project grade. 

In addition, you must turn in a midway progress report (5 pages maximum in NIPS format, including references) describing the results of your first experiments by Oct 31, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.

 

Project Proposal:
 

You must turn in a brief project proposal (1-page maximum) by Oct 10th. 

You are encouraged to come up with a topic directly related to your own current research, or a graphical-models topic of your own interest that bears a non-trivial technical component (either theoretical or application-oriented). However, the proposed work must be new and must not be copied from your previous published or unpublished work. For example, research on graphical models that you did this summer does not count as a class project. 

You may use the list of available datasets provided below and pick a “less adventurous” project from the following list of potential project ideas.  These data sets have been successfully used for machine learning in the past, so you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list using the provided datasets. 

Project proposal format:  Proposals should be one page maximum.  Include the following information:

·         Project title

·         Project idea.  This should be approximately two paragraphs.

·         Software you will need to write.

·         Papers to read.  Include 1-3 relevant papers.  You will probably want to read at least one of them before submitting your proposal.

·         Teammate(s): will you have teammate(s)?  If so, whom?  Maximum team size is three students.

·         Oct 31 milestone: What will you complete by Oct 31?  Experimental results of some kind are expected here.

 


Project suggestions: 

·        Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using graphical models. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.   

You can also find some project ideas below.


 


Project A: Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects. 
Available software: Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).


Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include (5,000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004; "Bayesian Network Classifiers," Friedman et al., 1997.
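
The heart of TAN learning is a Chow-Liu-style step: compute the conditional mutual information between every pair of features given the class, then take a maximum-weight spanning tree over the features. Below is a minimal sketch of that structure-learning step in Python/NumPy, assuming you have already selected and discretized the voxel features (the pairwise step is quadratic in the number of features, so running it on all 5,000 voxels directly would be expensive). All names are illustrative, not part of any provided codebase.

    import numpy as np

    def cond_mutual_info(xi, xj, c, n_bins, n_classes):
        """I(Xi; Xj | C), estimated from empirical counts of discretized features."""
        mi = 0.0
        for k in range(n_classes):
            mask = (c == k)
            if mask.sum() == 0:
                continue
            pc = mask.mean()
            joint = np.zeros((n_bins, n_bins))
            for a, b in zip(xi[mask], xj[mask]):
                joint[a, b] += 1
            joint /= joint.sum()
            pi_ = joint.sum(axis=1, keepdims=True)   # p(xi | c)
            pj_ = joint.sum(axis=0, keepdims=True)   # p(xj | c)
            nz = joint > 0
            mi += pc * (joint[nz] * np.log(joint[nz] / (pi_ @ pj_)[nz])).sum()
        return mi

    def tan_tree(X, y, n_bins, n_classes):
        """Maximum-weight spanning tree over features (Prim), weighted by I(Xi;Xj|C)."""
        d = X.shape[1]
        w = np.zeros((d, d))
        for i in range(d):
            for j in range(i + 1, d):
                w[i, j] = w[j, i] = cond_mutual_info(X[:, i], X[:, j], y, n_bins, n_classes)
        in_tree, edges = {0}, []
        while len(in_tree) < d:
            i, j = max(((i, j) for i in in_tree for j in range(d) if j not in in_tree),
                       key=lambda e: w[e])
            edges.append((i, j))   # feature j's extra parent (besides the class) is feature i
            in_tree.add(j)
        return edges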



Project B: Image Segmentation Dataset


The goal is to segment images in a meaningful way.  Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations).   Two hundred of these images are training images, and the remaining one hundred are test images.  The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions.  It also includes code for a segmentation example.  This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
http://www.cs.berkeley.edu/projects/vision/grouping/segbench/

Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture.  The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions.  One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments.  Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from Berkeley are available here
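
As a concrete starting point, the sketch below shows the kind of smoothness potential mentioned above: a Potts penalty between neighboring labels on a pixel grid, with simple ICM inference. It assumes per-pixel unary costs coming from some region-statistics classifier; everything here is illustrative rather than tied to the Berkeley benchmark code, and the same idea carries over to a superpixel graph.

    import numpy as np

    def icm_segment(unary, beta, n_iters=10):
        """unary: (H, W, K) per-pixel label costs; beta: smoothness strength."""
        H, W, K = unary.shape
        labels = unary.argmin(axis=2)               # independent initialization
        for _ in range(n_iters):
            for y in range(H):
                for x in range(W):
                    costs = unary[y, x].astype(float)
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W:
                            # Potts penalty: pay beta for each disagreeing neighbor
                            costs += beta * (np.arange(K) != labels[ny, nx])
                    labels[y, x] = costs.argmin()
        return labels

Learning beta (or a full contrast-sensitive potential) from the hand-segmentations, and replacing ICM with a stronger inference method, are exactly the kinds of extensions this project is after.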



Project C: Twenty Newsgroups text data

This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles.  For documentation and download, see this website.  This data is useful for a variety of text classification and/or clustering projects.  The "label" of each article is the newsgroup it belongs to.  The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Available software: The same website provides an implementation of a Naive Bayes classifier for this text data.  The code is quite robust, and some documentation is available, but it is difficult code to modify.

Project ideas:
 

·         EM text classification in the case where you have labels for some documents, but not for others (see McCallum et al., and come up with your own suggestions). A minimal EM sketch for this semi-supervised setting appears below.
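
The sketch below shows the core EM loop for semi-supervised Naive Bayes, in the spirit of the McCallum et al. line of work: labeled documents keep their class fixed, and unlabeled documents contribute fractionally through posterior responsibilities. X is a dense document-by-word count matrix purely for brevity, and all names are illustrative.

    import numpy as np

    def nb_em(X, y, n_classes, n_iters=20, alpha=1e-2):
        """y[i] is the class of document i, or -1 if unlabeled."""
        n, v = X.shape
        resp = np.full((n, n_classes), 1.0 / n_classes)
        labeled = y >= 0
        resp[labeled] = 0.0
        resp[labeled, y[labeled]] = 1.0
        for _ in range(n_iters):
            # M-step: class priors and per-class word distributions (smoothed)
            prior = (resp.sum(axis=0) + alpha) / (n + alpha * n_classes)
            word = resp.T @ X + alpha                        # (K, V) expected counts
            word /= word.sum(axis=1, keepdims=True)
            # E-step: recompute responsibilities for the unlabeled documents only
            logp = X @ np.log(word).T + np.log(prior)        # (N, K) log-joints
            logp -= logp.max(axis=1, keepdims=True)
            post = np.exp(logp)
            post /= post.sum(axis=1, keepdims=True)
            resp[~labeled] = post[~labeled]
        return prior, word, resp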
 


Project D: Sensor network data

A 54-node sensor network collected temperature, humidity, and light data, along with the voltage level of the batteries at each node. The data was collected every 30 seconds, starting around 1am on February 28th, 2004.

http://www-2.cs.cmu.edu/~guestrin/Research/Data/

This is a real dataset, with lots of missing data, noise, and failed sensors giving outlier values, especially when battery levels are low.

Project ideas:

·         Learn graphical models representing the correlations between measurements at different nodes (a small sketch of one starting point appears after this list)

·         Develop new distributed algorithms for solving a learning task on this data
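
One simple way to start on the first idea is to estimate partial correlations between sensors from the inverse sample covariance and keep the strong edges; the L1-regularized structure learning in the paper listed below is the more principled version of the same idea. This is only a sketch, with illustrative names, and it sidesteps the missing data and outliers mentioned above.

    import numpy as np

    def partial_correlation_graph(X, threshold=0.1):
        """X: (n_samples, n_sensors) readings with missing rows already dropped."""
        prec = np.linalg.inv(np.cov(X, rowvar=False))        # precision matrix
        d = np.sqrt(np.diag(prec))
        pcorr = -prec / np.outer(d, d)                       # partial correlations
        np.fill_diagonal(pcorr, 1.0)
        edges = [(i, j) for i in range(X.shape[1]) for j in range(i + 1, X.shape[1])
                 if abs(pcorr[i, j]) > threshold]
        return pcorr, edges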

Papers:

·         http://www-2.cs.cmu.edu/~guestrin/Publications/IPSN2004/ipsn2004.pdf

·         http://www-2.cs.cmu.edu/~guestrin/Publications/VLDB04/vldb04.pdf

·         Efficient Structure Learning of Markov Networks using L1-Regularization



Project E: Character recognition (digits) data

Optical character recognition, and the simpler digit recognition task, have been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words (note that the first letter of each word was removed, since these were capital letters that would make the task harder for you):

http://ai.stanford.edu/~btaskar/ocr/

Project suggestion:

·         Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.) A Viterbi decoding sketch appears below.
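
A minimal Viterbi decoding sketch for this idea: states are letters, the emission scores come from whatever per-letter classifier you train, and the transition matrix is estimated from letter bigrams in the word list. All inputs are assumed to already be in log space; names are illustrative.

    import numpy as np

    def viterbi(log_emit, log_trans, log_init):
        """log_emit: (T, K) per-position letter log-likelihoods;
        log_trans: (K, K) letter-bigram log-probs; log_init: (K,) first-letter log-probs."""
        T, K = log_emit.shape
        delta = log_init + log_emit[0]
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans          # (K, K): previous -> current
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_emit[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]                                # most likely letter sequence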

 


Project F: Precipitation data

This dataset includes 45 years of daily precipitation data from the Northwest of the US:

http://www.jisao.washington.edu/data_sets/widmann/

Project ideas:

·         Weather prediction: Learn a probabilistic model to predict rain levels

 

 


Project G: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/
 

Project ideas:

·         Assign labels to the documents using both content and link information. You could use a CRF-like model where the hidden variables are the class labels of the web pages and the observed variables are the words in each web page. The undirected edges between the labels are given by the hyperlink structure, with direction ignored. A loopy belief propagation sketch for this kind of model appears below.
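
A minimal sketch of sum-product loopy belief propagation on such a model: one node per page with a unary potential from its words (e.g., Naive Bayes class scores) and a shared compatibility matrix on each undirected hyperlink edge. The potentials are assumed given here; learning them is part of the project, and the names are illustrative.

    import numpy as np

    def loopy_bp(unary, edges, compat, n_iters=30):
        """unary: {node: (K,) potential}; edges: list of (u, v); compat: (K, K)."""
        msgs = {(u, v): np.ones(len(compat))
                for (a, b) in edges for (u, v) in ((a, b), (b, a))}
        nbrs = {}
        for a, b in edges:
            nbrs.setdefault(a, []).append(b)
            nbrs.setdefault(b, []).append(a)
        for _ in range(n_iters):
            for (u, v) in msgs:
                # product of u's unary and all incoming messages except the one from v
                belief = unary[u].copy()
                for w in nbrs[u]:
                    if w != v:
                        belief *= msgs[(w, u)]
                m = compat.T @ belief                  # sum over u's states
                msgs[(u, v)] = m / m.sum()
        marg = {}
        for u in unary:
            b = unary[u].copy()
            for w in nbrs.get(u, []):
                b *= msgs[(w, u)]
            marg[u] = b / b.sum()                      # approximate label marginals
        return marg

A compatibility matrix with large diagonal entries encodes the intuition that linked pages tend to share labels; whether that holds for WebKB (student pages often link to professor pages rather than to other student pages) is itself worth examining.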

Papers:

·         http://www-2.cs.cmu.edu/~webkb/

·         http://www.cs.berkeley.edu/~taskar/pubs/rmn.ps


Project H: Electoral Campaign Contribution data


The dataset provided below is compiled from the Federal Election Commission (http://www.fec.gov/finance/disclosure/ftpdet.shtml) and contains information about federal electoral campaign contributions from elections between 1980 and 2006. There are 3 types of entities: Donors, Committees, and Candidates. Donors contribute money to committees, and committees then give money to candidates. Donors are individuals, like Harry Q. Bovik or Ben Roethlisberger. Committees are organizations, and may be devoted to a single candidate or several candidates. For instance, a committee might be CMU Students for Ron Paul, or the Machine Learning Researchers for Political Action. Candidates are registered candidates for any federal election: Senate, House, or Presidential.

http://www.cs.cmu.edu/~mmcgloho/local/data/fec_data.html

The indices for all three entities list name and address data, with several additional fields. Donors also have a listed occupation. Committees have data pertaining to each committee's interest. The index for candidates also includes information on party and election status. Full lists of features may be found in the readme.


Project ideas:

 

 




Project I: Deduplication


The datasets provided below comprise lists of records, and the goal is to identify, for any dataset, the set of records that refer to unique entities. This problem is known by the varied names of Deduplication, Identity Uncertainty, and Record Linkage.

http://www.cs.utexas.edu/users/ml/riddle/data.html

Project Ideas:

Papers:



Project J: Email Annotation

The datasets provided below are sets of emails. The goal is to identify which parts of an email refer to a person's name. This task is an example of the general problem area of Information Extraction.

http://www.cs.cmu.edu/~einat/datasets.html

Project Ideas:

Papers: http://www.cs.cmu.edu/~einat/email-2004.pdf



Project K: Inference


Comparing approximate inference for Ising models:

Ising models are discrete-state 2D grid-structured MRFs with pairwise potentials. Many models (Bayes nets, Markov nets, factor graphs) can be converted into this form. Exact inference is intractable, so people have tried various approximations, such as mean field, loopy belief propagation (BP), generalized belief propagation, Gibbs sampling, Rao-Blackwellised MCMC, Swendsen-Wang, graph cuts, etc.
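
To make the comparison concrete, here is a minimal sketch of one of the listed methods, a Gibbs sampler for an Ising model with spins in {-1, +1}, scalar coupling J, and local fields h, under the energy E(s) = -J * sum of s_i*s_j over neighbors - sum of h_i*s_i. A uniform interface like this one is what the project would wrap around each method.

    import numpy as np

    def gibbs_ising(h, J, n_sweeps=100, rng=None):
        """h: (H, W) local fields; J: coupling between 4-neighbors. Returns a sample."""
        rng = rng or np.random.default_rng(0)
        H, W = h.shape
        s = rng.choice([-1, 1], size=(H, W))
        for _ in range(n_sweeps):
            for y in range(H):
                for x in range(W):
                    nb = sum(s[y + dy, x + dx]
                             for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                             if 0 <= y + dy < H and 0 <= x + dx < W)
                    # conditional: p(s=+1 | neighbors) = sigmoid(2*(J*nb + h))
                    p = 1.0 / (1.0 + np.exp(-2.0 * (J * nb + h[y, x])))
                    s[y, x] = 1 if rng.random() < p else -1
        return s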

The goal of this project is to empirically compare these methods on some MRF models (using other people's code), and to make a uniform Matlab interface to all the functions (so they can be interchanged in a plug-and-play fashion). To test, you can use an MRF with random parameters, but it would be better to team up with someone who is trying to learn MRF parameters from real data (see below).

The C++ code (with a Matlab wrapper) for mean field, loopy BP, generalized BP, Gibbs sampling, and Swendsen-Wang is available from here. Code for RB-MCMC can be obtained from Firas Hamze or Nando de Freitas. C++ graph cuts code is available (without a Matlab interface) here.

Some related papers you should read first:

·         Comparing the mean field method and belief propagation for approximate inference in MRFs, Yair Weiss, 2001.

·         Comparison of Graph Cuts with Belief Propagation for Stereo, using Identical MRF Parameters, ICCV 2003. (The author has C code available.)

·         Tutorial on approximate inference, Frey and Jojic, PAMI 2004.


Comparing message-passing schedules for Belief Propagation:

The goal of this project is to compare the effects of different message-passing schedules on the results of loopy belief propagation. A natural first milestone would be to recreate the results of the paper Residual Belief Propagation: Informed Scheduling for Asynchronous Message Passing.
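
The core of residual scheduling is easy to state: for each directed message, keep the distance between its current value and what recomputing it would give, and always apply the update with the largest residual. The sketch below shows that loop on a generic pairwise MRF; it is only the core idea, so see the paper for the full algorithm and its convergence analysis, and note that all names here are illustrative.

    import numpy as np

    def residual_bp(unary, edges, compat, n_updates=10000, tol=1e-6):
        nbrs = {}
        for a, b in edges:
            nbrs.setdefault(a, []).append(b)
            nbrs.setdefault(b, []).append(a)
        dirs = [(u, v) for a, b in edges for u, v in ((a, b), (b, a))]
        msgs = {e: np.full(len(compat), 1.0 / len(compat)) for e in dirs}

        def recompute(u, v):
            belief = unary[u].copy()
            for w in nbrs[u]:
                if w != v:
                    belief *= msgs[(w, u)]
            m = compat.T @ belief
            return m / m.sum()

        pending = {e: recompute(*e) for e in dirs}      # candidate updated messages
        for _ in range(n_updates):
            e = max(dirs, key=lambda d: np.abs(pending[d] - msgs[d]).max())
            if np.abs(pending[e] - msgs[e]).max() < tol:
                break                                   # all residuals tiny: converged
            msgs[e] = pending[e]
            u, v = e
            # only candidates that depend on m_{u->v}, i.e. messages out of v, change
            for w in nbrs[v]:
                if w != u:
                    pending[(v, w)] = recompute(v, w)
        return msgs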



Comparing variational learning, MCMC learning and IPF of Ising models on binary images:

Simple images, such as handwritten digits, can be represented by a grid of binary numbers, on which an Ising model can be defined. An IPF algorithm makes use of the junction tree algorithm to learn the model. In this project you are asked to plug in mean field or generalized mean field methods for inference in the learning process, and compare the outcome with that of IPF. See Yee Whye Teh's paper for the IPF methods and a description of the data and the problem. Since variational methods optimize a lower bound of the likelihood instead of the true likelihood, your results will reveal the consequences of such an approximation on learning, and may yield interesting theoretical insights.
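
For reference, the mean field fixed-point updates you would plug into the learning loop look like the sketch below (for +/-1 spins on a grid at unit temperature, under the usual Ising energy): each spin's mean is repeatedly updated from its neighbors' current means, and the pairwise statistics needed for learning are then approximated by products of means. This is a generic sketch, not the interface of any provided code.

    import numpy as np

    def mean_field_ising(h, J, n_sweeps=50):
        """Returns approximate means E[s_i] for +/-1 spins on an (H, W) grid."""
        H, W = h.shape
        m = np.zeros((H, W))
        for _ in range(n_sweeps):
            for y in range(H):
                for x in range(W):
                    nb = sum(m[y + dy, x + dx]
                             for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                             if 0 <= y + dy < H and 0 <= x + dx < W)
                    # mean field self-consistency: m_i = tanh(h_i + J * sum_j m_j)
                    m[y, x] = np.tanh(J * nb + h[y, x])
        return m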


Project L: MRFs and vision

2D CRFs for visual texture classification

Discriminative Fields for Modeling Spatial Dependencies in Natural Images is about applying 2D conditional random fields (CRFs) to classify image regions as containing "man-made building" or not, on the basis of texture. The goal of this project is to reproduce the results in the NIPS 2003 paper.

2D CRFs for satellite image classification

The goal of this project is to classify pixels in satellite image data into classes like field vs. road vs. forest, using MRFs/CRFs (see above) or some other technique.



Project M: Unsupervised part-of-speech tagging

Dataset: Brown Corpus http://dingo.sbs.arizona.edu/~hammond/ling696f-sp03/browncorpus.txt

Project ideas:



Project N: Video Tracking

Object tracking and trajectory modeling using a non-linear dynamic model based on HMM or state-space model (e.g., input-output HMM, factorial HMM, switching SSM)

The goal of this project is to reproduce the results in the following paper: Transformed hidden Markov models: Estimating mixture models of images and inferring spatial transformations in video sequences (CVPR 2000). Note that Brendan Frey has Matlab code for transformation-invariant EM on his home page. See also Real-time On-line Learning of Transformed Hidden Markov Models from Video, Nemanja Petrovic, Nebojsa Jojic, Brendan J. Frey, Thomas S. Huang, AISTATS 2003, which is 10,000 times faster!


Project O: Context-specific independence

We learned in class that CSI can speed up inference. In this project, you can explore this further. For example, implement the recursive conditioning approach of Adnan Darwiche, and compare it to variable elimination and clique trees. When is recursive conditioning faster? Can you find practical BNs where the speed-up is considerable? Can you learn such BNs from data?
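
As a reminder of what CSI buys you, the toy example below shows a tree-structured CPD in which, in the context A=0, the child is independent of B; inference that conditions on A=0 can therefore drop B from that part of the computation entirely. The variables and numbers are invented for illustration, and this sketches CSI itself rather than recursive conditioning.

    def p_child_given(a, b):
        """P(Child = 1 | A = a, B = b), written as a decision tree, not a full table."""
        if a == 0:
            return 0.9          # context A=0: Child is independent of B
        return 0.7 if b == 0 else 0.2

    # A full CPT over (A, B) needs 4 parameters; the tree needs only 3, and an
    # inference procedure that conditions on A=0 never has to sum over B here.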

Project P: More data

There are many other datasets out there. UC Irvine has a repository that could be useful for your project:

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has links to several datasets out there:

http://www.cs.toronto.edu/~roweis/data.html