10:601 Machine Learning class projects

Spring 2008



Course Project

Your class project is an opportunity for you to explore an interesting machine learning problem of your choice in the context of a real-world data set.  We recommend projects we done in teams of two students, but if you prefer to work alone you may propose to do so.  Each project will also be assigned a class instructor as a project consultant/mentor. Instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 25% of your final class grade, and will have 4 deliverables:

  1. Proposal: 1 page (10%) due by the end of the day on tuesday March 25, 2008
  2. Midway Report: 2-4 pages (20%) due April 14
  3. Poster Presentation: (30%) due April 30 
  4. Final Report: 5-8 pages (40%) due May 2

Note that all write-ups will be in the format of a NIPS paper. The page limits are strict! Papers over the limit will not be considered. 

You might like to use our class Wiki to suggest and look for projects and partners.

Project Proposal:

You must turn in a brief project proposal (1-page maximum).  Read the list of available data sets and potential project ideas below.  We highly recommend you use one of these data sets, because we know that they have been successfully used for machine learning in the past. If you have another data set you want to work on, you may propose that as well. However, if you propose a project involving another data set, your proposal must state explicitly that you already have the data in hand, and it must include a pointer to the data where we can access it. I t is also possible to propose a project on some theoretical aspects of machine learning.  Note that even though you can use data sets you have used before, you cannot use as class projects something that you started doing prior to the class.

Project proposal format:  Proposals should be one page maximum.  Include the following information:

Midway Report:

This should be a 3-4 pages short report, and it serves as a check-point. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections `under construction'. Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as `place-holders' for the results you plan/hope to obtain.

Grading scheme for the project report:

Final Report:

Your final report is expected to be a 8-page report. You should submit both an electronic and a hardcopy version for your final report. It should roughly have the following format:

Poster Presentation:

We will have all projects presenting a poster, on Project poster session (tentative): Thu May 1 4-6:30pm in the Newell-Simon Hall Atrium. At least one project member should be present during the poster hours. The session will be open to everybody.


Project Suggestions:


Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.


Project A: Brain imaging data (fMRI)

This data is available here

This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects. 
Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gassian Naive Bayes, SVM).


Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al, 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent), by training a Bayes network in particular a TAN tree [Friedman, et al., 1997]. Issues youll need to confront include which features to include (5000 voxels times 8 seconds of images is a lot of features) for classifier input, whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
Papers to read: "Learning to Decode Cognitive States from Brain Images," Mitchell et al., 2004, "Bayesian Network Classifiers" Friedman et al., 1997.



Project B: Image Segmentation Dataset


The goal is to segment images in a meaningful way.  Berkeleycollected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations).   Two-hundred of these images are training images, and the remaining 100 are test images.  The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions.  It also includes code for a segmentation example.  This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
http://www.cs.berkeley.edu/projects/vision/grouping/segbench/

Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture.  The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions.  One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments.  Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from Berkeley are available here



Project C: Twenty Newgroups text data

This data set contains 1000 text articles posted to each of 20 online newgroups, for a total of 20,000 articles.  For documentation and download, see this website.  This data is useful for a variety of text classification and/or clustering projects.  The "label" of each article is which of the 20 newsgroups it belongs to.  The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").

Available software: The same website provides an implementation of a Naive Bayes classifier for this text data.  The code is quite robust, and some documentation is available, but it is difficult code to modify.

Project ideas:
 

EM text classification in the case where you have labels for some documents, but not for others  (see McCallum et al, and come up with your own suggestions)
 


Project E: Character recognition (digits) data

Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)

http://ai.stanford.edu/~btaskar/ocr/

Project suggestion:


Project F: NBA statistics data

This download contains 2004-2005 NBA and ABA stats for:

-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records

Currently all of the regular season

Project idea:


Project G: Precipitation data

This dataset has includes 45 years of daily precipitation data from the Northwest of the US:

http://www.jisao.washington.edu/data_sets/widmann/

Project ideas:

Weather prediction: Learn a probabilistic model to predict rain levels

Sensor selection: Where should you place sensor to best predict rain  


Project H: WebKB

This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/
 

Project ideas:

Papers:


Project I: Deduplication

The datasets provided below comprise of lists of records, and the goal is to identify, for any dataset, the set of records which refer to unique entities. This problem is known
by the varied names of Deduplication, Identity Uncertainty and Record Linkage.

http://www.cs.utexas.edu/users/ml/riddle/data.html

Project Ideas:

Papers:

Project J: Email Annotation

The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.

http://www.cs.cmu.edu/~einat/datasets.html

Project Ideas:


Papers: http://www.cs.cmu.edu/~einat/email-2004.pdf


Project K: Netflix Prize Dataset

The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize

Project idea:


Project L: Physiological Data Modeling (bodymedia)

Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.


1. Which sensors correspond to each column?
¡¡
characteristic1 age
characteristic2 handedness
sensor1 gsr_low_average
sensor2 heat_flux_high_average
sensor3 near_body_temp_average
sensor4 pedometer
sensor5 skin_temp_average
sensor6 longitudinal_accelerometer_SAD
sensor7 longitudinal_accelerometer_average
sensor8 transverse_accelerometer_SAD
sensor9 transverse_accelerometer_average


2. What are the activities behind each annotation?

The annotations for the contest were:
5102 = sleep
3104 = watching TV

Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/

 

Project idea:


Project M: Object Recognition

The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
http://www.vision.caltech.edu/Image_Datasets/Caltech256/

Project ideas:


Project N: Learning POMDP structure so as to maximize utility


Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.


Project O: Learning partially observed MRFs: the Langevin algorithm


In the recently proposed exponential family harmonium model (Welling et. al., Xing et. al.), a constructive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et. al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm of which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help to get the learning process out of local optima. In this project you will implement the Langevin learning algorithm for Xings dual wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramanis paper of Bayesian learning of MRF for reference.


Project P: Context-specific independence

We learned in class that CSI can speed-up inference. In this project, you can explore this further. For example, implement the recursive conditioning approach of Adnan Darwiche, and compare it to variable elimination and clique trees. When is recursive conditioning faster? Can you find practical BNs where the speed-up is considerable? Can you learn such BNs from data?

Project Q: Enron E-mail Dataset

The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data


Project ideas:


Project R: More data

There are many other datasets out there. UC Irvine has a repository that could be useful for you project:

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets out there: http://www.cs.toronto.edu/~roweis/data.html