Machine Learning
10-701/15-781, Spring 2010
Eric Xing, Tom Mitchell, Aarti Singh
School of Computer Science, Carnegie Mellon University
Course Project
Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in teams of two to three students. Each project will also be assigned a 701 instructor as a project consultant/mentor. Instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 20% of your final class grade, and will have 4 deliverables:
- Proposal: 1 page (10%), Due: Feb 24th
- Midway Report: 3-4 pages (20%), Due: March 31st
- Final Report: 8 pages (40%), Due: Wednesday May 5th @ midnight by email to the instructor list
- Poster Presentation: (30%), on Tuesday May 4th (3-6pm), NSH Atrium
Note that all write-ups must be in the form of a NIPS paper. The page limits are strict! Papers over the limit will not be considered.
Project Proposal
You must turn in a brief project proposal (1-page maximum). Read the list of available data sets and potential project ideas below. We strongly recommend that you use one of these data sets, because we know that they have been successfully used for machine learning in the past. If you have another data set you want to work on, you can discuss it with us. However, we will not allow projects on data that has not been collected, so you have to work on existing data sets. It is also possible to propose a project on some theoretical aspects of machine learning. If you want to do this, please discuss it with us. Note that even though you can use data sets you have used before, you cannot use as a class project something that you started doing prior to the class.
Project proposal format: Proposals should be one page maximum. Include the following information:
- Project title
- Data set
- Project idea. This should be approximately two paragraphs.
- Software you will need to write.
- Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.
- Teammate: will you have a teammate? If so, whom? Maximum team size is two students. We expect projects done in a group to be more substantial than projects done individually.
- March 31st milestone: What will you complete by March 31st? Experimental results of some kind are expected here. You should also describe what portion of the project each partner will be doing.
Midway Report
This should be a short report of 3-4 pages, and it serves as a check-point. It should consist of the same sections as your final report (introduction, related work, method, experiment, conclusion), with a few sections "under construction". Specifically, the introduction and related work sections should be in their final form; the section on the proposed method should be almost finished; the sections on the experiments and conclusions will have whatever results you have obtained, as well as "place-holders" for the results you plan or hope to obtain.
Grading scheme for the project report:
- 70% for proposed method (should be almost finished)
- 25% for the design of upcoming experiments
- 5% for plan of activities (in an appendix, please show the old one and the revised one, along with the activities of each group member)
Final Report
Your final report is expected to be an 8-page report. You should submit both an electronic and a hardcopy version of your final report. It should roughly have the following format:
- Introduction - Motivation
- Problem definition
- Proposed method
- Intuition - why should it be better than the state of the art?
- Description of its algorithms
- Experiments
- Description of your testbed; list of questions your experiments are designed to answer
- Details of the experiments; observations
- Conclusions
Poster Presentation
All projects will present a poster at the project poster session: Tuesday, May 4th, 3:00pm-6:00pm at NSH Atrium. At least one project member should be present during the poster hours. The session will be open to everybody.
Project Suggestions:
Ideally, you will want to pick a problem in a domain of your interest, e.g., natural language parsing, DNA sequence analysis, text information retrieval, network mining, reinforcement learning, sensor networks, etc., and formulate your problem using machine learning techniques. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis. You can also find some project ideas below.
Project A1: Cognitive State Classification with Magnetoencephalography Data (MEG)
Data:
A zip file containing some example preprocessing of the data into features, along with some text file descriptions: LanguageFiles.zip
The raw time data (12 GB) for two subjects (DP/RG_mats) and the FFT data (DP/RG_avgPSD) is located at:
/afs/cs.cmu.edu/project/theo-23/meg_pilot
You should access this directly through AFS space.
This data set contains a time series of images of brain activation, measured using MEG. Human subjects viewed 60 different objects divided into 12 categories (tools, foods, animals, etc...). There are 8 presentations of each object, and each presentation lasts 3-4 seconds. Each second has hundreds of measurements from 300 sensors. The data is currently available for 2 different human subjects.
Project A: Building a cognitive state classifier
Project idea: We would like to build classifiers to distinguish between the different categories of objects (e.g. tools vs. foods) or even the objects themselves if possible (e.g. bear vs. cat). The exciting thing is that no one really knows how well this will work (or if it's even possible). This is because the data was only gathered a few weeks ago (Aug-Sept 08). One of the main challenges is figuring out how to make good features from the raw data. Should the raw data just be used? Or maybe it should first be passed through a low-pass filter? Perhaps an FFT should convert the time series to the frequency domain first? Should the features represent absolute sensor values or should they represent changes from some baseline? If so, what baseline? Another challenge is discovering what features are useful for what tasks. For example, the features that distinguish foods from animals may be different from those that distinguish tools from buildings. What are good ways to discover these features?
This project is more challenging and risky than the others because it is not known what the results will be. But this is also good because no one else knows either, meaning that a good result could lead to a possible publication.
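The feature questions above (baseline changes, low-pass filtering, frequency-domain features) can be made concrete with a small sketch. It is only an illustration on a synthetic single-sensor trace: the function names, the 8-sample pre-stimulus baseline, and the window width are assumptions, not part of the provided preprocessing, and the naive DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def subtract_baseline(signal, n_baseline):
    """Express samples as changes from the mean of the first n_baseline samples."""
    baseline = sum(signal[:n_baseline]) / n_baseline
    return [x - baseline for x in signal]

def moving_average(signal, width):
    """Crude low-pass filter: average each sample with its neighbours."""
    half = width // 2
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def power_spectrum(signal):
    """Power at each non-negative frequency bin, using a naive O(n^2) DFT."""
    n = len(signal)
    powers = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        powers.append(abs(coeff) ** 2 / n)
    return powers

# Synthetic trace (hypothetical): 8 pre-stimulus samples at a constant offset,
# then 32 samples of the same offset plus a sinusoid with 4 cycles.
pre = [5.0] * 8
post = [5.0 + math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
centered = subtract_baseline(pre + post, n_baseline=8)
features = power_spectrum(centered[8:])
peak_bin = max(range(len(features)), key=features.__getitem__)
```

Here `features` would be one candidate feature vector per sensor and trial; whether baseline-corrected spectra beat raw or smoothed time samples is exactly the open question of the project.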
Papers to read:
Relevant but in the fMRI domain:
- Learning to Decode Cognitive States from Brain Images, Mitchell et al., 2004
- Predicting Human Brain Activity Associated with the Meanings of Nouns, Mitchell et al., 2008
MEG paper:
- Predicting the recognition of natural scenes from single trial MEG recordings of brain activity, Rieger et al., 2008 (access from CMU domain)
Project A2: Brain imaging data (fMRI)
This data set contains a time series of images of brain activation, measured using fMRI, with one image every 500 msec. During this time, human subjects performed 40 trials of a sentence-picture comparison task (reading a sentence, observing a picture, and determining whether the sentence correctly described the picture). Each of the 40 trials lasts approximately 30 seconds. Each image contains approximately 5,000 voxels (3D pixels), across a large portion of the brain. Data is available for 12 different human subjects.
Available software: we can provide Matlab software for reading the data, manipulating and visualizing it, and for training some types of classifiers (Gaussian Naive Bayes, SVM).
Project A: Bayes network classifiers for fMRI
Project idea: Gaussian Naive Bayes classifiers and SVMs have been used with this data to predict when the subject was reading a sentence versus perceiving a picture. Both of these classify 8-second windows of data into these two classes, achieving around 85% classification accuracy [Mitchell et al., 2004]. This project will explore going beyond the Gaussian Naive Bayes classifier (which assumes voxel activities are conditionally independent) by training a Bayes network, in particular a TAN tree [Friedman et al., 1997]. Issues you'll need to confront include which features to include for classifier input (5000 voxels times 8 seconds of images is a lot of features), whether to train brain-specific or brain-independent classifiers, and a number of issues about efficient computation with this fairly large data set.
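As a reference point for the comparison, the Gaussian Naive Bayes baseline can be sketched in a few lines. This is a minimal illustration, not the provided Matlab software: the function names, the variance floor, and the two-feature toy data are assumptions, and real inputs would be voxel activations per time window.

```python
import math
from collections import defaultdict

def fit_gnb(X, y, var_floor=1e-6):
    """Estimate a class prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor keeps variances strictly positive for constant features
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the highest log joint probability."""
    def log_joint(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
    return max(model, key=log_joint)

# Toy data (hypothetical): two "voxel" features per example.
X = [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]]
y = ["sentence", "sentence", "picture", "picture"]
model = fit_gnb(X, y)
```

The sum over features in `log_joint` is where the conditional independence assumption enters; a TAN model would replace each term with a Gaussian conditioned on one parent feature as well as the class.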
Papers to read: "Learning to Decode Cognitive States from Brain Images", Mitchell et al., 2004; "Bayesian Network Classifiers", Friedman et al., 1997.
Project A3: Genetic Sequence Analysis
We don't currently have a specific dataset in mind for this project, but if you're interested we'll help you find one (ask Field).
One of the most interesting, and controversial, areas of modern science is using people's genetic code to predict things like their likelihood of getting heart disease, their athletic prowess, and even their personality and intelligence. The movie Gattaca shows some of the downsides of this technology, but it can also be immensely helpful if a person takes preventive measures. Also, many drugs work better for people with certain genes. Insurance problems notwithstanding, genetic screening will play a huge role in medicine in the coming decades.
This project is not as well-defined as many others, but the idea is to get hold of genetic data from patients, along with some kind of phenotype marker (like whether they got a disease), and try to find patterns within the genetic code which predict the trait. This area is very exciting because in many cases, people have literally no idea what causal links exist between genes and traits, but finding these links can be a huge boost to both medicine and pure science (by telling scientists which particular gene combinations to examine).
Project A4: Hierarchical Bayes Topic Models
Statistical topic models have recently gained much popularity for managing large collections of text documents. These models make the fundamental assumption that a document is a mixture of topics (as opposed to clustering, in which we assume that a document is generated from a single topic), where the mixture proportions are document-specific and signify how important each topic is to the document. Moreover, each topic is a multinomial distribution over a given vocabulary, which in turn dictates how important each word is for a topic. The document-specific mixture proportions provide a low-dimensional representation of the document in the topic space. This representation captures the latent semantics of the collection and can then be used for tasks like classification and clustering, or merely as a tool to structurally browse the otherwise unstructured collection. The most famous of such models is Latent Dirichlet Allocation, or LDA (Blei et al. 2003). LDA has been the basis for many extensions in text, vision, bioinformatics, and social networks. These extensions incorporate more dependency structures in the generative process, like modeling author-topic dependency, or implement more sophisticated ways of representing inter-topic relationships.
Potential projects include:
- Implement one of the models listed below or propose a new latent topic model that suits a data set in your area of interest
- Implement and compare approximate inference algorithms for LDA, including variational inference (Blei et al. 2003), collapsed Gibbs sampling (Griffiths et al. 2004) and (optionally) collapsed variational inference (Teh et al. 2006). You should compare them over simulated data by varying the corpus generation parameters (number of topics, size of vocabulary, document length, etc.) in addition to comparing them over several real-world datasets.
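For the simulated-data comparison, a corpus can be drawn from the LDA generative process itself. The sketch below is one possible generator under stated assumptions: symmetric Dirichlet hyperparameters `alpha` and `beta`, a fixed document length rather than a random one, and function names chosen for illustration.

```python
import random

def sample_dirichlet(rng, concentration, dim):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, n_topics, vocab_size, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    """Return (topics, corpus); each document is a list of word ids."""
    rng = random.Random(seed)
    # One word distribution per topic.
    topics = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Document-specific topic proportions.
        theta = sample_dirichlet(rng, alpha, n_topics)
        doc = []
        for _ in range(doc_length):
            z = rng.choices(range(n_topics), weights=theta)[0]        # topic of this word
            w = rng.choices(range(vocab_size), weights=topics[z])[0]  # word given topic
            doc.append(w)
        corpus.append(doc)
    return topics, corpus

topics, corpus = generate_corpus(n_docs=5, n_topics=3, vocab_size=20, doc_length=30, seed=1)
```

Because the true `topics` are returned alongside the corpus, each inference algorithm can be scored on how well it recovers them as the number of topics, vocabulary size, and document length are varied.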
Papers:
Inference:
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003. [pdf]
- Griffiths, T., Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101, 5228-5235. [pdf]
- Y.W. Teh, D. Newman and M. Welling. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In NIPS 2006. [pdf]
Expressive Models:
- Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. The Author-Topic Model for authors and documents. In UAI 2004. [pdf]
- Jun Zhu, Amr Ahmed and Eric Xing. MedLDA: Maximum Margin Supervised Topic Models for Regression and Classification. International Conference on Machine Learning (ICML 2009). [pdf]
- D. Blei, J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems 21, 2007. [pdf]
- Wei Li and Andrew McCallum. Pachinko Allocation: Scalable Mixture Models of Topic Correlations. Submitted to the Journal of Machine Learning Research (JMLR), 2008. [pdf]
- L. Fei-Fei and P. Perona. A Bayesian Hierarchical Model for Learning Natural Scene Categories. IEEE Comp. Vis. Patt. Recog. 2005. [PDF]
- L. Cao and L. Fei-Fei. Spatially coherent latent topic model for concurrent object segmentation and classification . IEEE Intern. Conf. in Computer Vision (ICCV). 2007 [PDF]
- Ramesh Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen, Joint Latent Topic Models for Text and Citations. Proceedings of The Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008). [PDF]
- Erosheva, Elena A., Fienberg, Stephen E., and Lafferty, John (2004). Mixed-membership models of scientific publications, Proceedings of the National Academy of Sciences, 97, No. 22, 11885-11892. [PDF]
- E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, Mixed Membership Model for Relational Data. JMLR 2008. [PDF]
- Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical Report UM-CS-2004-096, 2004. [PDF]
- E.P. Xing, W. Fu, and L. Song, A State-Space Mixed Membership Blockmodel for Dynamic Network Tomography, Annals of Applied Statistics, 2009. [PDF]
- S. Shringarpure and E. P. Xing, mStruct: A New Admixture Model for Inference of Population Structure in Light of Both Genetic Admixing and Allele Mutations, Proceedings of the 25th International Conference on Machine Learning (ICML 2008). [PDF]
- Amr Ahmed, Eric P. Xing, William W. Cohen, Robert F. Murphy. Structured Correspondence Topic Models for Mining Captioned Figures in Biological Literature. Proceedings of The Fifteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (KDD 2009) [PDF]
Project B: Image Segmentation Dataset
The goal is to segment images in a meaningful way. Berkeley collected three hundred images and paid students to hand-segment each one (usually each image has multiple hand-segmentations). Two hundred of these images are training images, and the remaining 100 are test images. The dataset includes code for reading the images and ground-truth labels, computing the benchmark scores, and some other utility functions. It also includes code for a segmentation example. This dataset is new and the problem unsolved, so there is a chance that you could come up with the leading algorithm for your project.
http://www.cs.berkeley.edu/projects/vision/grouping/segbench/
Project ideas:
Project B: Region-Based Segmentation
Most segmentation algorithms have focused on segmentation based on edges or based on discontinuity of color and texture. The ground-truth in this dataset, however, allows supervised learning algorithms to segment the images based on statistics calculated over regions. One way to do this is to "oversegment" the image into superpixels (Felzenszwalb 2004, code available) and merge the superpixels into larger segments. Graphical models can be used to represent smoothness in clusters, by adding appropriate potentials between neighboring pixels. In this project, you can address, for example, learning of such potentials, and inference in models with very large tree-width.
Papers to read: Some segmentation papers from
Project C: Twenty Newsgroups text data
This data set contains 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. For documentation and download, see this website.
This data is useful for a variety of text classification and/or clustering projects. The "label" of each article is which of the 20 newsgroups it belongs to. The newsgroups (labels) are hierarchically organized (e.g., "sports", "hockey").
Available software: The same website provides an implementation of a Naive Bayes classifier for this text data. The code is quite robust, and some documentation is available, but it is difficult code to modify.
Project ideas:
- EM text classification in the case where you have labels for some documents, but not for others (see McCallum et al., and come up with your own suggestions)
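One way to picture the semi-supervised idea is EM wrapped around a multinomial Naive Bayes classifier: train on the labeled documents, softly label the unlabeled ones, and retrain on both. The sketch below is a simplified illustration under assumptions (add-one smoothing, equal weight for labeled and unlabeled documents, invented toy documents); it is not the website's implementation and does not claim to reproduce the cited paper's exact procedure.

```python
import math
from collections import Counter

def train_nb(docs, soft_labels, classes, vocab):
    """M-step: fit multinomial Naive Bayes from fractionally labeled documents."""
    prior = {c: 1.0 for c in classes}            # add-one smoothing on class counts
    counts = {c: Counter() for c in classes}
    for doc, weights in zip(docs, soft_labels):
        for c in classes:
            prior[c] += weights[c]
            for w in doc:
                counts[c][w] += weights[c]
    total_prior = sum(prior.values())
    model = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)   # add-one smoothing on words
        log_words = {w: math.log((counts[c][w] + 1.0) / total) for w in vocab}
        model[c] = (math.log(prior[c] / total_prior), log_words)
    return model

def posterior(model, doc):
    """E-step for one document: class probabilities given its words."""
    scores = {c: lp + sum(lw[w] for w in doc if w in lw)
              for c, (lp, lw) in model.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def em_nb(labeled, unlabeled, classes, n_iter=10):
    """Alternate soft-labeling of unlabeled docs with retraining on all docs."""
    vocab = {w for doc, _ in labeled for w in doc} | {w for doc in unlabeled for w in doc}
    labeled_docs = [doc for doc, _ in labeled]
    hard = [{c: float(c == y) for c in classes} for _, y in labeled]
    model = train_nb(labeled_docs, hard, classes, vocab)
    for _ in range(n_iter):
        soft = [posterior(model, doc) for doc in unlabeled]
        model = train_nb(labeled_docs + unlabeled, hard + soft, classes, vocab)
    return model

# Toy documents (hypothetical), already tokenised.
labeled = [(["puck", "ice"], "hockey"), (["bat", "pitch"], "baseball")]
unlabeled = [["puck", "goalie"], ["ice", "goalie"], ["bat", "inning"], ["pitch", "inning"]]
model = em_nb(labeled, unlabeled, ["hockey", "baseball"])
```

In this toy run the words "goalie" and "inning" never appear in a labeled document, yet they acquire class associations through their co-occurrence with labeled words in the unlabeled set.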
Project E: Character recognition (digits) data
Optical character recognition, and the simpler digit recognition task, has been the focus of much ML research. We have two datasets on this topic. The first tackles the more general OCR task, on a small vocabulary of words: (Note that the first letter of each word was removed, since these were capital letters that would make the task harder for you.)
http://ai.stanford.edu/~btaskar/ocr/
Project suggestion:
- Use an HMM to exploit correlations between neighboring letters in the general OCR case to improve accuracy. (Since ZIP codes don't have such constraints between neighboring digits, HMMs will probably not help in the digit case.)
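To illustrate how neighboring-letter correlations can correct a per-character classifier, here is a small Viterbi decoder. The three-letter alphabet and all probabilities are made up for illustration; in a project they would be estimated from the OCR training words.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def logs(table):
    """Convert a nested probability table to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Hypothetical model: true letters are states, noisy classifier outputs are observations.
states = ["q", "u", "n"]
log_start = {s: math.log(1.0 / 3.0) for s in states}
log_trans = logs({
    "q": {"q": 0.01, "u": 0.98, "n": 0.01},   # "q" is almost always followed by "u"
    "u": {"q": 1 / 3, "u": 1 / 3, "n": 1 / 3},
    "n": {"q": 1 / 3, "u": 1 / 3, "n": 1 / 3},
})
log_emit = logs({
    "q": {"q": 0.90, "u": 0.05, "n": 0.05},
    "u": {"q": 0.05, "u": 0.60, "n": 0.35},   # "u" and "n" are easily confused
    "n": {"q": 0.05, "u": 0.35, "n": 0.60},
})
decoded = viterbi(["q", "n"], states, log_start, log_trans, log_emit)
```

The classifier reads the second character as "n", but the transition model makes "u" after "q" far more likely, so the decoded sequence corrects it.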
Project F: NBA statistics data
This download contains 2004-2005 NBA and ABA stats for:
- Player regular season stats
- Player regular season career totals
- Player playoff stats
- Player playoff career totals
- Player all-star game stats
- Team regular season stats
- Complete draft history
- coaches_season.txt - NBA coaching records by season
- coaches_career.txt - NBA career coaching records
Currently all of the regular season
Project idea:
- Outlier detection on the players: find out who the outstanding players are.
- Predict the game outcome.
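A very simple starting point for the outlier idea is to flag players whose statistic lies far from the league mean in standard-deviation units. The player names, the numbers, and the threshold below are invented for illustration; a project would likely use several statistics and a more robust method.

```python
import statistics

def z_scores(values):
    """Standardise each value against the mean and population standard deviation."""
    data = list(values.values())
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    if sigma == 0:
        return {name: 0.0 for name in values}
    return {name: (v - mu) / sigma for name, v in values.items()}

def outliers(values, threshold=1.5):
    """Names whose absolute z-score exceeds the threshold."""
    return [name for name, z in z_scores(values).items() if abs(z) > threshold]

# Hypothetical points-per-game figures.
points_per_game = {
    "player_a": 10.0, "player_b": 11.0, "player_c": 30.0,
    "player_d": 9.0, "player_e": 10.0,
}
flagged = outliers(points_per_game)
```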
Project G: Precipitation data
This dataset includes 45 years of daily precipitation data from the Northwest of the US:
http://www.jisao.washington.edu/data_sets/widmann/
Project ideas:
- Weather prediction: Learn a probabilistic model to predict rain levels.
- Sensor selection: Where should you place sensors to best predict rain?
Project H: WebKB
This dataset contains webpages from 4 universities, labeled with whether they are professor, student, project, or other pages.
http://www-2.cs.cmu.edu/~webkb/
Project ideas:
- Learning classifiers to predict the type of webpage from the text
- Can you improve accuracy by exploiting correlations between pages that point to each other using graphical models?
Papers:
Project I: Deduplication
The datasets provided below consist of lists of records, and the goal is to identify, for any dataset, the sets of records which refer to the same unique entity. This problem is known by the varied names of Deduplication, Identity Uncertainty and Record Linkage.
http://www.cs.utexas.edu/users/ml/riddle/data.html
Project Ideas:
- One common approach is to cast the deduplication problem as a classification problem. Consider the set of record-pairs, and classify them as either "unique" or "not-unique".
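The pairwise formulation can be sketched as follows: enumerate record pairs, compute a similarity feature for each pair, and decide per pair. In this illustration a fixed threshold on token-level Jaccard similarity stands in for a learned classifier; the example records, the tokenisation, and the threshold are all assumptions.

```python
from itertools import combinations

def tokens(record):
    """Lower-case the record and split it into a set of word tokens."""
    return set(record.lower().replace(",", " ").replace(".", " ").split())

def jaccard(a, b):
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def duplicate_pairs(records, threshold=0.5):
    """Index pairs whose similarity meets the threshold (a stand-in for a classifier)."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(records), 2)
            if jaccard(a, b) >= threshold]

# Hypothetical citation-style records.
records = [
    "J. Smith, Machine Learning, 1997",
    "John Smith. Machine learning. 1997",
    "A. Jones, Databases, 2001",
]
pairs = duplicate_pairs(records)
```

A learned version would replace the single threshold with a classifier over several such pair features (per-field similarities, edit distances) trained on pairs labeled as duplicates or non-duplicates.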
Papers:
Project J: Email Annotation
The datasets provided below are sets of emails. The goal is to identify which parts of the email refer to a person name. This task is an example of the general problem area of Information Extraction.
http://www.cs.cmu.edu/~einat/datasets.html
Project Ideas:
- Model the task as a Sequential Labeling problem, where each email is a sequence of tokens, and each token can have either a label of "person-name" or "not-a-person-name".
Papers: http://www.cs.cmu.edu/~einat/email.pdf
Project K: Netflix Prize Dataset
The Netflix Prize data set gives 100 million records of the form "user X rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize
Project idea:
- Can you predict the rating a user will give on a movie from the movies that user has rated in the past, as well as the ratings similar users have given similar movies?
- Can you discover clusters of similar movies or users?
- Can you predict which users rated which movies in 2006? In other words, your task is to predict the probability that each pair was rated in 2006. Note that the actual rating is irrelevant, and we just want whether the movie was rated by that user sometime in 2006. The date in 2006 when the rating was given is also irrelevant. The test data can be found at this website.
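For the first question, a common baseline is user-based collaborative filtering: predict a rating as the user's own mean plus a similarity-weighted average of how other users deviated from their means on that movie. The sketch below assumes ratings stored as nested dictionaries and uses invented toy data; it is a starting point, not a competitive Netflix Prize method.

```python
import math

def user_mean(user_ratings):
    """Average rating a user has given."""
    return sum(user_ratings.values()) / len(user_ratings)

def similarity(ra, rb):
    """Correlation-style similarity of mean-centred ratings on co-rated movies."""
    shared = set(ra) & set(rb)
    ma, mb = user_mean(ra), user_mean(rb)
    num = sum((ra[m] - ma) * (rb[m] - mb) for m in shared)
    den = (math.sqrt(sum((ra[m] - ma) ** 2 for m in shared))
           * math.sqrt(sum((rb[m] - mb) ** 2 for m in shared)))
    return num / den if den else 0.0

def predict(ratings, user, movie):
    """User's mean plus the similarity-weighted deviation of other users on this movie."""
    base = user_mean(ratings[user])
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        w = similarity(ratings[user], other_ratings)
        num += w * (other_ratings[movie] - user_mean(other_ratings))
        den += abs(w)
    return base + num / den if den else base

# Toy ratings (hypothetical): u2 agrees with u1, u3 disagrees.
ratings = {
    "u1": {"m1": 5, "m2": 1},
    "u2": {"m1": 5, "m2": 1, "m3": 5},
    "u3": {"m1": 1, "m2": 5, "m3": 1},
}
predicted = predict(ratings, "u1", "m3")
```

Both neighbors push the prediction upward here: the like-minded user rated the movie above their mean, and the opposite-minded user rated it below theirs.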
Project L: Physiological Data Modeling (bodymedia)
Physiological data offers many challenges to the machine learning community including dealing with large amounts of data, sequential data, issues of sensor fusion, and a rich domain complete with noise, hidden variables, and significant effects of context.
1. Which sensors correspond to each column?
characteristic1 | age |
characteristic2 | handedness |
sensor1 | gsr_low_average |
sensor2 | heat_flux_high_average |
sensor3 | near_body_temp_average |
sensor4 | pedometer |
sensor5 | skin_temp_average |
sensor6 | longitudinal_accelerometer_SAD |
sensor7 | longitudinal_accelerometer_average |
sensor8 | transverse_accelerometer_SAD |
sensor9 | transverse_accelerometer_average |
2. What are the activities behind each annotation?
The annotations for the contest were:
5102 = sleep
3104 = watching TV
Datasets can be downloaded from http://www.cs.utexas.edu/users/sherstov/pdmc/
Project idea:
- Behavior classification: classify the person based on the sensor measurements.
Project M: Object Recognition
The Caltech 256 dataset contains images of 256 object categories taken at varying orientations, varying lighting conditions, and with different backgrounds.
http://www.vision.caltech.edu/Image_Datasets/Caltech256/
Project ideas:
- You can try to create an object recognition system which can identify which object category is the best match for a given test image.
- Apply clustering to learn object categories without supervision
Project N: Learning POMDP structure so as to maximize utility
Hoey & Little (CVPR 04) show how to learn the state space, and parameters, of a POMDP so as to maximize utility in a visual face gesture recognition task. (This is similar to the concept of "utile distinctions" developed in Andrew McCallum's PhD thesis.) The goal of this project is to reproduce Hoey's work in a simpler (non-visual) domain, such as McCallum's driving task.
Project O: Learning partially observed MRFs: the Langevin algorithm
In the recently proposed exponential family harmonium model (Welling et al., Xing et al.), a contrastive divergence (CD) algorithm was used to learn the parameters of the model (essentially a partially observed, two-layer MRF). In Xing et al., a comparison to variational learning was performed. CD is essentially a gradient ascent algorithm in which the gradient is approximated by a few samples. The Langevin method adds a random perturbation to the gradient and can often help the learning process escape local optima. In this project you will implement the Langevin learning algorithm for Xing's dual-wing harmonium model, and test your algorithm on the data in my UAI paper. See Zoubin Ghahramani's paper on Bayesian learning of MRFs for reference.
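The core of the Langevin update is a gradient step plus Gaussian noise scaled to the step size. The sketch below shows that update on a one-dimensional toy objective, log p(theta) = -theta^2 / 2, chosen only because its gradient is exact and its behavior is easy to check. It is not the harmonium model, where the gradient would itself be a sample-based approximation; the step size and chain length are assumptions.

```python
import math
import random

def langevin_step(theta, grad, step_size, rng):
    """One Langevin update: half a step along the gradient plus Gaussian noise."""
    noise = rng.gauss(0.0, 1.0)
    return theta + 0.5 * step_size * grad + math.sqrt(step_size) * noise

def grad_log_density(theta):
    """Gradient of the toy objective log p(theta) = -theta^2 / 2."""
    return -theta

def run_chain(n_steps=20000, step_size=0.1, seed=0):
    """Iterate the update from a poor starting point and record every iterate."""
    rng = random.Random(seed)
    theta, samples = 3.0, []
    for _ in range(n_steps):
        theta = langevin_step(theta, grad_log_density(theta), step_size, rng)
        samples.append(theta)
    return samples

samples = run_chain()
burned = samples[1000:]                      # discard the initial transient
sample_mean = sum(burned) / len(burned)
sample_var = sum((s - sample_mean) ** 2 for s in burned) / len(burned)
```

Setting the noise term to zero recovers plain gradient ascent; with the noise, the iterates wander around the optimum with a spread close to that of the target density, which is what lets the method leave shallow local optima.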
Project P: Context-specific independence
We learned in class that CSI can speed up inference. In this project, you can explore this further. For example, implement the recursive conditioning approach of Adnan Darwiche, and compare it to variable elimination and clique trees. When is recursive conditioning faster? Can you find practical BNs where the speed-up is considerable? Can you learn such BNs from data?
Project Q: Enron E-mail Dataset
The Enron E-mail data set contains about 500,000 e-mails from about 150 users. The data set is available here: Enron Data
Project ideas:
- Can you classify the text of an e-mail message to decide who sent it?
Project R: More data
There are many other datasets out there. UC Irvine has a repository that could be useful for your project:
http://www.ics.uci.edu/~mlearn/MLRepository.html
Sam Roweis also has a link to several datasets out there: