10708 Probabilistic Graphical Models

Course Project

Your class project is an opportunity for you to explore an interesting problem in the context of real-world data sets. Projects should be done in teams of three students. Each project will be assigned a TA as a project consultant/mentor; instructors and TAs will consult with you on your ideas, but of course the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 40% of your final class grade, and will have 4 deliverables:

Proposal : 2 pages excluding references (10%)
Due : 12pm Feb 17
Midway Report : 5 pages excluding references (20%)
Due : TBD
Final Report : 8 pages excluding references (40%)
email to pgm.asst.2016@gmail.com; include project name in the email subject
Due : May 6, 11:59pm
Presentation : (30%)
April 29th, Baker Hall A51:
Session I: 8:30am -- 12:30pm (4 hrs)
Lunch break: 12:30pm -- 1:30pm (1 hr)
Session II: 1:30pm -- 5:00pm (3.5 hrs)

All write-ups should use the ICML style.

Team Formation

You are responsible for forming project teams of 3 people. In some cases, we will also accept teams of 2, but a 3-person group is preferred. Once you have formed your group, please send one email per team to the class instructor list with the names of all team members. If you have trouble forming a group, please send us an email and we will help you find project partners.

Project Suggestions

Please see the list of suggested projects posted here.

Project Proposal

You must turn in a brief project proposal that provides an overview of your idea and also contains a brief survey of related work on the topic. We will provide a list of suggested project ideas for you to choose from, and we strongly suggest using one of these ideas, though you may discuss other project ideas with us, whether applied or theoretical. Note that even though you can use data sets you have used before, you cannot use work that you started prior to this class as your project.

Proposals should be approximately two pages long, and should include the following information:

Project title and list of group members.
Overview of project idea. This should be approximately half a page long.
A short literature survey of 5 or more relevant papers. The literature review should take up approximately one page.
Description of potential data sets to use for the experiments.
Plan of activities, including what you plan to complete by the midway report and how you plan to divide up the work.

The grading breakdown for the proposal is as follows:

40% for clear and concise description of proposed method
40% for literature survey that covers at least 5 relevant papers
10% for plan of activities
10% for quality of writing

The project proposal will be due on Wednesday, February 17th, and a printed hard copy should be submitted in class.

Midway Report

The midway report will serve as a check-point at the halfway mark of your project. It should be about 5 pages long, and should be formatted like a conference paper, with the following sections: introduction, background & related work, methods, experiments, conclusion. The introduction and related work sections should be in their final form; the section on the proposed methods should be almost finished; the sections on the experiments and conclusions will have the results you have obtained, perhaps with place-holders for the results you plan/hope to obtain.

The grading breakdown for the midway report is as follows:

20% for introduction and literature survey
40% for proposed method
20% for the design of upcoming experiments and revised plan of activities (in an appendix, please show the old and new activity plans)
10% for data collection and preliminary results
10% for quality of writing

The project midway report will be due on Wednesday, March 23rd, and a printed hard copy should be submitted in class.

Final Report

Your final report is expected to be 8 pages excluding references, in accordance with the length requirements for an ICML paper. It should have roughly the following format:

Introduction: problem definition and motivation
Background & Related Work: background info and literature survey
Methods

Overview of your proposed method
Intuition on why it should be better than the state of the art
Details of models and algorithms that you developed

Experiments

Description of your testbed and a list of questions your experiments are designed to answer
Details of the experiments and results

Conclusion: discussion and future work

The grading breakdown for the final report is as follows:

10% for introduction and literature survey
30% for proposed method (soundness and originality)
30% for correctness, completeness, and difficulty of experiments and figures
10% for empirical and theoretical analysis of results and methods
20% for quality of writing (clarity, organization, flow, etc.)

The final report will be due on May 6th, and should be submitted electronically to pgm.asst.2016@gmail.com. Note that we will be posting all of the reports on the class webpage once the semester ends.

Presentation

All project teams will present their work at the end of the semester. Each team will be given a timeslot during which they will give a slide presentation to the class, similar in style to a conference presentation. If applicable, live demonstrations of your software are highly encouraged. The presentations will be held in Baker Hall A51 over two 3-4 hour sessions on April 29. The schedule for presentations is here.

Project Suggestions:

If you are interested in a particular project, please contact the respective Contact Person to get further ideas or details.
We may add more project suggestions down the road.

1) Efficient correlated topic modeling via embedding

Latent Dirichlet Allocation (LDA) is one of the most popular topic models used for inferring latent topics of document corpora. A limitation of LDA is its inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than about X-ray astronomy. This limitation stems from the use of the Dirichlet distribution to model the variability among the topic proportions. D. Blei and J. Lafferty [1] developed the correlated topic model (CTM), where the topic proportions exhibit correlation via the logistic normal distribution. Topic correlations are captured through the KxK covariance matrix of the normal distribution, where K is the number of topics. The limitation of CTM is its high inference complexity due to frequent matrix inverse operations (O(K^3)). Here I propose a new method to model topic correlations by embedding latent topics in a low-dimensional vector space. The intuition is that the closer two topics are in the vector space, the more correlated they are with each other. Through the low-dimensional embedding, the inference complexity can be reduced significantly, enabling industrial-scale applications of correlated topic modeling. I already have a preliminary model. Your job is to further improve the model and empirically verify the modeling choices on various tasks.
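To make the logistic normal construction concrete, the following is a minimal sketch (not the preliminary model mentioned above) of how CTM-style topic proportions can be drawn: sample a Gaussian vector with a full covariance matrix, then map it onto the simplex with a softmax. The function names, the 3-topic covariance values, and the seed are illustrative assumptions. The Cholesky factorization of the KxK covariance is one place where a cubic cost in K shows up.

```python
import math
import random


def cholesky(cov):
    """Return lower-triangular L with L * L^T == cov (cov must be symmetric positive definite)."""
    k = len(cov)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L


def sample_topic_proportions(mu, cov, rng):
    """Draw eta ~ N(mu, cov) and map it to the simplex with a softmax (logistic normal)."""
    k = len(mu)
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    eta = [mu[i] + sum(L[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
    shift = max(eta)  # subtract the max for numerical stability
    weights = [math.exp(e - shift) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical 3-topic example: topics 0 and 1 are positively correlated,
# topic 2 is uncorrelated with both.
mu = [0.0, 0.0, 0.0]
cov = [[1.0, 0.8, 0.0],
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
rng = random.Random(0)
theta = sample_topic_proportions(mu, cov, rng)
print(theta)  # three positive proportions that sum to 1
```

A Dirichlet prior cannot express the positive correlation between topics 0 and 1 in this toy covariance, which is the motivation for the logistic normal in CTM.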

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] D. Blei and J. Lafferty, ``Correlated topic models'', NIPS 2006.

2) Asynchronous distributed Dirichlet process

Bayesian nonparametric (BNP) methods provide a powerful framework for learning the internal structure of data. For example, Dirichlet processes can be used to cluster with an unbounded number of centers; their derivatives are used in applications such as storyline tracking, taxonomy induction, and image segmentation, to name a few. However, practical applications of BNP models have unfortunately been limited by their high computational complexity and poor scaling on large data. Though there has been effort on distributed inference algorithms for BNP models, these algorithms suffer from expensive synchronization, unbalanced workload, and inaccurate approximation (e.g., [1]). The recent work [2] greatly scales up a tree-structured BNP model using data- and model-parallel variational inference, but its scalability is still limited by the need for global barriers when switching between different parallelism schemes. In this project, your job is to develop a global-barrier-free distributed inference method for the Dirichlet process with minimum parallel error, to truly improve its scalability and boost its utility on industrial-scale problems. The most recent work [3] can be a useful reference.

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] P. Smyth et al., ``Asynchronous distributed learning of topic models'', NIPS 2009
[2] Z. Hu et al., ``Large-scale Distributed Dependent Nonparametric Trees'', ICML 2015
[3] T. Campbell et al., ``Streaming, Distributed Variational Inference for Bayesian Nonparametrics'', NIPS 2015

3) Distributed Gaussian process on GPUs

Gaussian processes (GPs) are rich distributions over functions, which provide a Bayesian nonparametric approach to smoothing and interpolation. GPs have been widely used in regression and classification, Bayesian optimization, reinforcement learning, etc. However, GPs are typically unable to scale to large modern datasets because the training complexity is cubic in the number of data points. One line of research scales up GPs with sophisticated parallelization and GPU acceleration (e.g., [1]). On the other hand, a recent work proposed KISS-GP [2], which reduces the complexity from cubic to near-linear. In this project, your job is to combine these two lines by parallelizing KISS-GP on distributed computer clusters and adapting the computation to use GPUs efficiently.
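As a point of reference for where the cubic cost comes from, here is a minimal sketch of exact GP regression (posterior mean only) in plain Python. It is not KISS-GP and not the method of [1]; the RBF kernel, the noise level, and the toy sine data are assumptions chosen for illustration. The Cholesky factorization of the n x n kernel matrix is the O(n^3) step that scalable methods try to avoid or distribute.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) kernel between two scalar inputs."""
    return math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L


def solve_cholesky(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior_mean(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean of a zero-mean GP with an RBF kernel at the test inputs."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)                      # O(n^3): the scalability bottleneck
    alpha = solve_cholesky(L, y_train)   # alpha = (K + noise * I)^{-1} y
    return [sum(rbf(xs, x_train[i]) * alpha[i] for i in range(n)) for xs in x_test]


# Toy data (assumed for illustration): noiseless samples of sin(x)
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(x) for x in x_train]
mean = gp_posterior_mean(x_train, y_train, [1.0, 1.5])
print(mean)  # the first value is very close to sin(1.0), since 1.0 is a training input
```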

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] Z. Dai et al., ``Gaussian Process Models with Parallelization and GPU acceleration'', NIPS workshop 2014
[2] A. Wilson et al., ``Kernel interpolation for scalable structured Gaussian processes'', ICML 2015

4) Dynamic topic model for topic split and merge

Uncovering topic evolution in text streams helps people keep abreast of hot, new, and intertwining events/topics. Existing dynamic nonparametric topic models (e.g., [1]) assume that a topic at a given timestamp is either a new topic or derived from a single topic at the previous timestamp. However, in real life, topics interact with each other and split/merge frequently. A recent work [2] designed a great visualization of topic split/merge, which, however, is created by post-processing the output of [1] and can lead to suboptimal results. Can we have a unified model that directly captures the splitting/merging topic flow? In this project, your job is to develop such a powerful and useful dynamic nonparametric topic model.

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] J. Zhang et al., ``Evolutionary Hierarchical Dirichlet Processes for Multiple Correlated Time-varying Corpora''. KDD 2010
[2] W. Cui et al., ``TextFlow: Towards Better Understanding of Evolving Topics in Text''

5) Deep learning model on graph-structured data

Deep learning (DL) methods, e.g., deep neural networks (DNNs), have achieved huge success in various tasks involving image and text data. One of the most notable advantages of DL is that it enables end-to-end learning without the need for feature engineering. However, very few DL methods have been developed to handle graph-structured data. One example is a recent work [1], which proposed a convolutional network that takes in molecular graphs. Compared to images and sentences, which usually have fixed size and regular shape, graph structures are highly flexible. Directly operating on graph-structured data is thus hard. Considering the ubiquity of network data, developing such an end-to-end neural network could significantly extend the application scope of DL and achieve record-breaking results in various graph-related tasks, as has been done in image classification, text translation, etc. Your job is to develop such a deep learning model. One possible solution is to integrate probabilistic graphical models with deep neural networks.
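One generic way to cope with graphs of varying size and shape is to let every node aggregate the features of its neighbors and then pool all node vectors into a single fixed-size vector. The sketch below illustrates only that aggregation-and-readout pattern; it has no learned weights or nonlinearities, it is not the architecture of [1] or [2], and the function names and the toy path graph are assumptions.

```python
def aggregate_neighbors(features, adjacency):
    """For every node, sum its own feature vector with those of its neighbors."""
    out = {}
    for node, feat in features.items():
        total = list(feat)
        for nb in adjacency.get(node, []):
            total = [t + f for t, f in zip(total, features[nb])]
        out[node] = total
    return out


def graph_readout(node_features):
    """Sum all node vectors into one fixed-size vector, whatever the graph size."""
    dim = len(next(iter(node_features.values())))
    pooled = [0.0] * dim
    for feat in node_features.values():
        pooled = [p + f for p, f in zip(pooled, feat)]
    return pooled


# Toy path graph a - b - c with 2-dimensional node features (assumed values)
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

hidden = aggregate_neighbors(features, adjacency)
print(hidden)                 # {'a': [1.0, 1.0], 'b': [2.0, 2.0], 'c': [1.0, 2.0]}
print(graph_readout(hidden))  # [4.0, 5.0]
```

Because both steps are sums, the output does not depend on the order in which nodes are listed, and its length depends only on the feature dimension, not on the number of nodes.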

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] D. Duvenaud et al., ``Convolutional Networks on Graphs for Learning Molecular Fingerprints'', NIPS 2015
[2] M. Henaff et al., ``Deep Convolutional Networks on Graph-Structured Data'', Arxiv 2015.

6) Bayesian learning of convolutional neural networks

Plain feedforward neural networks are prone to overfitting. When applied to supervised or reinforcement learning problems, these networks are also often incapable of correctly assessing the uncertainty in the training data and so make overly confident decisions about the correct class, prediction or action. We can address these issues through Bayesian learning, which introduces uncertainty (expressed and measured by probabilities) in the weights of the networks [1,2,3]. Previous work usually assumes an independent Gaussian prior over each of the weights. In this project, your job is to improve the Bayesian learning method. For example, a simple approach for convolutional neural networks (Convnets) is to introduce structure in the Gaussian prior consistent with the Convnet structure. Backpropagation with variational inference and MCMC can be used for training.
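To illustrate what uncertainty in the weights means in practice, the sketch below keeps an independent Gaussian over each weight of a single linear unit, samples weights as mean plus standard deviation times noise, and reports the spread of the resulting predictions. The softplus parameterization of the standard deviation follows the style of [1]; everything else (the function names, the one-unit model, the parameter values) is an assumption for illustration, and no training is performed.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained parameter rho to a positive standard deviation."""
    return math.log1p(math.exp(rho))


def sample_weights(mu, rho, rng):
    """Draw each weight independently as w = mu + softplus(rho) * eps, eps ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


def predictive_mean_std(x, mu, rho, rng, num_samples=1000):
    """Monte Carlo estimate of the mean and std of the linear output w . x."""
    outputs = []
    for _ in range(num_samples):
        w = sample_weights(mu, rho, rng)
        outputs.append(sum(wi * xi for wi, xi in zip(w, x)))
    mean = sum(outputs) / num_samples
    var = sum((o - mean) ** 2 for o in outputs) / num_samples
    return mean, math.sqrt(var)


# Hypothetical variational parameters for a unit with two inputs
rng = random.Random(0)
mean, std = predictive_mean_std([1.0, 1.0], mu=[1.0, 2.0], rho=[0.0, 0.0], rng=rng)
print(mean, std)  # mean near 3.0; std near sqrt(2) * softplus(0)
```

A structured prior, as suggested above for Convnets, would replace the independent draws in `sample_weights` with draws that are correlated across weights.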

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] C. Blundell et al., ``Weight Uncertainty in Neural Networks'', ICML 2015
[2] J. Hernández-Lobato and R. Adams, ``Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks'', ICML 2015
[3] R. Neal, ``Bayesian Learning for Neural Networks'', Springer 1996

7) Gaussian graphical models with confounders

The popular graphical lasso method [1] proposes to learn the structure of Gaussian graphical models by optimizing an L1-penalized log-likelihood. Recently, two variants of this approach have been proposed to learn Gaussian graphical models in the presence of confounding variables. The first method [2,3] assumes that these confounding variables are observed and then conditions on them, leading to the conditional Gaussian graphical model (CGGM). The other approach [4] is to assume that these confounding variables are few in number and then marginalize over them, leading to the latent variable Gaussian graphical model (LV-GGM). What happens if we unify these two approaches to account for both observed and unobserved confounding variables? Your job is to either study the statistical properties of such a model, or to propose a fast and scalable algorithm to solve the optimization problem.
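For reference, the L1-penalized log-likelihood optimized by the graphical lasso [1] can be written as follows. The symbols are introduced here and do not appear in the text above: S is the empirical covariance matrix of the data, \Theta is the precision (inverse covariance) matrix being estimated, \lambda \ge 0 is the regularization weight, and \lVert \Theta \rVert_1 is the sum of the absolute values of the entries of \Theta.

```latex
\hat{\Theta} \;=\; \arg\max_{\Theta \succ 0} \;
  \log\det\Theta \;-\; \operatorname{tr}(S\Theta) \;-\; \lambda \lVert \Theta \rVert_1
```

Zero entries of the estimate correspond to missing edges in the Gaussian graphical model, which is why the L1 penalty performs structure learning.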

Contact Person: Calvin McCarter (calvinm AT cmu DOT edu)

References:
[1] J. Friedman et al., ``Sparse inverse covariance estimation with the graphical lasso.'' Biostatistics 2008
[2] K.A. Sohn and S. Kim. ``Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization.'' AISTATS 2012
[3] M. Wytock and Z. Kolter. ``Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting.'' ICML 2013
[4] V. Chandrasekaran et al. ``Latent variable graphical model selection via convex optimization.'' The Annals of Statistics 2012

8) Structure learning for SNP-gene networks with dropout regularization

Dropout training [1] is a popular method for regularizing deep neural networks. Recent work [2] indicates that dropout is adaptive, and that it avoids over-penalizing features that are useful but rare. In this project, we want to find out if dropout is helpful for graphical model structure learning. One potential application is for learning conditional Gaussian graphical models [3] which have been used for learning SNP-gene networks. One challenge with learning SNP-gene networks is that many mutations are rare but highly influential, so dropout may be particularly useful in this area. Another challenge when working with SNP data is that nearby SNPs tend to be highly correlated, so as part of your project, you can try to determine whether dropout improves feature selection when features are correlated.
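As a reminder of the basic mechanism, here is a minimal sketch of dropout noise applied to a feature vector. It uses the common "inverted" variant, which rescales the surviving features during training so that the noisy vector equals the original in expectation; [1] instead rescales at test time. The function name and the example values are assumptions, and how such noise should enter a structure learning objective is exactly the open question of this project.

```python
import random


def dropout(features, drop_prob, rng):
    """Zero each feature with probability drop_prob; rescale survivors by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in features]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
noisy = dropout(x, 0.5, rng)
print(noisy)  # each entry is either 0.0 or twice the original value
```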

Contact Person: Calvin McCarter (calvinm AT cmu DOT edu)

References:
[1] N. Srivastava et al. ``Dropout: a simple way to prevent neural networks from overfitting.'' JMLR 2014
[2] S. Wager et al. ``Dropout training as adaptive regularization.'' NIPS 2013
[3] K.A. Sohn and S. Kim. ``Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization.'' AISTATS 2012

9) Semi-supervised Dirichlet Process

Semi-supervised clustering is the task of grouping data points into clusters when only a fraction of the points are labelled. The true number of clusters in the data is often unknown. Dirichlet process (DP) mixture models, though previously applied in the conventional (i.e., unsupervised) clustering setting, are appealing as they can infer the number of clusters from the data. In this project, your job is to adapt the DP to the semi-supervised setting. An interesting real application is zero-shot learning. Here are a few relevant references.
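The reason a DP can leave the number of clusters open is easiest to see through its Chinese restaurant process (CRP) representation: the n-th point joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to the concentration parameter alpha. The sketch below samples cluster assignments from this prior only; it involves no data likelihood and no labels, so it is the unsupervised starting point rather than a semi-supervised method. The function name and parameter values are assumptions.

```python
import random


def sample_crp(num_points, alpha, rng):
    """Sample cluster assignments for num_points items from a CRP with concentration alpha."""
    assignments = []
    counts = []  # counts[k] = number of points currently in cluster k
    for n in range(num_points):
        # Existing cluster k has weight counts[k]; a new cluster has weight alpha.
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        chosen = len(counts)  # default: open a new cluster
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments, counts


assignments, counts = sample_crp(100, 1.0, random.Random(0))
print(len(counts), counts)  # the number of clusters is not fixed in advance
```

In a semi-supervised variant, the labelled points would presumably constrain some of these assignments; how best to do so is left to the project.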

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] F. Akova et al., ``Self-adjusting Models for Semi-supervised Learning in Partially-observed Settings'', ICDM 2012.
[2] Hal Daume III and Daniel Marcu. ``A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior'', JMLR 2005
[3] Amar Shah and Zoubin Ghahramani. ``Determinantal Clustering Process - A Nonparametric Bayesian Approach to Kernel Based Semi-Supervised Clustering'', UAI 2013

10) Video concept embedding

There has been growing interest in distributed representations that learn compact vectors (a.k.a. embeddings) for linguistic items such as words [1], phrases, and entities [2]. The induced vectors are expected to capture the semantic relatedness of the linguistic items. For example, Mikolov et al. [1] show that the word vectors induced by the skip-gram model exhibit the linguistic regularity that v(``king'')-v(``man'')+v(``woman'') ≈ v(``queen''), where v(``x'') is the vector representation of the word x. Can we extend this to the visual domain to induce embeddings that capture the regularities underlying the concepts within videos? A promising method is to first extract concepts from videos using existing concept detection tools (e.g., [3]), and then adapt the skip-gram model to learn the concept embeddings. A few other recent works (e.g., [4]) might also be inspiring.
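The analogy test can be sketched in a few lines: form the vector v(a) - v(b) + v(c) and return the vocabulary item whose embedding has the highest cosine similarity to it, excluding the three query items. The 2-dimensional vectors below are hand-made toy values (one axis loosely for royalty, one for gender), not learned embeddings, and the function names are assumptions; with real skip-gram vectors the match is only approximate.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the item closest (by cosine) to v(a) - v(b) + v(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy embeddings for illustration only
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

For video concepts, the same query would run unchanged over concept embeddings once they have been learned.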

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu)

References:
[1] T. Mikolov et al., ``Distributed Representations of Words and Phrases and their Compositionality'', NIPS 2013.
[2] Z. Hu et al., ``Entity Hierarchy Embedding'', ACL 2015
[3] S. Assari et al., ``Video Classification using Semantic Concept Co-occurrences'', CVPR 2014
[4] S. Kottur et al., ``Visual Word2Vec(vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes'', Arxiv (http://arxiv.org/abs/1511.07067)

11) Optical character recognition using recurrent neural networks

Optical character recognition (OCR) is the task of converting images of typed, handwritten or printed text into machine-encoded text. As a high-impact real-world application, OCR has attracted plenty of research attention. A few recent works (e.g., [1,2]) use deep learning techniques to tackle this problem. Can we further improve over these methods? Inspired by recent successes in speech recognition [3], a promising method is to adapt the speech recurrent neural network to the OCR setting. In this project, your job is to develop such a model and validate it on real datasets.

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu); Yuntian Deng (yuntiand AT cs DOT cmu DOT edu)

References:
[1] Hui Li and Chunhua Shen. ``Reading Car License Plates Using Deep Convolutional Neural Networks and LSTMs'', Arxiv (http://arxiv.org/abs/1601.05610)
[2] P. He et al., ``Reading Scene Text in Deep Convolutional Sequences''. Arxiv (http://arxiv.org/abs/1506.04395)
[3] D. Amodei et al., ``Deep Speech 2: End-to-End Speech Recognition in English and Mandarin''. Arxiv (http://arxiv.org/abs/1512.02595)

12) Optical graph recognition

Optical character recognition (OCR) is the task of converting images of typed, handwritten or printed text into machine-encoded text. OCR has been limited to handling text (e.g., English letters, mathematical symbols, etc.). Can we extend this task to recognize graphs (e.g., circles and lines in a mathematical graph), i.e., optical graph recognition (OGR)? A potential method to this end is to first define a set of items that you aim to recognize, a representation of each of these items, and a typesetting model of how these latent items generate the observations (i.e., a generative model). The following references [1,2] on conventional OCR might be inspiring.

Contact Person: Zhiting Hu (zhitingh AT cs DOT cmu DOT edu); Yuntian Deng (yuntiand AT cs DOT cmu DOT edu)

References:
[1] Taylor Berg-Kirkpatrick and Dan Klein, ``Improved Typesetting Models for Historical OCR''. ACL 2014
[2] T. Berg-Kirkpatrick et al., ``Unsupervised Transcription of Historical Documents''. ACL 2013