Advanced Algorithms and Models for Computational Biology

10-810, Spring 2007

School of Computer Science, Carnegie Mellon University

Course Project

Your class project is an opportunity for you to explore an interesting multivariate analysis problem of your choice in the context of a real-world data set. Projects can be done by you as an individual, or in teams of two to three students. Each project will also be assigned a course instructor as a project consultant/mentor. They will consult with you on your ideas, but the final responsibility to define and execute an interesting piece of work is yours. Your project will be worth 30% of your final class grade, and will have two final deliverables:

1. a writeup in the form of an IEEE paper (6-8 pages maximum in IEEE format, including references), due May 10, worth 60% of the project grade, and

2. a poster and an oral presentation of your work at a special class session at the end of the semester, on May 10, worth 20% of the project grade.

In addition, you must turn in a project proposal (2 pages maximum in IEEE format, including references) stating the members, the problem they will work on, related literature, proposed plan and expected outcome by Mar 22, worth 20% of the project grade. Note that, as with any conference, the page limits are strict! Papers over the limit will not be considered.

Project Proposal:

You must turn in a brief project proposal (1-2 pages) by Mar 22.

You are encouraged to come up with a topic directly related to your own current research project, or a research topic related to graphical models of your own interest, that bears a non-trivial technical component (either theoretical or application-oriented). The proposed work must be new and should not be copied from your previous published or unpublished work. For example, research on graphical models that you did this summer does not count as a class project.

You may use the list of available datasets provided below and pick a “less adventurous” project from the following list of potential project ideas. These data sets have been successfully used for machine learning in the past, and you can compare your results with those reported in the literature. Of course, you can also choose to work on a new problem beyond our list using the provided datasets.

Project proposal format: Proposals should be one page maximum. Include the following information:

· Project title

· Project idea. This should be approximately two paragraphs.

· Software you will need to write.

· Papers to read. Include 1-3 relevant papers. You will probably want to read at least one of them before submitting your proposal.

· Teammate(s): will you have teammate(s)? If so, whom? Maximum team size is three students.

· Mar 22 milestone: What will you complete by Mar 22? Experimental results of some kind are expected here.

Project suggestions:

· Ideally, you will want to pick a problem in a domain of your interest, e.g., DNA sequence analysis, genetic polymorphisms, regulatory networks, etc., and formulate your problem using a statistical machine learning formalism. You can then, for example, adapt and tailor standard inference/learning algorithms to your problem, and do a thorough performance analysis.

You can also find some project ideas below.

Project A: Haplotype blocking and genetic demographic inference (see Eric for more details)

Genetic polymorphisms such as SNPs and microsatellites carry important information about human evolution and disease propensity. One of the interesting problems in this area is to infer the haplotypes of a long sequence of ambiguous genotypes based on haplotypes of small overlapping regions. In this project we want to build a haplotype assembler using a partition-ligation scheme and/or a tiling scheme to stitch together short haplotypes inferred by an off-the-shelf haplotype inference algorithm; and then, after determining long haplotypes of a long stretch of markers, find the best block structure using dynamic programming and information-theoretic scoring. The resulting blocks will provide essential markers for mapping disease genes and for inferring the evolutionary history of given populations.
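As a rough illustration of the block-finding step, the sketch below partitions a set of already-phased haplotypes into contiguous blocks by dynamic programming. The score is an assumption made for this example (a simplified description length: the bits needed to encode each haplotype given a per-block codebook, plus the bits needed to write the codebook itself); it is not the scoring used in the references, and the function names are hypothetical.

```python
import math
from collections import Counter


def block_cost(haplotypes, start, end):
    """Simplified description length (in bits) of markers [start, end) as one block.

    Assumed score: n * entropy of the haplotype patterns in the block, plus
    one bit per marker for every distinct pattern stored in the codebook.
    """
    n = len(haplotypes)
    counts = Counter(h[start:end] for h in haplotypes)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    codebook_bits = len(counts) * (end - start)
    return n * entropy + codebook_bits


def best_blocks(haplotypes):
    """Return (blocks, total_cost) minimizing the summed block cost.

    haplotypes: list of equal-length strings over {'0', '1'}.
    blocks: list of (start, end) half-open marker intervals.
    """
    m = len(haplotypes[0])
    best = [0.0] + [math.inf] * m  # best[j]: minimum cost of covering markers [0, j)
    back = [0] * (m + 1)           # back[j]: start of the last block in that optimum
    for j in range(1, m + 1):
        for i in range(j):
            cost = best[i] + block_cost(haplotypes, i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i
    blocks, j = [], m
    while j > 0:
        blocks.append((back[j], j))
        j = back[j]
    return blocks[::-1], best[m]
```

For four haplotypes such as "0000", "0011", "1100" and "1111", this score prefers two blocks of two markers each, since within each pair of markers only two patterns occur. The double loop evaluates every candidate block, so the sketch is only practical for short marker stretches.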

Reference:

Niu et al. Bayesian Haplotype Inference for Multiple Linked Single-Nucleotide Polymorphisms, Am J Hum Genet. 2006 Jan;78(1):174

Anderson EC, Novembre J: Finding Haplotype Block Boundaries by Using the Minimum-Description-Length Principle. American Journal of Human Genetics 2003, 73:336-354.

Project B: Discovering network motifs and recurring subgraphs from sequences of biological networks (see Eric for more details)

Network motifs refer to recurring subgraphs and connectivity patterns in a single network or across multiple networks. They usually represent certain pathway components and bio-regulatory mechanisms, and their occurrence profiles are often unique to different networks and imply intrinsic functionalities of the biological networks. Early research in this area focused on searching for small motifs in a single network. In this project we want to develop algorithms for searching for large and possibly overlapping subgraphs recurring over multiple graphs. We will explore algorithms for constructing multiple networks, and graph-theoretic approaches to mine such networks for motifs.
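One very simple starting point for finding subgraphs that recur over multiple graphs is sketched below: keep the edges that appear in at least a given number of networks, then report the connected components of the resulting graph. This is only a baseline chosen for illustration, not the method of the references; the function name and the min_support parameter are hypothetical, and networks are assumed to be undirected edge lists over shared node names.

```python
from collections import Counter


def recurrent_components(networks, min_support):
    """Connected components formed by edges present in >= min_support networks.

    networks: list of edge lists; each edge is a pair of node names.
    Edges are treated as undirected and self-loops are ignored.
    """
    counts = Counter()
    for edges in networks:
        # Count each undirected edge at most once per network.
        counts.update({frozenset(e) for e in edges if e[0] != e[1]})

    adjacency = {}
    for edge, count in counts.items():
        if count >= min_support:
            a, b = tuple(edge)
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    seen, components = set(), []
    for node in sorted(adjacency):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:  # depth-first traversal of one component
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(sorted(component))
    return components
```

Connected components of recurrent edges cannot separate overlapping subgraphs, which is one of the things the project asks for, so this is best read as a sanity-check baseline.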

Reference:

Hu H, Yan X, Huang Y, Han J, Zhou XJ (2005) Mining coherent dense subgraphs across massive biological networks for functional discovery. Bioinformatics (ISMB 2005), Vol. 21 Suppl. 1 2005, pages 213-221.

Zhou XJ, Kao MJ, Huang H, Wong A, Nunez-Iglesias J, Primig M, Aparicio OM, Finch CE, Morgan TE, Wong WH (2005) Functional annotation and network reconstruction through cross-platform integration of microarray data. Nature Biotechnology 2005 Feb;23(2):238-43.

Project C: Protein function prediction from interaction network using graph theoretic and statistical latent-space modeling approaches (see Eric for more details)

Local and global connectivities of an element in a network are often indicative of its functions. Predictions based on connectivity go beyond the traditional approaches that rely on the physical and sequence properties of a biological element: they combine such properties with the element's interaction contexts in biological processes, as reflected in the network, and they can often be context-specific. In this project you will explore algorithms to infer biological functions of proteins from protein-protein interaction networks and other protein attributes.
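A simple graph-based baseline for this task is neighbor majority voting: an unannotated protein is assigned the function that is most common among its annotated interaction partners. The sketch below is an illustration under that assumption only (one function label per protein, ties broken alphabetically); it is not the latent-space model of the reference, and the function name is hypothetical.

```python
from collections import Counter


def predict_functions(edges, known):
    """Predict a function for each unannotated protein by neighbor majority vote.

    edges: list of (protein_a, protein_b) interactions, treated as undirected.
    known: dict mapping annotated proteins to a single function label.
    Returns a dict for proteins with at least one annotated neighbor.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    predictions = {}
    for protein in sorted(neighbors):
        if protein in known:
            continue
        votes = Counter(known[n] for n in neighbors[protein] if n in known)
        if votes:
            # Highest vote count wins; ties are broken alphabetically.
            predictions[protein] = min(votes, key=lambda f: (-votes[f], f))
    return predictions
```

Such a baseline uses only direct neighbors, so it gives a useful point of comparison for methods that exploit global connectivity or additional protein attributes.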

E. Airoldi, D. Blei, E.P. Xing and S. Fienberg, A Latent Mixed Membership Model for Relational Data. Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD-2005).

Project D: Genetic instability (this is an open research project; if you are interested, come to Eric Xing to discuss details):

Array CGH data are sequences of fluorescence measurements reflecting the DNA copy numbers along the chromosome. The measurements are continuous and can be highly distorted by noise in a complex, non-uniform fashion. Jane Fridlyand proposed a hidden Markov model approach to the analysis of array CGH data, in which she implemented an HMM for estimating the CGH copy number. But this model is very restricted.

A switching Hidden Process Model (HPM) assumes that the hybridization process on each chromosomal region with uniform copy number would ideally follow a standard copy-number-specific linear dynamic model (LDM) [West and Harrison, 1999]. To accommodate outliers and alternative hybridization and signaling dynamics, a mixture of LDMs can be used to model a hidden process that generates fluorescence signals from a chromosomal region with a specific copy number. For a chromosome with stochastic regional amplifications and deletions, a switching HPM assumes that another discrete hidden process is responsible for selecting the corresponding copy-number-specific HPM at each region to generate the signals. The switching HPM is essentially a special dynamic Bayesian network that allows one to infer the spatio-temporally specific hidden dynamics underlying an observation stream and the ensuing segmentation of the stream. It is a generalization of Ghahramani's switching state-space model (SSSM), which can be understood as modeling each hidden process using a plain Kalman filter (KF). In this project you are asked to formulate this model and implement a variational algorithm for inference with such a model.

In the dataset (log2.ratio.ex), there are two columns of numbers, corresponding to two sample sources. Please read the original paper to get a more detailed understanding of the data. You can choose the number of states you feel is necessary after inspecting the plots of the points.
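To make the HMM baseline concrete, here is a minimal sketch of Viterbi decoding for an HMM with Gaussian emissions, where each hidden state stands for a copy-number level with its own mean log2 ratio. Everything specific here is an assumption for illustration (a shared emission standard deviation, a uniform initial distribution, a single "stay" probability, fixed rather than estimated parameters); it is not Fridlyand's model.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution N(mean, sd^2) at x."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(obs, means, sd, stay_prob):
    """Most likely state path for a Gaussian-emission HMM.

    obs: log2 ratios ordered along the chromosome.
    means: emission mean of each state (at least two states).
    stay_prob: probability of keeping the same state between adjacent probes;
    the remaining mass is split evenly over the other states.
    """
    k = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log((1 - stay_prob) / (k - 1))
    score = [-math.log(k) + gaussian_logpdf(obs[0], m, sd) for m in means]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for j in range(k):
            cands = [score[i] + (log_stay if i == j else log_move) for i in range(k)]
            best_i = max(range(k), key=lambda i: cands[i])
            pointers.append(best_i)
            new_score.append(cands[best_i] + gaussian_logpdf(x, means[j], sd))
        score = new_score
        back.append(pointers)
    state = max(range(k), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With state means for loss, normal and gain, a run of elevated ratios in the middle of a sequence is decoded as a gained segment flanked by normal segments. In a real analysis the means, variance and transition probabilities would be estimated from the data, for example with EM.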
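For orientation, the Kalman filter for the simplest one-dimensional LDM (a random-walk level observed with Gaussian noise) can be sketched as follows. The local-level form and all parameter names are assumptions chosen for illustration; the switching HPM itself, with its mixture of LDMs and discrete switching process, is not implemented here.

```python
def kalman_filter_1d(obs, level0, var0, process_var, obs_var):
    """Kalman filter for a local-level model.

    Assumed model:
        x_t = x_{t-1} + w_t,   w_t ~ N(0, process_var)
        y_t = x_t + v_t,       v_t ~ N(0, obs_var)
    with prior x_0 ~ N(level0, var0).
    Returns a list of (filtered_mean, filtered_variance), one per observation.
    """
    mean, var = level0, var0
    filtered = []
    for y in obs:
        # Predict: the level is carried forward and its uncertainty grows.
        var = var + process_var
        # Update: blend prediction and observation using the Kalman gain.
        gain = var / (var + obs_var)
        mean = mean + gain * (y - mean)
        var = (1 - gain) * var
        filtered.append((mean, var))
    return filtered
```

In a switching model, one such filter per copy-number state would be run and combined, with the discrete process deciding which filter explains each region; that combination is what makes exact inference intractable and motivates the variational algorithm.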

Project E: Dynamic Bayesian networks from time series datasets (please contact Ziv for more details)

Time series expression data measure the expression levels of genes following a specific treatment. For example, following pathogen infection such data can provide insight into the set of genes that are responding to the infection and into the immune response system. Using time series data we would like to learn a graphical model that represents the set of interactions that are employed as part of the response. In this project you will explore ways to use time series datasets for determining the structure and parameters of the regulatory network underlying the observed responses.
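Before fitting a full dynamic Bayesian network, a quick screening step can suggest candidate edges: correlate each gene's expression at time t-1 with every other gene's expression at time t and keep the strongly correlated pairs. The sketch below illustrates only this lag-1 correlation screen, which is an assumption made for this example and not a structure-learning algorithm; the function names and the threshold are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def lagged_edges(series, threshold=0.9):
    """Candidate regulator -> target edges from lag-1 correlations.

    series: dict mapping gene name to expression values at evenly spaced times.
    An edge (regulator, target, r) is kept when the regulator at time t-1
    correlates with the target at time t with |r| >= threshold.
    """
    edges = []
    for regulator, x in series.items():
        for target, y in series.items():
            if regulator == target:
                continue
            r = pearson(x[:-1], y[1:])
            if abs(r) >= threshold:
                edges.append((regulator, target, r))
    return edges
```

Correlation screens ignore indirect effects and uneven sampling, both common in expression time series, so the resulting edges are at most candidates to be scored by a proper DBN learning procedure.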

While many methods have been suggested for classifying static expression data, I am only familiar with two methods that attempt to classify time series expression data. The challenge here is to take advantage of the temporal information while predicting the outcome of patient response. There are many possible applications, including follow-up analysis, early diagnosis, and monitoring of transplant patients.

Identifying orthologous genes is a challenging problem. There are a number of different sequence-based approaches proposed for this task, but so far none of these approaches achieves perfect or near-perfect results. Here we will study whether expression data can help resolve disagreements between the approaches, leading to a more accurate set of orthologous genes.

Using recent detailed maps of transcription factor binding sites, researchers are studying global regulatory networks. However, most of the links in these networks are utilized at different time points or under different conditions. In this project we will try to determine whether the structure of specific response networks is similar to or different from the global structure, and what we can learn about the way regulatory networks operate from these structures.

Recent experiments have identified many new protein-protein interactions. While the quality of this data is not great, it does serve as a useful source for integration with other available datasets. In this project you will explore the relationship between the interacting proteins and other types of high-throughput data (such as expression or binding). Specifically, it is interesting to see whether aspects that cannot be inferred from the current interaction data (such as pathways) can be determined by using these complementary data sources.