Overview of the Biological Language Modeling Project

The Biological Language Modeling project is based on the assumption that protein sequences from different organisms may be viewed as texts written in different languages. The mapping of protein sequences to their structure, dynamics and function then becomes analogous to the mapping of words to meaning in natural languages. This analogy can be exploited by applying statistical language modeling and text classification techniques to biological sequences, thereby generating testable hypotheses about the fundamental building blocks of the "protein sequence language". The biology-language analogy enables novel applications of language technologies to the biology domain, although these applications overlap to a great extent with existing computational biology and bioinformatics applications.
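
As a rough illustration of what applying statistical language modeling to protein sequences can look like, the sketch below counts amino acid n-grams and scores a sequence under a bigram model with add-alpha smoothing. It is not the project's own code: the function names, the 20-letter alphabet constant and the smoothing choice are assumptions made for this example, and the sequences are toy strings rather than real proteins.

```python
from collections import Counter
import math

# Assumption: the standard 20-letter amino acid alphabet is the vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ngram_counts(sequence, n):
    """Count all overlapping n-grams ("words" of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def bigram_log_prob(sequence, training_sequences, alpha=1.0):
    """Log-probability of a sequence under an add-alpha smoothed bigram model.

    The first residue is treated as given; each later residue is scored
    conditioned on the residue immediately before it.
    """
    context_counts = Counter()
    bigram_counts = Counter()
    for s in training_sequences:
        context_counts.update(s[:-1])  # every residue that has a successor
        bigram_counts.update(ngram_counts(s, 2))

    vocab_size = len(AMINO_ACIDS)
    log_p = 0.0
    for i in range(len(sequence) - 1):
        context = sequence[i]
        pair = sequence[i:i + 2]
        numerator = bigram_counts[pair] + alpha
        denominator = context_counts[context] + alpha * vocab_size
        log_p += math.log(numerator / denominator)
    return log_p


if __name__ == "__main__":
    toy_training = ["MKMK"]  # toy string, not a real protein
    print(ngram_counts("MKMKA", 2))
    print(bigram_log_prob("MK", toy_training))
```

A model trained on one protein family would be expected to assign higher log-probabilities to members of that family than to unrelated sequences, which is the basic idea behind using such models for classification.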

Goal 1. Integration of results from linguistic analyses with those from structural dynamics to characterize key residues that control enzymatic functions
Principal Investigators	Ivet Bahar, Michele Loewen, Cathy Costello, Jaime Carbonell, Roni Rosenfeld
Students	Alpay Temiz, Lee-Wei Yang
Postdocs	Dror Tobi, Shyamasri Biswas
i. Computational assessment of the hinge-bending and other key mechanical sites that coordinate the functional dynamics of representative sets of enzymes, using the recently compiled enzyme classifications (Thornton and coworkers) and network models of protein structure and dynamics.
ii. Systematic analysis of the structural regions/motifs that are implicated in key mechanical roles, using graph-theoretical, computer vision and linguistic tools.
iii. Experimental characterization of the mechanism of action of particular classes of enzymes as case studies (e.g. dioxygenase enzymes, in collaboration with NRC).
iv. IMPACT: Discover new potential ligand-binding sites.
v. EVALUATION: Test predictions of key mechanical sites against known inhibitor binding sites, to elucidate the possible coupling between mechanical and chemical activities, and hypothesize potential new binding sites to be tested experimentally.

Goal 2. Deep understanding of transmembrane protein folding and function
Principal Investigators	Judith Klein-Seetharaman, Gobind Khorana, Ivet Bahar, Raj Reddy, Hagai Meirovitch
Students	Basak Isin, Madhavi Ganapathiraju, Jiangbo Miao, Naveena Yanamala
Postdocs	AJ Rader, Harpreet Kaur Dhiman, David Man
i. Discern property conservation of amino acids (e.g. discriminating hydrophobic amino acids facing the outside vs. the inside of transmembrane helices).
ii. Identify the most salient motifs for structural stability and for governing conformational changes, building on our existing results for rhodopsin and leveraging context-dependent statistical grammars or similar approaches.
iii. Discriminate between functionally relevant and structurally relevant motifs.
iv. IMPACT: New strategies to help diagnose and eventually treat conformational diseases associated with transmembrane proteins.
v. EVALUATION: Site-directed mutagenesis experiments.

Goal 3. Deep understanding of the relation between β-sheet formation and the underlying primary sequences
Principal Investigators	Jaime Carbonell, Jonathan King, Vanathi Gopalakrishnan, Judith Klein-Seetharaman
Students	Yan Liu, Welkin Pope, Ryan Simkovsky
Postdocs	Peter Weigele
i. Extension to supersecondary structures via long-range probabilistic linguistic models.
ii. Explore β-sheet transmembrane proteins (incl. β-barrels).
iii. BIOLOGICAL IMPACT: Predictive models for β-sheets and selected β-sheet based supersecondary structures.
iv. BIOLOGICAL EVALUATION: Reduce prediction errors by up to 50% for existing prediction tasks, and predict structures for which there are no present prediction results.
v. COMPUTATIONAL IMPACT: New predictive algorithms and techniques such as multi-layer conditional random fields (CRFs).
vi. COMPUTATIONAL EVALUATION: Performance measured on new problems in relations text extraction and understanding.

Goal 4. Discovery of vocabulary for conservation pressure in protein evolution
Principal Investigators	Roni Rosenfeld, Judith Klein-Seetharaman, Vanathi Gopalakrishnan, Michele Loewen, Jonathan King
Students	Jerry Zhu, Yong Lu, Oznur Tastan
Postdocs
i. Based on multi-dimensional position-conditional properties.
ii. Expansion from HIV and GPCR families to other families including kinases and nuclear receptors.
iii. Further expansion to broader datasets including all possible protein families, and structurally homologous protein families.
iv. IMPACT: Detect very remote homologs.
v. EVALUATION: Identification of heretofore unknown homologs, to be validated via biological experimentation.

Goal 5. Algorithmic solutions for the discovery of large-scale gene regulatory networks and for function prediction
Principal Investigators	Yiming Yang, Eric Xing
Students	Fan Li
Postdocs
i. Developing new algorithms to extract multi-type expressions of genes from microarray data, DNA sequences, protein-protein interaction databases and gene ontology, and to induce regulatory networks based on these multiple types of evidence.
ii. Adapting hierarchical classification techniques to motif/gene identification and function analysis, based on multi-abstraction-level representations of protein sequences.
iii. Improving the efficiency of machine learning algorithms for automated induction of very large regulatory networks, e.g., with thousands of genes.
iv. IMPACT: The resulting regulatory networks and gene/motif classes would help biologists to discover new and interesting patterns, e.g. leading to a deeper understanding of the mechanisms regulating oncogenes and their potential disruption.
v. EVALUATION: Identification of regulatory networks and gene classes (functions), to be validated first via biology databases (TRANSFAC, GO, SCPD...), and later possibly via collaboration with personnel in the U Pitt Cancer Center.

Goal 6. Protein-protein interaction prediction

Principal Investigators Ziv Bar-Joseph, Judith Klein-Seetharaman, Michele Loewen

Students Yanjun Qi

Postdocs

i. Develop new approaches and explore all available features for protein-protein interaction prediction

ii. IMPACT: comprehensive prediction of protein pairs in entire organisms, starting with yeast, but eventually for human

iii. EVALUATION: precision and recall optimization of existing datasets, small-scale validation with specific pathways and wet-lab experiments
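
For readers unfamiliar with the evaluation measures named above, the following sketch shows one way precision and recall could be computed for a set of predicted interacting pairs against a reference set. It is illustrative only: the function name and the placeholder protein identifiers (P1, P2, ...) are assumptions for this example and do not come from the project's datasets.

```python
def precision_recall(predicted_pairs, known_pairs):
    """Precision and recall of predicted protein pairs against known pairs.

    Interactions are treated as unordered, so ("P1", "P2") and
    ("P2", "P1") count as the same pair.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    known = {frozenset(pair) for pair in known_pairs}
    true_positives = len(predicted & known)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


if __name__ == "__main__":
    # Placeholder identifiers, not real proteins.
    predicted = [("P1", "P2"), ("P3", "P2"), ("P4", "P5")]
    known = [("P2", "P1"), ("P2", "P3"), ("P6", "P7"), ("P8", "P9")]
    print(precision_recall(predicted, known))
```

Precision is the fraction of predicted pairs that are known interactions, and recall is the fraction of known interactions that were predicted; a predictor is usually tuned to trade one against the other.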

Goal 7. Development of user-friendly BLM service hub with website interface

Principal Investigators Alain Rappaport, Judith Klein-Seetharaman, Hassan Karimi

System Administrator Mark Holliman

Students Madhavi Ganapathiraju, Yanjun Qi, Mitch Saltykov

i. Develop a consistent interface for publishing and invoking services

ii. Develop tutorials for adding, publishing and invoking services in specific computational biology contexts

iii. Organize workshops to train users internally and externally

iv. IMPACT: effective dissemination and integration into computational biology community

v. EVALUATION: Integration of BLM tools in external projects

For questions or comments please contact

Judith Klein-Seetharaman
Assistant Professor
jks33@pitt.edu

Department of Pharmacology
University of Pittsburgh Medical School
Biomedical Science Tower E1355
Pittsburgh, PA 15261
Tel: 412-383-7325
Fax: 412-648-1945

OR

Language Technologies Institute
Carnegie Mellon University
School of Computer Science
Smith Hall 225
Pittsburgh, PA 15213
Tel: 412-268-8249

Last updated June 08, 2005

Supported by the National Science Foundation
