|
Overview of the Biological Language Modeling Project
The Biological Language Modeling project is based on the assumption that protein sequences from
different organisms may be viewed as texts written in different languages.
The mapping of protein sequences to their structure, dynamics and function
then becomes analogous to the mapping of words to meaning in natural
languages. This analogy can be exploited by application of statistical
language modeling and text classification techniques to biological
sequences, thereby generating testable hypotheses regarding the fundamental
building blocks of "protein sequence language". The biology-language
analogy enables novel applications of language technologies to the biology
domain, but overlaps to a great extent with other existing
computational biology/bioinformatics applications.
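As a rough illustration of what applying statistical language modeling to biological sequences can mean in practice, the sketch below treats each amino acid as a "word" and counts overlapping n-grams across a set of sequences. It is a minimal example and not the project's own code: the function names and the toy sequences are invented for illustration.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count overlapping n-grams (runs of n residues) in one protein sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def ngram_frequencies(sequences, n):
    """Pool n-gram counts over several sequences and return relative frequencies."""
    total = Counter()
    for seq in sequences:
        total.update(ngram_counts(seq, n))
    grand_total = sum(total.values())
    if grand_total == 0:
        return {}
    return {gram: count / grand_total for gram, count in total.items()}


# Toy sequences in one-letter amino acid code (hypothetical, for illustration only).
toy_sequences = ["MKTLLVAG", "MKTAAVAG"]
bigram_freqs = ngram_frequencies(toy_sequences, 2)
```

Comparing such frequency tables between organisms or protein families is one simple way to look for candidate "vocabulary" differences; real analyses would need far larger datasets and appropriate statistical controls.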
|
|
Goal 1. Integration of results from linguistic analyses with those from
structural dynamics to characterize key residues that control
enzymatic functions |
Principal Investigators |
Ivet Bahar, Michele Loewen, Cathy Costello, Jaime Carbonell, Roni Rosenfeld |
Students |
Alpay Temiz, Lee-Wei Yang |
Postdocs |
Dror Tobi, Shyamasri Biswas |
i. Computational assessment of the hinge-bending and other key mechanical sites that coordinate the functional dynamics of representative sets of enzymes, using the enzyme classifications recently compiled by Thornton and coworkers and network models of protein structure and dynamics.
ii. Systematic analysis of the structural regions/motifs that are implicated in key mechanical roles, using graph-theoretical, computer vision and linguistic tools.
iii. Experimental characterization of the mechanism of action of particular classes of enzymes as case studies (e.g. dioxygenase enzymes, in collaboration with NRC).
iv. IMPACT: Discover new potential ligand-binding sites.
v. EVALUATION: Test predictions of key mechanical sites against known inhibitor binding sites, to elucidate the possible coupling between mechanical and chemical activities, and hypothesize potential new binding sites to be tested experimentally.
|
|
|
Goal 2. Deep understanding of transmembrane protein folding and function
|
Principal Investigators |
Judith Klein-Seetharaman, Gobind Khorana, Ivet Bahar, Raj Reddy, Hagai Meirovitch |
Students |
Basak Isin, Madhavi Ganapathiraju, Jiangbo Miao, Naveena Yanamala |
Postdocs |
AJ Rader, Harpreet Kaur Dhiman, David Man |
i. Discern property conservation of amino acids (e.g. discriminating hydrophobic amino acids facing the outside vs. the inside of transmembrane helices).
ii. Identify the most salient motifs for structural stability and for governing conformational changes, building on our existing results for rhodopsin and leveraging context-dependent statistical grammars or similar approaches.
iii. Discriminate between functionally relevant and structurally relevant motifs.
iv. IMPACT: New strategies to help diagnose and eventually treat conformational diseases associated with transmembrane proteins.
v. EVALUATION: Site-directed mutagenesis experiments.
|
|
Goal 3. Deep understanding of the relation between β-sheet formation and
the underlying primary sequences
|
Principal Investigators |
Jaime Carbonell, Jonathan King, Vanathi Gopalakrishnan, Judith Klein-Seetharaman |
Students |
Yan Liu, Welkin Pope, Ryan Simkovsky |
Postdocs |
Peter Weigele |
i. Extension to supersecondary structures via long-range probabilistic linguistic models
ii. Explore β-sheet transmembrane proteins (including β-barrels)
iii. BIOLOGICAL IMPACT: Predictive models for β-sheets and selected β-sheet-based supersecondary structures
iv. BIOLOGICAL EVALUATION: Reduce prediction errors by up to 50% for existing prediction tasks, and predict structures for which there are no present prediction results.
v. COMPUTATIONAL IMPACT: New predictive algorithms and techniques such as multi-layer conditional random fields (CRFs)
vi. COMPUTATIONAL EVALUATION: Performance measured on new problems in relation extraction from text and text understanding
|
|
Goal 4. Discovery of vocabulary for conservation pressure in protein
evolution
|
Principal Investigators |
Roni Rosenfeld, Judith Klein-Seetharaman, Vanathi Gopalakrishnan, Michele Loewen, Jonathan King |
Students |
Jerry Zhu, Yong Lu, Oznur Tastan |
Postdocs |
|
i. Based on multi-dimensional position-conditional properties
ii. Expansion from HIV and GPCR families to other families including kinases and nuclear receptors
iii. Further expansion to broader datasets including all possible protein families, and structurally homologous protein families
iv. IMPACT: Detect very remote homologs.
v. EVALUATION: Identification of heretofore unknown homologs, to be validated via biological experimentation.
|
|
Goal 5. Algorithmic solutions for the discovery of large-scale gene
regulatory networks and for function prediction
|
Principal Investigators |
Yiming Yang, Eric Xing |
Students |
Fan Li |
Postdocs |
|
i. Developing new algorithms to extract multi-type expressions of genes from microarray data, DNA sequences, protein-protein interaction databases and gene ontology, and to induce regulatory networks based on these multiple types of evidence.
ii. Adapting hierarchical classification techniques to motif/gene identification and function analysis based on multi-abstraction-level representations of protein sequences.
iii. Improving the efficiency of machine learning algorithms for automated induction of very large regulatory networks, e.g., with thousands of genes.
iv. IMPACT: The resulting regulatory networks and gene/motif classes would help biologists to discover new and interesting patterns, e.g. leading to a deeper understanding of the mechanisms regulating oncogenes and their potential disruption.
v. EVALUATION: Identification of regulatory networks and gene classes (functions), to be validated first via biology databases (TRANSFAC, GO, SCPD...), and later possibly via collaboration with personnel in the U Pitt Cancer Center.
|
|
Goal 6. Protein-protein interaction prediction
|
Principal Investigators |
Ziv Bar-Joseph, Judith Klein-Seetharaman, Michele Loewen |
Students |
Yanjun Qi |
Postdocs |
|
i. Develop new approaches and explore all available features for protein-protein interaction prediction.
ii. IMPACT: Comprehensive prediction of interacting protein pairs in entire organisms, starting with yeast and eventually extending to human.
iii. EVALUATION: Precision and recall optimization on existing datasets; small-scale validation with specific pathways and wet-lab experiments.
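The precision and recall evaluation mentioned above can be sketched as follows. This is a minimal example under stated assumptions, not the project's evaluation code: the function name and the protein identifiers are hypothetical, and an interaction is treated as an unordered pair so that (A, B) and (B, A) count as the same prediction.

```python
def precision_recall(predicted_pairs, known_pairs):
    """Compute precision and recall of predicted interactions against a reference set.

    Pairs are treated as unordered: ("P1", "P2") equals ("P2", "P1").
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    known = {frozenset(pair) for pair in known_pairs}
    true_positives = len(predicted & known)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical protein identifiers, for illustration only.
predicted = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
known = [("P2", "P1"), ("P3", "P4"), ("P4", "P5"), ("P5", "P6")]
precision, recall = precision_recall(predicted, known)
```

Here two of the three predictions appear in the reference set of four known interactions. In practice, reference sets of known interactions are incomplete, so a prediction counted as a false positive may simply be an interaction that has not been observed yet.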
|
|
Goal 7. Development of a user-friendly BLM service hub with website interface
|
Principal Investigators |
Alain Rappaport, Judith Klein-Seetharaman, Hassan Karimi |
System Administrator |
Mark Holliman |
Students |
Madhavi Ganapathiraju, Yanjun Qi, Mitch Saltykov |
i. Develop a consistent interface for publishing and invoking services.
ii. Develop tutorials for adding, publishing and invoking services in specific computational biology contexts.
iii. Organize workshops to train users internally and externally.
iv. IMPACT: Effective dissemination and integration into the computational biology community.
v. EVALUATION: Integration of BLM tools in external projects.
|
|
|
|