Reference: Prediction of protein-protein interactions from sequence
A. Sequence-based methods
1. Presence or absence of genes in related species
- based on functional need for two proteins to be present at the same time
2. Conservation of gene neighborhood
- operons
Reference: Overbeek et al. (1999) The use of gene clusters to infer functional coupling. PNAS 96, 2896-2901.
- if conserved across species, strong indicator
3. Conservation of gene order
Reference: Dandekar et al. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324-328.
4. Gene fusion events
- protein domains as part of a single chain or as independent proteins
References:
5. Mirrortrees and Phylogenetic Profiles
- interacting protein pairs co-evolve, e.g. insulin and its receptors
- the phylogenetic trees of interacting proteins show a greater degree of similarity (symmetry) than those of non-interacting proteins
Reference: Pellegrini et al. (1999) Assigning protein functions by comparative gene analysis: protein phylogenetic profiles. PNAS 96, 4285-4288.
6. Tree determinants
- family patterns of conservation in multiple sequence alignments
- correlated mutations highlight the interaction sites in binding proteins
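A minimal sketch of two ideas from this section: comparing presence/absence profiles across genomes (items 1 and 5) and scoring the similarity of two protein families' trees (item 5). The function names, the match-fraction score and the use of a Pearson correlation over pairwise species distances are illustrative assumptions, not the exact procedures of the cited papers.

```python
from itertools import combinations
from math import sqrt


def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which two genes agree on presence (1) or absence (0)."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mirrortree_score(dist_a, dist_b, species):
    """Correlate the pairwise species distances of two protein families.

    dist_a and dist_b map frozenset({species_1, species_2}) to an
    evolutionary distance (hypothetical input format for this sketch).
    """
    pairs = [frozenset(p) for p in combinations(species, 2)]
    return pearson([dist_a[p] for p in pairs], [dist_b[p] for p in pairs])


# Toy data, invented for illustration only.
species = ["sp1", "sp2", "sp3"]
family_a = {
    frozenset(("sp1", "sp2")): 0.10,
    frozenset(("sp1", "sp3")): 0.40,
    frozenset(("sp2", "sp3")): 0.35,
}
# Family B evolves twice as fast but with the same tree shape.
family_b = {pair: 2 * d for pair, d in family_a.items()}

print(profile_similarity([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))
print(mirrortree_score(family_a, family_b, species))
```

A high profile similarity or a tree correlation close to 1 would be read as a hint of functional coupling or co-evolution, not as proof of a physical interaction.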
B. Learning systems
1. Correlated mutations
- correspond to compensatory mutations: a mutation in one protein is stabilized by changes in the other
- predict proximal pairs of residues
- to discriminate structure models derived by threading
- ab initio folding simulations
- interprotein correlated mutations versus intraprotein correlated mutations
- test set: two-domain proteins
Reference: Pazos et al. (1997) JMB
2. Statistical composition of interacting surfaces in terms of residue types (polarity, charge, etc.)
Reviewed in: Jones and Thornton (1996) PNAS
Example: Patch analysis (Jones and Thornton (1997) JMB 272, 133-143)
each residue patch (~20-60 residues) is analyzed for six parameters:
1. solvation potential
2. residue interface propensity
3. hydrophobicity
4. planarity
5. protrusion
6. accessible surface area
Comparison of inter- and intra-protein interfaces: very similar
3. Structure of the surface
4. Use 6+7 as input for neural networks, giving 70% accuracy
5. MULTIPROSPECTOR - multimeric threading
- like traditional threading
- establish a database of proteins that interact
- thread each protein alone and in complex
- from this, establish empirical indicators based on the threading Z-score and the magnitude of the interfacial energy
6. Support vector machine classification
- use DIP (Database of Interacting Proteins) interaction pairs as the learning dataset
- feature representation: the amino acid sequence is encoded by charge, hydrophobicity and surface tension, based on the observation that sequential hydrophilicity profiles are sensitive descriptors of local interaction sites
Reference: Bock and Gough (2001) Bioinformatics
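A minimal sketch of how a protein pair might be turned into a fixed-length feature vector for a classifier such as a support vector machine. Only hydrophobicity is shown, and the Kyte-Doolittle scale, the window size, the binning step and all function names are assumptions made for illustration; the cited paper also uses charge and surface tension, and its exact encoding is not reproduced here. Training the classifier itself is left out.

```python
# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, window=5):
    """Sliding-window mean hydropathy along a protein sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if len(values) < window:
        raise ValueError("sequence is shorter than the window")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def fixed_length(profile, n_bins=4):
    """Average a variable-length profile into n_bins roughly equal chunks."""
    if len(profile) < n_bins:
        raise ValueError("profile is shorter than the number of bins")
    bins = []
    for b in range(n_bins):
        start = b * len(profile) // n_bins
        end = (b + 1) * len(profile) // n_bins
        chunk = profile[start:end]
        bins.append(sum(chunk) / len(chunk))
    return bins


def pair_features(seq_a, seq_b, window=5, n_bins=4):
    """Concatenate the binned profiles of both proteins into one vector."""
    return (
        fixed_length(hydrophobicity_profile(seq_a, window), n_bins)
        + fixed_length(hydrophobicity_profile(seq_b, window), n_bins)
    )


# Two made-up sequences; a real training set would take pairs from DIP.
features = pair_features("MKTAYIAKQR", "GAVLIMFWPS")
print(len(features), features)
```

Each interacting pair from the learning dataset would yield one such vector labeled as positive. How negative examples are chosen is a separate design decision that this sketch does not address.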
C. Homology-based prediction
use the experimentally determined protein-protein interactions from one organism to predict the protein-protein interaction map in another organism based on domain homologies
IDPP = interacting domain profile pair
References:
Wojcik and Schachter (2001) Bioinformatics
Comparison to experimental data:
Database of Interacting Proteins (DIP)
1138 of 2865 pairs have "real" counterparts
PSI-BLAST: 1215 of 1781 pairs have "real" counterparts
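The two counts above are easier to compare as fractions. A short sketch of the arithmetic follows; the helper name is hypothetical, and the first count is left unlabeled because the notes do not say which method produced it.

```python
def fraction_confirmed(confirmed, predicted):
    """Share of predicted pairs that have an experimentally observed counterpart."""
    return confirmed / predicted


counts = {
    "first count (method not labeled in the notes)": (1138, 2865),
    "PSI-BLAST": (1215, 1781),
}

for label, (confirmed, predicted) in counts.items():
    share = fraction_confirmed(confirmed, predicted)
    print(f"{label}: {confirmed}/{predicted} = {share:.1%}")

# first count (method not labeled in the notes): 1138/2865 = 39.7%
# PSI-BLAST: 1215/1781 = 68.2%
```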