Topics in Bioinformatics


Reference: Prediction of protein-protein interactions from sequence

A. Sequence based methods

1. Presence or absence of genes in related species

- based on functional need for two proteins to be present at the same time
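The presence/absence idea can be sketched as a comparison of binary profiles across genomes. This is a minimal illustration only: the gene names and profiles are made up, and the Jaccard similarity is one possible choice of measure, not one taken from the reference.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def rank_partners(query, profiles):
    """Rank all other genes by profile similarity to the query gene."""
    scores = [
        (gene, jaccard(profiles[query], profile))
        for gene, profile in profiles.items()
        if gene != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical profiles: one entry per genome, 1 = homolog present.
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 1, 1),
}

print(rank_partners("geneA", profiles))
```

Genes with identical profiles (here geneA and geneB) rank first as candidate functional partners.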

2. Conservation of gene neighborhood

- operons

Reference: Overbeek et al. (1999) The use of gene clusters to infer functional coupling. PNAS 96, 2896-2901.

- if conserved across species, strong indicator

3. Conservation of Gene Order

Reference: Dandekar et al. (1998) Conservation of gene order: a fingerprint of proteins that physically interact. Trends Biochem. Sci. 23, 324-328.
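Conserved neighborhood or gene order can be illustrated by counting how often two genes are adjacent across genomes. This is a rough sketch under simplifying assumptions: genomes are given as ordered lists of ortholog identifiers, all names are hypothetical, and the `min_genomes` cutoff is an arbitrary placeholder.

```python
from collections import Counter


def adjacent_pairs(gene_order):
    """Unordered pairs of genes that are direct neighbors in one genome."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbors(genomes, min_genomes=2):
    """Gene pairs that are adjacent in at least `min_genomes` genomes."""
    counts = Counter()
    for order in genomes.values():
        counts.update(adjacent_pairs(order))
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical gene orders (ortholog identifiers) in three species.
genomes = {
    "species1": ["geneA", "geneB", "geneC", "geneD"],
    "species2": ["geneD", "geneB", "geneA", "geneE"],
    "species3": ["geneA", "geneB", "geneE", "geneC"],
}

print(conserved_neighbors(genomes))
```

Only the geneA/geneB pair stays adjacent in more than one genome, so it is the only pair reported.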

4. Gene fusion events

- protein domains as part of a single chain or as independent proteins

References:

Marcotte et al. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science

Enright et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature
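The gene fusion idea can be sketched as a search for two separate proteins whose domains appear together in a single chain elsewhere. This is an illustrative sketch only: proteins are represented as sets of domain labels, and all protein and domain names are hypothetical. Real analyses work from sequence similarity searches rather than precomputed domain sets.

```python
from itertools import combinations


def fusion_links(query_proteins, reference_proteins):
    """Find pairs of query proteins linked by a fused reference protein.

    Both arguments map a protein name to its set of domain labels.
    Returns a set of (protein_a, protein_b, fused_protein) triples.
    """
    links = set()
    for a, b in combinations(sorted(query_proteins), 2):
        dom_a, dom_b = query_proteins[a], query_proteins[b]
        if dom_a & dom_b:
            continue  # skip pairs that already share a domain
        for fused, dom_f in reference_proteins.items():
            if dom_a & dom_f and dom_b & dom_f:
                links.add((a, b, fused))
    return links


# Hypothetical data: P1 and P2 are separate chains in the query genome,
# while F1 carries both of their domains in a single chain.
query = {"P1": {"D1"}, "P2": {"D2"}, "P3": {"D3"}}
reference = {"F1": {"D1", "D2"}, "F2": {"D3"}}

print(fusion_links(query, reference))
```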

5. Mirrortrees and Phylogenetic Profiles

- interacting protein pairs co-evolve, e.g. insulin and its receptors

- the corresponding phylogenetic trees of interacting proteins show a greater degree of similarity (symmetry) than those of non-interacting proteins

Reference: Pellegrini et al. (1999) Assigning protein functions by comparative gene analysis: protein phylogenetic profiles. PNAS 96, 4285-4288.
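Tree similarity in the mirrortree sense is commonly approximated by correlating the pairwise distance matrices of two protein families over the same set of species. The sketch below assumes that approach; the species names and distances are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def mirrortree_score(dist_a, dist_b, species):
    """Correlate the inter-species distances of two protein families."""
    pairs = [frozenset(p) for p in combinations(species, 2)]
    return pearson([dist_a[p] for p in pairs], [dist_b[p] for p in pairs])


# Hypothetical evolutionary distances between orthologs in four species.
species = ["s1", "s2", "s3", "s4"]
keys = [frozenset(p) for p in combinations(species, 2)]
dist_a = dict(zip(keys, [0.1, 0.4, 0.5, 0.4, 0.5, 0.2]))
dist_b = {k: 2 * d + 0.1 for k, d in dist_a.items()}   # tree mirrors family A
dist_c = dict(zip(keys, [0.5, 0.1, 0.2, 0.3, 0.1, 0.6]))  # unrelated tree

print(mirrortree_score(dist_a, dist_b, species))
print(mirrortree_score(dist_a, dist_c, species))
```

A score near 1 suggests co-evolving families; the unrelated family gives a much lower score.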

6. Tree determinants

family patterns of conservation in multiple sequence alignments

correlated mutations highlight the interaction sites in binding proteins

B. Learning systems

1. Correlated mutations

correspond to compensatory mutations: a mutation in one protein is stabilized by changes in the other

predict proximal pairs of residues

to discriminate structure models derived by threading

ab initio folding simulations

interprotein correlated mutations versus intraprotein correlated mutations

test set: two-domain proteins

Reference: Pazos et al. (1997) JMB
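Interprotein correlated mutations can be illustrated by scoring pairs of alignment columns, one from each protein, over the same set of species. The sketch below uses mutual information as a simple stand-in; it is not necessarily the correlation measure used in the reference, and the alignments are toy data.

```python
from collections import Counter
from math import log2


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_i)
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * log2((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
        for (a, b), c in p_ij.items()
    )


def interprotein_scores(aln_a, aln_b):
    """Score every (column in A, column in B) pair; rows must match by species."""
    cols_a = list(zip(*aln_a))
    cols_b = list(zip(*aln_b))
    return {
        (i, j): mutual_information(ci, cj)
        for i, ci in enumerate(cols_a)
        for j, cj in enumerate(cols_b)
    }


# Toy alignments: row k of each alignment comes from the same species.
aln_a = ["AKL", "AKL", "GKL", "GEL"]
aln_b = ["DV", "DV", "RV", "RI"]

scores = interprotein_scores(aln_a, aln_b)
best = max(scores, key=scores.get)
print(best, scores[best])
```

The highest-scoring column pair is the candidate for a pair of residues in contact across the interface.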

2. Statistical composition of interacting surfaces in terms of residue types (polarity, charge, etc.)

Reviewed in: Jones and Thornton (1996) PNAS

Example: Patch analysis (Jones and Thornton (1997) JMB 272, 133-143)

each residue patch (~20-60 residues) is analyzed for six parameters:

1. solvation potential

2. residue interface propensity

3. hydrophobicity

4. planarity

5. protrusion

6. accessible surface area
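As an illustration of one of these parameters, a residue interface propensity can be sketched as the log ratio of how often a residue type occurs in interfaces to how often it occurs on the surface as a whole. This count-based version is a simplification assumed here for clarity (the reference may define the propensity in terms of accessible surface area), and the residue lists are invented.

```python
from collections import Counter
from math import log


def interface_propensity(interface_residues, surface_residues):
    """Log ratio of interface frequency to overall surface frequency."""
    f_int = Counter(interface_residues)
    f_surf = Counter(surface_residues)
    n_int, n_surf = len(interface_residues), len(surface_residues)
    return {
        aa: log((f_int[aa] / n_int) / (f_surf[aa] / n_surf))
        for aa in f_int
        if f_surf[aa] > 0
    }


# Hypothetical one-letter residue lists; the interface is part of the surface.
surface = list("LLLLKKKKKKKKEEEEWWSS")
interface = list("LLLWWK")

propensity = interface_propensity(interface, surface)
for aa, value in sorted(propensity.items()):
    print(aa, round(value, 2))
```

Positive values mark residue types enriched in the interface, negative values those that are depleted.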

Comparison of inter- and intra-protein interfaces (Jones et al. 2000): the two are very similar

3. Structure of the surface

4. Use 6+7 as input for neural networks; about 70% accuracy

5. MULTIPROSPECTOR - multimeric threading

- like traditional threading

- establish a database of interacting proteins; thread each protein alone and in complex; from this, derive empirical indicators based on the threading Z-score and the magnitude of the interfacial energy
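The two indicators can be illustrated with a generic Z-score calculation and a simple combined test. This is only a sketch: the sign convention, the background energies, and both thresholds (`z_min`, `e_max`) are placeholder assumptions, not values from MULTIPROSPECTOR.

```python
from statistics import mean, pstdev


def z_score(energy, background_energies):
    """Z-score of a threading energy against a background distribution.

    Assumed sign convention: a lower (better) energy gives a higher Z-score.
    """
    mu = mean(background_energies)
    sigma = pstdev(background_energies)
    return (mu - energy) / sigma


def predicted_to_interact(z, interfacial_energy, z_min=3.0, e_max=-10.0):
    """Combine both indicators; the thresholds are arbitrary placeholders."""
    return z >= z_min and interfacial_energy <= e_max


# Hypothetical energies of the query threaded onto unrelated templates.
background = [-2.0, -6.0, -2.0, -6.0]
z = z_score(-12.0, background)
print(z, predicted_to_interact(z, interfacial_energy=-15.0))
```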

6. Support vector machine classification

use DIP interaction pairs as learning dataset

Feature representation: the amino acid sequence is encoded by charge, hydrophobicity and surface tension, based on the observation that sequential hydrophilicity profiles are sensitive descriptors of local interaction sites.

Reference: Bock and Gough (2001) Bioinformatics
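The feature representation can be sketched as turning each sequence into a fixed-length vector of averaged residue properties, then concatenating the vectors of the two proteins. The sketch makes several assumptions: the Kyte-Doolittle scale stands in for hydrophobicity, charge is a simplified +1/0/-1 assignment, surface tension is omitted because no values are given in these notes, and the binning scheme is illustrative rather than the encoding used in the reference. The resulting vectors would then be passed to an SVM.

```python
# Kyte-Doolittle hydropathy scale (stand-in for the hydrophobicity feature).
HYDROPHOBICITY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
    "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
    "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
    "Y": -1.3, "V": 4.2,
}
# Simplified side-chain charge; every residue not listed counts as neutral.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def encode(seq, n_bins=4):
    """Average each property over n_bins consecutive sequence segments."""
    features = []
    for scale in (HYDROPHOBICITY, CHARGE):
        values = [scale.get(aa, 0) for aa in seq]
        for b in range(n_bins):
            start = b * len(values) // n_bins
            end = (b + 1) * len(values) // n_bins
            chunk = values[start:end]
            features.append(sum(chunk) / len(chunk) if chunk else 0.0)
    return features


def encode_pair(seq_a, seq_b, n_bins=4):
    """Feature vector for a candidate interacting pair."""
    return encode(seq_a, n_bins) + encode(seq_b, n_bins)


# Hypothetical short sequences.
vector = encode_pair("MKKLLDE", "AVILKR")
print(len(vector), vector)
```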

C. Homology-based prediction

use the experimentally determined protein-protein interactions from one organism to predict the protein-protein interaction map of another organism, based on domain homologies

IDPP = interacting domain profile pair

References:

Wojcik et al. (2002) JMB

Wojcik and Schachter (2001) Bioinformatics
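The transfer step can be sketched as mapping each known interaction in the source organism onto all pairs of homologs in the target organism. This is a simplified illustration: the homology mapping is given as a plain dictionary rather than being derived from interacting domain profile pairs, and all protein names are hypothetical.

```python
from itertools import product


def transfer_interactions(source_interactions, homologs):
    """Predict target-organism interactions from source-organism ones.

    source_interactions: iterable of (protein_a, protein_b) pairs.
    homologs: maps a source protein to a list of target-organism homologs.
    """
    predicted = set()
    for a, b in source_interactions:
        for ta, tb in product(homologs.get(a, ()), homologs.get(b, ())):
            predicted.add(frozenset((ta, tb)))
    return predicted


# Hypothetical data: srcC has no homolog, so its interaction is not transferred.
source = [("srcA", "srcB"), ("srcB", "srcC")]
homologs = {"srcA": ["tgtA1", "tgtA2"], "srcB": ["tgtB"]}

print(transfer_interactions(source, homologs))
```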

Comparison to experimental data:

database of interacting proteins

1138 of 2865 pairs have "real" counterparts

PSI-BLAST: 1215 of 1781 pairs have "real" counterparts
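For comparison, the two counts above can be expressed as fractions. The snippet only restates the numbers given in these notes; the label of the first row is a placeholder, since the notes do not name the method for it.

```python
# Counts copied from the notes: (pairs with "real" counterparts, total pairs).
counts = {
    "first row (method not named in the notes)": (1138, 2865),
    "PSI-BLAST": (1215, 1781),
}

percentages = {
    label: round(100 * confirmed / total, 1)
    for label, (confirmed, total) in counts.items()
}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

This gives roughly 39.7% for the first row and 68.2% for PSI-BLAST.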