Topics in Bioinformatics


G-protein coupled receptor classification

Karchin, R., K. Karplus, and D. Haussler, Classifying G-protein coupled receptors with support vector machines. Bioinformatics, 2002. 18(1): p. 147-59.

Task:

GPCR

Class A, Class B, Class C etc. (considered roughly analogous to SCOP families; reference for the use of SVMs for SCOP superfamily classification: Jaakkola et al. (2000))

Level 1 subfamily of Class A: amine, peptide, opsin, viral ...

Level 2 subfamily of Amine: muscarinic, serotonin, histamine, dopamine ...

Approach:

compare different classification methods:

1. simple nearest neighbor approach (BLAST) [original BLAST reference: Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10; WU-BLAST was used here, reference: http://blast.wustl.edu]

2. multiple sequence alignment (HMM): a sophisticated nearest neighbor approach that builds a library of statistical models for the protein classes of interest. The target sequence is scored against all models in the library and classified according to the model that gives the sequence the best score [SAM-T2K HMM: Hughey, R. and A. Krogh, Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci, 1996. 12(2): p. 95-107; Karplus, K., C. Barrett, and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 1998. 14(10): p. 846-56]

3. methods that transform protein sequences into fixed-length feature vectors (SVM): transform each protein sequence into Fisher Score Vector (FSV) space, build a library of SVMs, one for each protein class, and score each test sequence against the entire library as in the HMM experiment (SVMtree is a more efficient way to do the same)
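Methods 2 and 3 share the same decision rule: score the test sequence against every model in the library and assign the class of the best-scoring model. A minimal sketch of that rule follows. The scoring functions are toy stand-ins (simple residue frequencies), not HMM or SVM scores, and the name classify_by_best_model is hypothetical.

```python
from typing import Callable, Dict, Tuple

# A "model" here is any function that maps a sequence to a score.
# Real models would be profile HMMs or SVMs; these are toy placeholders.
Model = Callable[[str], float]


def classify_by_best_model(sequence: str, library: Dict[str, Model]) -> Tuple[str, Dict[str, float]]:
    """Score the sequence against every model and return the best class label."""
    scores = {label: model(sequence) for label, model in library.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy library (assumption): each class "model" is just the frequency of one residue.
toy_library: Dict[str, Model] = {
    "amine": lambda s: s.count("D") / len(s),
    "peptide": lambda s: s.count("K") / len(s),
}

label, scores = classify_by_best_model("DDKA", toy_library)
print(label, scores)
```

With one model per class, the cost of classifying a sequence grows with the size of the library, which is presumably why a tree-structured search such as SVMtree is more efficient.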

Results from 2-fold cross-validation experiments, shown as ROC curves (number of false positives versus coverage)

Level 2 subfamily classification:

Errors per sequence at the Minimum Error Point (MEP):

    13.7% for multi-class SVM (best method at this level)

    17.1% for SVMtree (hierarchical multi-class SVM)

    25.5% for BLAST

    30% for profile HMM

    49% for kernel nearest neighbor on the feature vectors (kernNN)

Percentage of true positives before the first false positive (coverage):

    65% both SVMs

    13% BLAST

    5% profile HMM

    4% kernNN
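The two summary statistics used in these results can be sketched as follows. The sketch assumes that the MEP is the score threshold at which false positives plus false negatives are fewest, and that coverage is the fraction of all true positives ranked above the first false positive. The function names and the toy scores are hypothetical, and how the paper normalizes the error count per sequence is not reproduced here.

```python
from typing import List, Tuple

# Each entry is (score, is_true_member); higher score = more confident prediction.
Scored = List[Tuple[float, bool]]


def min_error_count(scored: Scored) -> int:
    """Fewest errors (FP + FN) over all thresholds that accept the top-k items."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    best = total_pos  # k = 0: nothing accepted, every positive is a false negative
    fp = tp = 0
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        best = min(best, fp + (total_pos - tp))
    return best


def coverage_before_first_fp(scored: Scored) -> float:
    """Fraction of all positives ranked above the first false positive."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    if total_pos == 0:
        return 0.0
    found = 0
    for _, is_pos in ranked:
        if not is_pos:
            break
        found += 1
    return found / total_pos


toy = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
print(min_error_count(toy))           # 1
print(coverage_before_first_fp(toy))  # 2 of 3 positives
```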

Level 1 subfamily classification:

average error at MEP:

11.6% for SVM (best method at this level)

16.7% for BLAST

30.1% for SAM-T2K HMM

36.0% for kernNN

Coverage (%)

48% for SVM

29% for BLAST

6% for SAM-T2K HMM

38% for kernNN

GPCR or not (trained on positive examples = Class B-E, negative examples from SCOP and archaeal rhodopsins and G proteins, tested on Class A):

average error at MEP:

0.04 % for SAM-T99 HMM (best method at this level)

0.22 % for SVM

6.82 % for FPS BLAST

Coverage:

98 % for SAM-T99 HMM

72 % for SVM

3 % for FPS BLAST

Webserver:

subfamily classification based on hierarchical multi-class SVM at:

http://www.soe.ucsc.edu/research/compbio/gpcr-subclass

list of predictions for human GPCR:

http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_result

Reference for previous classification:

Horn, F., J. Weare, M.W. Beukers, S. Horsch, A. Bairoch, W. Chen, O. Edvardsen, F. Campagne, and G. Vriend, GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res, 1998. 26(1): p. 275-9

the organization into classes is originally based on the pharmacological classification of receptors

classes share > 20% sequence identity over predicted transmembrane helices

Multiple sequence alignment designed specifically for GPCRs, based on WHAT IF (using the neighbor-joining algorithm): ref. Oliveira, L., A.C. Paiva, and G. Vriend, J Comp.-Aid. Mol. Des., 1993. 7: p. 649-658

In this method, the sequences are numbered such that the hundreds digit indicates the helix number and the most conserved residue in every helix has a round number. However, we found that this is sometimes wrong: some parts of the sequence that are not TM helices were assigned hundreds numbers.
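As a rough illustration of how such a number can be read back, the sketch below takes the hundreds digit as the helix number, as described above. The function name and the assumption of seven helices numbered 1 to 7 are illustrative and are not taken from WHAT IF or GPCRDB.

```python
def helix_from_number(residue_number: int) -> int:
    """Return the helix index encoded in the hundreds digit of a residue number.

    Assumes (for illustration) seven TM helices numbered 1-7, so valid
    residue numbers lie between 100 and 799.
    """
    helix = residue_number // 100
    if not 1 <= helix <= 7:
        raise ValueError(f"{residue_number} does not encode a helix 1-7")
    return helix


print(helix_from_number(340))  # 3
```

A check of this kind only decodes the label. Whether the labelled residue really lies in a TM helix has to be verified separately, for example against a TM prediction.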

Horn, F., G. Vriend, and F.E. Cohen, Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res, 2001. 29(1): p. 346-9

same paper as Horn et al. (1998) but updated

July 2000:

1796 sequences

1290 structures

12000 ligand binding data (dissociation constants), manually extracted from the literature by P. Seeman (see ref. in Horn et al. 1998, Nucl. Acids. Res. p. 275-279)

8300 mutants

There is supposed to be a tool for searching for a sequence pattern in a helix or a loop by means of logical and regular expressions, but we could not find it and therefore developed one ourselves, see flan.blm.cs.cmu.edu
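The basic idea of such a search, restricting a regular expression to one helix or loop, can be sketched as below. This is not the tool mentioned above. The function name, the toy sequence, and the segment boundaries are all hypothetical.

```python
import re
from typing import List, Tuple


def find_pattern_in_segment(sequence: str, start: int, end: int, pattern: str) -> List[Tuple[int, str]]:
    """Find regex matches inside one segment (helix or loop) of a sequence.

    start and end are 1-based, inclusive residue positions of the segment.
    Returns (1-based position in the full sequence, matched text) pairs.
    """
    segment = sequence[start - 1:end]
    return [(m.start() + start, m.group()) for m in re.finditer(pattern, segment)]


toy_sequence = "AAADRYAAAEDRWAA"   # made-up sequence, not a real receptor
pattern = r"[DE]R[YW]"             # D or E, then R, then Y or W

print(find_pattern_in_segment(toy_sequence, 1, 8, pattern))   # [(4, 'DRY')]
print(find_pattern_in_segment(toy_sequence, 1, 15, pattern))  # [(4, 'DRY'), (11, 'DRW')]
```

Logical combinations (pattern A in helix 3 and pattern B in a loop, say) could then be built by calling the function once per segment and combining the results.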

There is also supposed to be a BLAST server that allows the user to scan one sequence pattern against all the sequences stored in the GPCRDB, but we could not find this capability either.

Horn, F., E. Bettler, L. Oliveira, F. Campagne, F.E. Cohen, and G. Vriend, GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res, 2003. 31(1): p. 294-7

another update on the same as above

Horn, F., R. Bywater, G. Krause, W. Kuipers, L. Oliveira, A.C. Paiva, C. Sander, and G. Vriend, The interaction of class B G protein-coupled receptors with their hormones. Receptors Channels, 1998. 5(5): p. 305-14.

Kuipers, W., L. Oliveira, G. Vriend, and A.P. Ijzerman, Identification of class-determining residues in G protein-coupled receptors by sequence analysis. Receptors Channels, 1997. 5(3-4): p. 159-74.

"Prediction of GPCR types", K.C. Chou (Structural and Computational Chem., Pharmacia, Kalamazoo, MI, US) and D.W. Elrod (Bioinformatics, Pharmacia, ditto). Poster presented at European Protein Society, Florence, Italy 2003.

"develop a fast sequence-based method to identify their different types"

566 GPCR were classified into 7 different types

need to establish a good training data set for accurate prediction