G-protein coupled receptor classification
Task:
GPCR
Class A, Class B, Class C, etc. (considered roughly analogous to SCOP families; reference for the use of SVM for SCOP superfamily classification: Jaakkola et al. (2000))
Level 1 subfamily of Class A: amine, peptide, opsin, viral ...
Level 2 subfamily of Amine: muscarinic, serotonin, histamine, dopamine ...
Approach:
compare different classification methods:
1. simple nearest neighbor approach (BLAST) [original BLAST reference: Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10; WU-BLAST was used here, reference: http://blast.wustl.edu]
2. multiple sequence alignment (HMM): a sophisticated nearest neighbor approach that builds a library of statistical models for the protein classes of interest. The target sequence is scored against all models in the library and classified according to the model that gives the sequence the best score [SAM-T2K HMM: Hughey, R. and A. Krogh, Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput Appl Biosci, 1996. 12(2): p. 95-107; Karplus, K., C. Barrett, and R. Hughey, Hidden Markov models for detecting remote protein homologies. Bioinformatics, 1998. 14(10): p. 846-56]
3. methods that transform protein sequences into fixed-length feature vectors (SVM): each protein sequence is transformed into Fisher Score Vector (FSV) space, a library of SVMs is built with one SVM per protein class, and each test sequence is scored against the entire library as in the HMM experiment (SVMtree is a more efficient way to do the same)
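The step shared by the HMM and SVM experiments, scoring a sequence against every model in the library and taking the best-scoring class, can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code used in the experiments: the names `Scorer`, `classify`, `residue_fraction` and `toy_library` are hypothetical, and the toy scorers simply count residues in place of real HMM or SVM scores.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: a "model" is any function that maps a sequence to a score,
# where a higher score means a better fit. Real HMM or SVM scorers would go here.
Scorer = Callable[[str], float]


def classify(sequence: str, library: Dict[str, Scorer]) -> Tuple[str, float]:
    """Score the sequence against every model and return the best class and score."""
    if not library:
        raise ValueError("model library is empty")
    scores = {label: scorer(sequence) for label, scorer in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def residue_fraction(residue: str) -> Scorer:
    """Toy stand-in scorer (NOT a real model): fraction of one residue type."""
    return lambda seq: seq.count(residue) / max(len(seq), 1)


# Toy library with made-up scorers, for illustration only.
toy_library = {
    "amine": residue_fraction("D"),
    "peptide": residue_fraction("K"),
}

label, score = classify("KKKD", toy_library)
print(label, score)
```

Any scorer with the same signature could be dropped into the library, which is why the same loop serves both the HMM and the SVM variant.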
Results from 2-fold cross-validation experiments, shown as ROC curves (number of false positives versus coverage)
Level 2 subfamily classification:
Errors per sequence at the Minimum Error Point (MEP):
13.7% for multi-class SVM (best method at this level)
17.1% for SVMtree (hierarchical multi-class SVM)
25.5% for BLAST
30% for profile HMM
49% for Kernel Nearest Neighbor (kernNN), a nearest neighbor method on the feature vectors
Percentage of true positives before the first false positive (coverage):
65% for both SVMs
13% for BLAST
5% for profile HMM
4% for kernNN
Level 1 subfamily classification:
average error at MEP:
11.6% for SVM (best method at this level)
16.7% for BLAST
30.1% for SAM-T2K HMM
36.0% for kernNN
Coverage:
48% for SVM
29% for BLAST
6% for SAM-T2K HMM
38% for kernNN
GPCR or not (trained with Class B-E as positive examples and with SCOP sequences, archaeal rhodopsins and G proteins as negative examples; tested on Class A):
average error at MEP:
0.04% for SAM-T99 HMM (best method at this level)
0.22% for SVM
6.82% for FPS BLAST
Coverage:
98% for SAM-T99 HMM
72% for SVM
3% for FPS BLAST
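The two figures reported throughout the results above can be computed from a list of scored test sequences roughly as follows. This is a sketch under assumptions: coverage is taken as the fraction of true positives ranked above the first false positive, and the error at the MEP as the smallest value of (false positives + false negatives) over all score thresholds, divided by the number of test sequences. The function names are hypothetical, and tied scores are not treated specially.

```python
from typing import List, Tuple

Scored = List[Tuple[float, bool]]  # (score, is_true_member) per test sequence


def rank(scored: Scored) -> List[bool]:
    """Sort by descending score and return only the true/false labels."""
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [is_pos for _, is_pos in ordered]


def coverage_before_first_fp(scored: Scored) -> float:
    """Fraction of true positives ranked above the first false positive."""
    labels = rank(scored)
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    found = 0
    for is_pos in labels:
        if not is_pos:
            break
        found += 1
    return found / total_pos


def min_error_rate(scored: Scored) -> float:
    """Smallest (FP + FN) over all thresholds, per test sequence (assumed MEP)."""
    labels = rank(scored)
    if not labels:
        raise ValueError("no scored sequences")
    fp, fn = 0, sum(labels)  # threshold above every score: all positives missed
    best = fp + fn
    for is_pos in labels:  # lower the threshold past one sequence at a time
        if is_pos:
            fn -= 1
        else:
            fp += 1
        best = min(best, fp + fn)
    return best / len(labels)


example = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
print(coverage_before_first_fp(example), min_error_rate(example))
```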
Webserver:
subfamily classification based on hierarchical multi-class SVM at:
http://www.soe.ucsc.edu/research/compbio/gpcr-subclass
list of predictions for human GPCR:
http://www.soe.ucsc.edu/research/compbio/gpcr_hg/class_result
Reference for previous classification:
Horn, F., J. Weare, M.W. Beukers, S. Horsch, A. Bairoch, W. Chen, O. Edvardsen, F. Campagne, and G. Vriend, GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res, 1998. 26(1): p. 275-9
the organization into classes is originally based on the pharmacological classification of receptors
classes share > 20% sequence identity over predicted transmembrane helices
Multiple sequence alignment designed specifically for GPCRs, based on WHAT IF (using the neighbor-joining algorithm): ref. Oliveira, L., A.C. Paiva, and G. Vriend, J Comput Aided Mol Des, 1993. 7: p. 649-658
In this method, the sequences are numbered such that the hundreds digit indicates the helix number and the most conserved residue in every helix has a round number. However, we found that this is sometimes wrong: parts of the sequence that are not TM helices were assigned helix (hundreds) numbers.
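The numbering idea described above, and the check for numbered residues that fall outside predicted TM helices, can be illustrated with a small sketch. The concrete "round number" is not given here, so `anchor_offset=50` is purely an assumption for illustration; the function names and the example coordinates are hypothetical as well.

```python
from typing import Dict, List, Tuple


def number_helix(helix: int, start: int, end: int, anchor_index: int,
                 anchor_offset: int = 50) -> Dict[int, int]:
    """Map sequence positions start..end (inclusive) to helix-based numbers.

    The most conserved residue (anchor_index) gets helix * 100 + anchor_offset;
    anchor_offset = 50 is an ASSUMED value, not taken from the original method.
    """
    if not start <= anchor_index <= end:
        raise ValueError("anchor residue lies outside the segment")
    anchor_number = helix * 100 + anchor_offset
    numbering = {i: anchor_number + (i - anchor_index) for i in range(start, end + 1)}
    if min(numbering.values()) < helix * 100 or max(numbering.values()) > helix * 100 + 99:
        raise ValueError("numbers leave this helix's block of 100")
    return numbering


def helix_of(number: int) -> int:
    """The hundreds digit gives the helix number."""
    return number // 100


def outside_tm(numbering: Dict[int, int], tm_segments: List[Tuple[int, int]]) -> List[int]:
    """Positions that received a helix number but lie in no predicted TM segment."""
    return [i for i in sorted(numbering)
            if not any(s <= i <= e for s, e in tm_segments)]


numbering = number_helix(helix=3, start=100, end=124, anchor_index=110)
print(numbering[110], helix_of(numbering[124]))
print(outside_tm(numbering, [(102, 124)]))  # numbered positions outside the predicted helix
```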
Horn, F., G. Vriend, and F.E. Cohen, Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res, 2001. 29(1): p. 346-9
same paper as Horn et al. (1998) but updated
July 2000:
1796 sequences
1290 structures
12000 ligand binding data (dissociation constants), manually extracted from the literature by P. Seeman (see ref. in Horn et al. 1998, Nucl. Acids. Res. p. 275-279)
8300 mutants
There is supposed to be a tool for searching for a sequence pattern in a helix or a loop by means of logical and regular expressions, but we could not find this tool and therefore developed it ourselves, see flan.blm.cs.cmu.edu
There is also supposed to be a BLAST server that allows the user to scan one sequence pattern against all the sequences stored in the GPCRDB, but we could not find this capability either.
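A pattern search restricted to helices or loops, as mentioned above, could look roughly like the sketch below. It is not the tool referred to in these notes: `find_in_regions`, the region names, the coordinates and the example sequence are all made up for illustration, and only Python's standard `re` module is used.

```python
import re
from typing import Callable, Dict, List, Tuple


def find_in_regions(sequence: str,
                    regions: Dict[str, Tuple[int, int]],
                    pattern: str,
                    region_filter: Callable[[str], bool] = lambda name: True
                    ) -> List[Tuple[str, int, str]]:
    """Return (region name, absolute start, matched text) for each regex match.

    Regions are given as 0-based, end-exclusive (start, end) coordinates.
    region_filter supplies the logical part, e.g. "helices only".
    """
    regex = re.compile(pattern)
    hits = []
    for name, (start, end) in regions.items():
        if not region_filter(name):
            continue
        for match in regex.finditer(sequence[start:end]):
            hits.append((name, start + match.start(), match.group()))
    return hits


# Made-up sequence and region annotation, for illustration only.
sequence = "AAADRYAAAKKDRYGG"
regions = {"helix3": (0, 8), "loop2": (8, 16)}

print(find_in_regions(sequence, regions, r"[DE]RY"))
print(find_in_regions(sequence, regions, r"[DE]RY",
                      region_filter=lambda name: name.startswith("helix")))
```

Because each region is sliced out before matching, a hit can never span a helix/loop boundary, which is the behaviour one would want from a region-restricted search.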
Horn, F., E. Bettler, L. Oliveira, F. Campagne, F.E. Cohen, and G. Vriend, GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res, 2003. 31(1): p. 294-7
another update of the same paper
Horn, F., R. Bywater, G. Krause, W. Kuipers, L. Oliveira, A.C. Paiva, C. Sander, and G. Vriend, The interaction of class B G protein-coupled receptors with their hormones. Receptors Channels, 1998. 5(5): p. 305-14.
Kuipers, W., L. Oliveira, G. Vriend, and A.P. Ijzerman, Identification of class-determining residues in G protein-coupled receptors by sequence analysis. Receptors Channels, 1997. 5(3-4): p. 159-74.
"Prediction of GPCR types", K.C. Chou (Structural and Computational Chem., Pharmacia, Kalamazoo, MI, US) and D.W. Elrod (Bioinformatics, Pharmacia, ditto). Poster presented at European Protein Society, Florence, Italy, 2003.
"develop a fast sequence-based method to identify their different types"
566 GPCRs were classified into 7 different types
need to establish a good training data set for accurate prediction