(For paper:
"Evaluation of different biological data and computational classification
methods for use in protein interaction prediction".)
1. How did you handle the missing values when classifying samples using SVM?
Ø For classifiers that cannot handle missing feature values, we simply used a
fill-in strategy: numerical features are filled with the mean, and categorical
features are filled with the most frequent value.
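A minimal sketch of this fill-in strategy, for illustration only. The function name `fill_missing`, the use of `None` as the missing-value marker, and the toy rows are assumptions, not the code used for the paper. The sketch also assumes every column has at least one observed value.

```python
from collections import Counter
from statistics import mean


def fill_missing(rows, categorical, missing=None):
    """Return a copy of `rows` with missing entries filled column by column.

    rows        -- list of equal-length feature lists
    categorical -- set of column indices holding categorical features
    missing     -- marker for a missing entry (assumed to be None here)
    """
    n_cols = len(rows[0])
    fill = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not missing]
        if j in categorical:
            # Categorical feature: most frequent observed value.
            fill.append(Counter(observed).most_common(1)[0][0])
        else:
            # Numerical feature: mean of the observed values.
            fill.append(mean(observed))
    return [[fill[j] if v is missing else v for j, v in enumerate(r)]
            for r in rows]


# Toy data: column 0 is numerical, column 1 is categorical.
rows = [
    [1.0, "nucleus"],
    [None, "cytoplasm"],
    [3.0, None],
    [5.0, "nucleus"],
]
filled = fill_missing(rows, categorical={1})
```

Here the missing numerical entry is replaced by the column mean and the missing categorical entry by the most common label; the input rows are left unchanged.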
2. Which kernel did you use for SVM?
Ø We compared the polynomial (-d 2) kernel and the linear kernel using
SVMLight. The paper reports the linear kernel results. We also varied the cost
factor parameter and reported the best setting based on train-test results.
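As a rough illustration, the comparison above could be scripted by generating one `svm_learn` command line per setting. This sketch assumes the "cost factor" refers to SVMLight's `-j` option; the grid values and file names are hypothetical and are not the settings used for the paper.

```python
from itertools import product

# SVMLight kernel flags: -t 0 is linear, -t 1 is polynomial, -d is the degree.
KERNELS = {
    "linear": ["-t", "0"],
    "poly2": ["-t", "1", "-d", "2"],
}
# Hypothetical grid of cost factors (assumed to map to the -j option).
COST_FACTORS = [0.5, 1, 2, 5]


def svm_learn_commands(train_file, model_prefix):
    """Build one svm_learn argument list per (kernel, cost factor) pair."""
    commands = []
    for (name, kernel_flags), j in product(KERNELS.items(), COST_FACTORS):
        model_file = f"{model_prefix}.{name}.j{j}.model"
        commands.append(
            ["svm_learn", *kernel_flags, "-j", str(j), train_file, model_file]
        )
    return commands


commands = svm_learn_commands("train.dat", "ppi")
```

Each argument list can be passed to `subprocess.run`; the resulting models would then be compared on the held-out test split.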
3. How did you handle the numerical features in the Naïve Bayes classifier?
Ø We chose to use the “-K” option in
4. For the evaluation, we repeated the procedure 25 times and reported the
average value. How is the average value derived in terms of precision-recall
curves?
Ø In each test run, varying the score cutoff gives one precision-recall curve,
so we obtain 25 precision vs. recall curves in total. We then average the
precision values of these 25 curves at each fixed recall value.
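The averaging can be sketched as follows. This is an illustration, not the evaluation script used for the paper: the function names are hypothetical, the precision at a fixed recall value is assumed to be the precision at the first cutoff that reaches that recall, tied scores are not treated specially, and the toy example uses two runs rather than 25.

```python
def precision_at_recall(scores, labels, recall_levels):
    """Precision at the first score cutoff whose recall reaches each level.

    scores -- classifier scores, higher means more likely positive
    labels -- 1 for a true interaction, 0 otherwise
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    curve = []  # (recall, precision) after admitting the top k pairs
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        curve.append((true_pos / n_pos, true_pos / k))
    return [next(p for r, p in curve if r >= level) for level in recall_levels]


def average_curves(runs, recall_levels):
    """Average precision over runs at each fixed recall level."""
    per_run = [precision_at_recall(s, y, recall_levels) for s, y in runs]
    return [sum(column) / len(per_run) for column in zip(*per_run)]


# Two toy runs, each given as (scores, labels).
runs = [
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    ([0.9, 0.8, 0.7, 0.6], [0, 1, 1, 0]),
]
averaged = average_curves(runs, recall_levels=[0.5, 1.0])
```

Averaging at fixed recall values keeps the curves aligned even when the runs produce different score ranges.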
5. How did you generate the scores for “Computational Predictions Downloads”
and why did you put them online?
Ø The scores are at: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/structure-9/PPI/protein05/twoTasks-fullPredict/
Ø We are currently building a web service that provides retrieval functions
for computationally predicted PPI scores, and the scores above were generated
to supply data for this service. They may also help people who are more
interested in using predicted PPI scores than in the evaluation comparison in
the paper.
Ø The currently shared scores were generated with the following procedure:
1. The features are those in the paper's Table III (detailed encoding type).
2. The positive training examples come from DIP physical PPI data and MIPS
co-complex data (Table II).
3. The score calculation:
- First, train an SVM model on a training set that includes all positive PPIs
of that task and randomly selected negative PPI examples.
- Then classify all possible yeast protein-protein pairs (6270 * 6269 / 2)
with this model.
- (In the old shared file) Rank all these pairs by classification score.
- (In the old shared file) Report the top ~20000 pairs from the ranked list.
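The enumeration and ranking steps above can be sketched as below. The names `n_pairs`, `top_pairs`, and the toy score table are hypothetical; in the real procedure the score would come from the trained SVM model, which is not reproduced here.

```python
import heapq
from itertools import combinations


def n_pairs(n):
    """Number of unordered pairs among n proteins: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def top_pairs(proteins, score, k):
    """Score every unordered pair and keep the k highest-scoring ones.

    `score(a, b)` stands in for the classifier's output on the pair (a, b).
    heapq.nlargest avoids sorting the full list of pairs.
    """
    scored = ((score(a, b), a, b) for a, b in combinations(proteins, 2))
    return heapq.nlargest(k, scored)


# Size of the full yeast pair space mentioned above.
total = n_pairs(6270)  # 19653315

# Toy stand-in for classifier scores on four proteins.
toy_scores = {
    ("A", "B"): 0.2, ("A", "C"): 1.5, ("A", "D"): -0.3,
    ("B", "C"): 0.9, ("B", "D"): -1.1, ("C", "D"): 0.4,
}
best = top_pairs(["A", "B", "C", "D"], lambda a, b: toy_scores[(a, b)], k=2)
```

With roughly 19.7 million pairs and a cutoff near 20000, keeping only the top of the list this way is cheaper than a full sort.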
Ø I hope to update these scores frequently. (We currently use SVM to classify
all potential pairs because RF is relatively too slow.)
Ø If you want the predicted scores before calibration, or predicted PPIs in
other formats or sizes, please feel free to contact me.
6. Would you share your code for generating the features?
Ø Here I share the code and related files used to generate our feature set.
Download (both summary and detailed!). The general framework and the code
should be quite useful. You could look for more recent versions of the related
evidence sets to improve on them, though.