Some of the TC experimental results:

2. A link to the structure graph of the Reuters 2001 collection
3. Results on TREC10 data (84 topic categories)
7. Results on the updated R2001 (topic code) data (781265 test documents)
Results summary
RCV1 (topic code)

| Classifier | Micro-avg F1 (783484 test set) | Macro-avg F1 (783484 test set) | Micro-avg F1 (781265 test set) | Macro-avg F1 (781265 test set) |
|---|---|---|---|---|
| kNN | 0.7666 | 0.5374 | 0.7673 | 0.5379 |
| Rocchio | 0.6950 | 0.4994 | 0.6957 | 0.4998 |
| NB (multi-class) | 0.6700 | 0.4606 | 0.6714 | 0.4619 |
| SVM | 0.8096 | 0.6032 | 0.8104 | 0.6036 |

RCV1 (industry code)

| Classifier | Micro-avg F1 (783484 test set) | Macro-avg F1 (783484 test set) | Micro-avg F1 (781265 test set) | Macro-avg F1 (781265 test set) |
|---|---|---|---|---|
| kNN | 0.3952 | 0.2103 | 0.3955 | 0.2106 |
| Rocchio | 0.3838 | 0.1523 | 0.3842 | 0.1525 |
| NB (multi-class) | 0.3114 | 0.0719 | 0.3118 | 0.0720 |
| SVM | 0.5121 | 0.2659 | 0.5122 | 0.2660 |

RCV1 (region code)

| Classifier | Micro-avg F1 (783484 test set) | Macro-avg F1 (783484 test set) | Micro-avg F1 (781265 test set) | Macro-avg F1 (781265 test set) |
|---|---|---|---|---|
| kNN | 0.7909 | 0.4531 | 0.7908 | 0.4531 |
| Rocchio | 0.7929 | 0.4405 | 0.7932 | 0.4406 |
| NB (multi-class) | 0.7544 | 0.3682 | 0.7549 | 0.3684 |
| SVM | 0.8726 | 0.4632 | 0.8726 | 0.4632 |
| Classifier | Micro-avg F1 | Macro-avg F1 |
|---|---|---|
| kNN | 0.765 | 0.549 |
| Rocchio | 0.693 | 0.495 |
| NB (multi-class) | 0.625 | 0.470 |
| NB (binary version) | 0.666 | 0.437 |
| SVM1 | 0.816 | 0.607 |
| SVM2 | 0.810 | 0.546 |
| SVM2 with SVM1's thresholds | 0.791 | 0.594 |
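For reference, the micro- and macro-averaged F1 scores above combine per-category results in different ways: micro-averaging pools the contingency tables of all categories before computing F1, while macro-averaging computes F1 per category and averages the results, so rare, hard categories weigh much more heavily in the macro score. A minimal sketch in Python (the counts below are hypothetical, chosen only to show the effect):

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(tables):
    """tables: one (TP, FP, FN) tuple per category."""
    # Micro-average: pool the counts over all categories, then compute F1 once.
    tp, fp, fn = (sum(col) for col in zip(*tables))
    micro = f1(tp, fp, fn)
    # Macro-average: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(*t) for t in tables) / len(tables)
    return micro, macro

# Hypothetical counts: one frequent category classified well, two rare ones
# classified poorly. The rare categories barely move the micro average but
# drag the macro average down, which is why macro F1 is consistently the
# lower column in the tables above.
print(micro_macro_f1([(900, 100, 100), (5, 10, 15), (2, 8, 18)]))
# -> (0.878..., 0.439...)
```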
Some graphical performance analysis
* A feature selection graph, which shows how sensitive each classifier is to the number of features selected (a sketch of top-N feature selection follows this list)
* A category-frequency ("catfre") vs. F1 graph on the updated R2001 (topic) data set, which shows the catfre-F1 curve of each classifier
* An SVM catfre vs. F1 graph on the three coding schemes of R2001; the most interesting phenomenon is the saturation on the industry categories
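The page does not say which criterion was used to rank features; chi-square is one common choice in TC experiments, so the sketch below uses it purely as an illustration of how a "top 10000 features" cutoff (as in the timing tables below) might be produced. The function names and count layout are assumptions, not the authors' code:

```python
from collections import Counter

def chi_square(n11, n10, n01, n00):
    """Chi-square score for a 2x2 term/category contingency table.
    n11: docs with the term, in the category; n10: with the term, not in it;
    n01: without the term, in the category; n00: without the term, not in it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

def top_features(docs, labels, category, n_features=10000):
    """Rank vocabulary terms by chi-square against one category, keep the top N.
    docs: list of token lists; labels: list of sets of category names."""
    in_cat = [category in ls for ls in labels]
    n_pos = sum(in_cat)
    n_neg = len(docs) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, pos in zip(docs, in_cat):
        for term in set(tokens):
            (df_pos if pos else df_neg)[term] += 1
    vocab = set(df_pos) | set(df_neg)
    scored = {t: chi_square(df_pos[t], df_neg[t],
                            n_pos - df_pos[t], n_neg - df_neg[t])
              for t in vocab}
    return sorted(scored, key=scored.get, reverse=True)[:n_features]
```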
Time cost analysis
For the time complexity analysis, let S = the number of training documents, V = the number of words (vocabulary size), C = the number of possible classes, L = the average length of a document, L' = the average number of features in a document, and k = the length (depth) of the decision tree. The average document frequency in an inverted index is then L'S/V, i.e., the average posting-list length. The time complexity of each flat classifier is as follows (the kNN test cost is illustrated by the sketch after the table):
| Classifier | Training cost | Testing cost |
|---|---|---|
| Rocchio | O(LS) + O(CV) | O(L'C) |
| NB (multi-class version) | O(LS) | O(LC) |
| kNN | O(LS) | O(L'LS/V) + O(S log S) |
| SVM | The key component of SVM's training time is the quadratic optimization; with more than 10000 training documents it can become a bottleneck. According to [Thorsten], SVM's training time is roughly comparable to C4.5's. | The test cost can be written as O(xCL'LS/V), more expensive than NB and Rocchio, though SVM's test speed is still faster than kNN's. |
| C4.5 | O(kSCV) | |
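The kNN test cost O(L'LS/V) + O(S log S) comes from scoring a test document through the inverted index: each of the test document's features touches one posting list of average length L'S/V, and the scored training documents are then ranked. A minimal sketch, assuming dot-product similarity over weighted feature vectors (the actual weighting scheme used in the experiments is not stated):

```python
from collections import defaultdict
import heapq

def build_inverted_index(train_docs):
    """train_docs: one {feature: weight} dict per training document.
    Each posting list holds (doc_id, weight) pairs; by the estimate above,
    its average length is L'S/V."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(train_docs):
        for feature, weight in vec.items():
            index[feature].append((doc_id, weight))
    return index

def knn_categories(index, labels, test_vec, k=10):
    """Score the training documents that share features with the test
    document, then vote with the k nearest neighbors."""
    scores = defaultdict(float)
    for feature, w in test_vec.items():                # ~L features per test doc
        for doc_id, w_train in index.get(feature, ()): # ~L'S/V postings each
            scores[doc_id] += w * w_train              # dot-product similarity
    # A size-k heap replaces the full O(S log S) sort when k << S.
    neighbors = heapq.nlargest(k, scores.items(), key=lambda x: x[1])
    # Standard kNN rule: a category's score is the sum of the similarities
    # of the neighbors labeled with it.
    cat_scores = defaultdict(float)
    for doc_id, sim in neighbors:
        for cat in labels[doc_id]:
            cat_scores[cat] += sim
    return cat_scores
```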
Here is a table of the time these classifiers took when run on R21578 (90 categories); parameter tuning time is not included.
| Classifier | Training cost | Testing cost | Total cost |
|---|---|---|---|
| Rocchio (all features used) | | | 2010.49 CPU seconds in total |
| kNN (k=10, all features used) | | | 124.15 CPU seconds in total, mostly spent in the test phase |
| NB (multi-class version, all features used) | 6.31 CPU seconds | 11.68 CPU seconds | 27.68 CPU seconds |
| NB (binary version, all features used) | 371.8 CPU seconds | 159.66 CPU seconds | 964.58 CPU seconds |
| SVM (all features used) | 432.04 CPU seconds | 127.03 CPU seconds | 658.36 CPU seconds |
Here is a table of the time these classifiers took when run on RCV1 (industry codes, 313 categories in the training set).
| Classifier | Training cost | Testing cost | Total cost |
|---|---|---|---|
| Rocchio (10000 top features used) | | | 36160.52 CPU seconds in total |
| kNN (10000 top features used) | | | 67369.38 CPU seconds in total, mostly spent in the test phase |
| NB (multi-class version, 10000 features used) | 8.9 CPU seconds | 3937 CPU seconds | 29161 CPU seconds |
| NB (binary version) | CPU seconds | CPU seconds | CPU seconds |
| SVM (all features used) | CPU seconds | CPU seconds | CPU seconds |