
Evaluation

To evaluate the post-processor, it was applied to all data sets containing continuous attributes from the UCI machine learning repository [Murphy and Aha, 1993] that were held at the time (because of previous machine learning experimentation) in the local repository at Deakin University. These data sets are believed to be broadly representative of those in the repository as a whole. After experimentation with these eleven data sets, two additional data sets, sick euthyroid and discordant results, were retrieved from the UCI repository and added to the study in order to investigate specific issues, as discussed below.

The resulting thirteen data sets are described in Table 1. The second column contains the number of attributes by which each object is described. The third column indicates the percentage of these attributes that are continuous. The fourth column indicates the percentage of attribute values in the data that are missing (unknown). The fifth column indicates the number of objects that the data set contains. The sixth column indicates the percentage of these objects that belong to the class represented by the most objects within the data set. The final column indicates the number of classes that the data set describes. Note that the glass type data set uses the three-class Float/Not Float/Other classification rather than the more commonly used six-class classification.

 

Name                      No. of  %           %        No. of   % most        No. of
                          attrs.  continuous  missing  objects  common class  classes
breast cancer Wisconsin   9       100         <1       699      66            2
Cleveland heart disease   13      46          <1       303      54            2
credit rating             15      40          1        690      56            2
discordant results        29      24          6        3772     98            2
echocardiogram            6       83          3        74       68            2
glass type                9       100         0        214      40            3
hepatitis                 19      32          6        155      79            2
Hungarian heart disease   13      46          20       295      64            2
hypothyroid               29      24          6        3772     92            4
iris                      4       100         0        150      33            3
new thyroid               5       100         0        215      70            3
Pima indians diabetes     8       100         0        768      65            2
sick euthyroid            29      24          6        3772     94            2
Table 1: UCI data sets used for experimentation

Each data set was divided into training and evaluation sets 100 times. Each training set consisted of 80% of the data, randomly selected. Each evaluation set consisted of the remaining 20% of the data. Both C4.5 and C4.5X were applied to each of the resulting 1300 (13 data sets by 100 trials) training and evaluation set pairs.
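The splitting procedure just described can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `repeated_holdout`, the fixed seed, and the rounding of the training-set size are all assumptions, since the text specifies only the 80%/20% proportions and the 100 repetitions.

```python
import random


def repeated_holdout(n_objects, n_trials=100, train_fraction=0.8, seed=0):
    """Yield (training_indices, evaluation_indices) for repeated random splits.

    The seed and the use of round() for the training-set size are
    assumptions made for this sketch; the paper does not specify them.
    """
    rng = random.Random(seed)
    n_train = round(n_objects * train_fraction)
    indices = list(range(n_objects))
    for _ in range(n_trials):
        shuffled = indices[:]          # copy so every trial starts from the same list
        rng.shuffle(shuffled)          # random selection without replacement
        yield shuffled[:n_train], shuffled[n_train:]


# Example: the iris data set contains 150 objects (Table 1).
splits = list(repeated_holdout(150))
```

Each pair partitions the data, so no object appears in both the training set and the evaluation set of the same trial. Both learners would then be run on every pair, which is what makes the later matched-pairs comparison possible.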

Table 2 summarizes the percentage predictive accuracy obtained for the unpruned decision trees generated by both C4.5 and C4.5X. For each data set it presents the mean and standard deviation (s) over the 100 trials for both C4.5 and C4.5X, along with the results of a two-tailed matched-pairs t-test comparing these means. For twelve of the thirteen data sets C4.5X obtained a higher mean accuracy than C4.5. For the remaining data set, hypothyroid, C4.5 obtained higher mean predictive accuracy than C4.5X, albeit by a small margin (measured to two decimal places the mean accuracies were 99.51 and 99.46, respectively). For nine of the data sets the advantage toward C4.5X is statistically significant at the 0.05 level (p<=0.05), although the advantage for the discordant results data is too small to be apparent when measured to one decimal place (measured to two decimal places the values are 98.58 and 98.62, respectively). The advantage toward C4.5 for the hypothyroid data is also statistically significant at the 0.05 level. The differences in mean predictive accuracy for the Hungarian heart disease, new thyroid and sick euthyroid data sets are not significant at the 0.05 level.
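For readers unfamiliar with the matched-pairs t-test, the statistic can be sketched as below. The function name `paired_t` and the example accuracies are hypothetical. The sign convention (first system minus second) is an assumption chosen to be consistent with the tables, where t is negative whenever the C4.5X mean is the higher one.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Matched-pairs t statistic for two equal-length lists of per-trial scores.

    Computed on the per-trial differences (first - second), so the result is
    negative when the second system scores higher on average.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev() is the sample (n - 1) form
    return mean(diffs) / standard_error


# Hypothetical accuracies for four trials, for illustration only.
t = paired_t([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0, 4.0])
```

The statistic has n - 1 degrees of freedom (99 for the 100 trials used here). Obtaining the two-tailed p value requires the t distribution, which the Python standard library does not provide; `scipy.stats.ttest_rel` is one commonly used routine that returns both the statistic and the p value.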

 

                          C4.5         C4.5X
Name                      mean   s     mean   s     t      p
breast cancer Wisconsin   94.1   1.8   94.4   1.7   -3.2   0.002
Cleveland heart disease   72.8   5.0   74.4   4.8   -6.1   0.000
credit rating             82.2   3.4   83.0   3.3   -7.6   0.000
discordant results        98.6   0.5   98.6   0.5   -5.4   0.000
echocardiogram            72.0   9.8   73.5   10.2  -2.8   0.007
glass type                74.0   7.0   75.3   7.2   -4.2   0.000
hepatitis                 79.6   7.1   80.8   6.9   -3.3   0.001
Hungarian heart disease   77.0   5.3   77.4   5.2   -1.8   0.082
hypothyroid               99.5   0.2   99.5   0.2   4.4    0.000
iris                      95.4   3.4   95.7   3.5   -2.2   0.028
new thyroid               89.9   4.2   90.1   4.3   -1.0   0.302
Pima indians diabetes     70.2   3.5   71.3   3.6   -8.1   0.000
sick euthyroid            98.7   0.5   98.7   0.5   -0.0   0.963
Table 2: Percentage predictive accuracy for unpruned decision trees.

 

                          C4.5         C4.5X
Name                      mean   s     mean   s     t      p
breast cancer Wisconsin   95.1   1.7   95.2   1.7   -2.0   0.051
Cleveland heart disease   74.1   5.3   74.8   5.3   -3.7   0.000
credit rating             84.1   3.2   84.6   3.2   -5.3   0.000
discordant results        98.8   0.4   98.8   0.4   -2.6   0.010
echocardiogram            74.2   9.3   75.1   9.8   -1.6   0.118
glass type                74.4   6.9   75.4   6.9   -3.3   0.001
hepatitis                 79.9   6.2   80.7   6.2   -3.0   0.003
Hungarian heart disease   79.2   4.9   79.4   4.8   -1.0   0.310
hypothyroid               99.5   0.2   99.5   0.2   5.4    0.000
iris                      95.4   3.6   95.7   3.7   -1.6   0.109
new thyroid               89.6   4.2   89.8   4.2   -0.8   0.451
Pima indians diabetes     72.2   3.5   72.8   3.5   -5.9   0.000
sick euthyroid            98.7   0.4   98.7   0.4   -0.7   0.480
Table 3: Percentage accuracy for pruned decision trees.

Table 3 uses the same format as Table 2 to summarize the predictive accuracy obtained for the pruned decision trees generated by both C4.5 and C4.5X. For the same twelve data sets C4.5X obtained a higher mean predictive accuracy than C4.5. For the remaining data set, hypothyroid, C4.5 again obtained higher mean predictive accuracy, although again the difference is so small that it is not apparent at the level of precision displayed (measured to two decimal places the mean accuracies are 99.51 and 99.46). For six of the data sets the advantage toward C4.5X is statistically significant at the 0.05 level, although for the discordant results data the difference is only apparent at a precision of two decimal places (98.81 and 98.82, respectively). The advantage toward C4.5 for the hypothyroid data is also statistically significant at the 0.05 level. The differences for breast cancer Wisconsin, echocardiogram, Hungarian heart disease, iris, new thyroid and sick euthyroid are not statistically significant at the 0.05 level.

After completing experimentation on the initial eleven data sets, the results for the hypothyroid data stood out in stark contrast from those for the other ten. This raised the possibility that there might be distinguishing features of the hypothyroid data that accounted for this difference in performance. Table 1 indicates this data set is clearly distinguishable from the other ten initial data sets in the following six respects--

To explore these issues, the discordant results and sick euthyroid data sets were retrieved from the UCI repository and added to the study. These data sets are identical to the hypothyroid data set except that each has a different class attribute. All three data sets contain the same objects, described by the same attributes. The addition of the discordant results and sick euthyroid data did little to illuminate this issue, however. For all three data sets the changes in accuracy are of very small magnitude. For hypothyroid there is a significant advantage to C4.5. For sick euthyroid there is no significant advantage to either system. For the discordant results data there is a significant advantage to C4.5X.

The question of whether there is a distinguishing feature of the hypothyroid data that explains the observed results remains unanswered. Further investigation of this issue lies beyond the scope of the current paper but remains an interesting direction for future research.

These results suggest that, for the type of data found in the UCI repository, C4.5X's post-processing increases predictive accuracy more often than it decreases it. (Of the twenty-six comparisons, there was a significant increase for fifteen and a significant decrease for only two. A sign test reveals that this rate of success is significant at the 0.05 level, p=0.001.)
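The sign test on the fifteen significant increases and two significant decreases can be reproduced with a short binomial calculation. The sketch below assumes a one-tailed test in which the nine comparisons without a significant difference are discarded as ties; the text does not state which form was used, and the function name `sign_test_one_tailed` is hypothetical.

```python
from math import comb


def sign_test_one_tailed(wins, losses):
    """One-tailed sign test: P(at least `wins` successes in wins + losses
    fair coin flips). Ties are assumed to have been discarded already."""
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Fifteen significant increases and two significant decreases.
p = sign_test_one_tailed(15, 2)
```

Under these assumptions the result is 154/131072, approximately 0.0012, which agrees with the reported p=0.001 when rounded to three decimal places. A two-tailed test would double this value to approximately 0.0023, which is still significant at the 0.05 level.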

Tables 4 and 5 summarize the number of nodes in the decision trees developed. Table 4 addresses unpruned decision trees and Table 5 addresses pruned decision trees. Each post-processing modification replaces a single leaf with a split and two leaves. At most one such modification can be performed per leaf in the original tree. For all data sets the post-processed decision trees are significantly more complex than the original decision trees. In most cases post-processing has increased the mean number of nodes in the decision trees by approximately 50%. This demonstrates that the post-processing is causing substantial change.

 

                          C4.5           C4.5X
Name                      mean    s      mean    s      t        p
breast cancer Wisconsin   38.1    6.0    64.0    10.3   -51.5    0.000
Cleveland heart disease   66.7    7.1    100.2   11.3   -61.9    0.000
credit rating             117.6   18.1   177.9   28.4   -44.2    0.000
discordant results        64.0    10.6   85.2    16.2   -33.3    0.000
echocardiogram            15.4    4.1    22.1    6.3    -26.1    0.000
glass type                43.0    5.2    69.7    8.4    -57.2    0.000
hepatitis                 24.5    4.2    34.8    6.0    -49.1    0.000
Hungarian heart disease   62.1    7.5    94.8    13.0   -50.1    0.000
hypothyroid               29.4    4.4    47.5    7.1    -57.8    0.000
iris                      9.0     1.9    16.0    4.0    -31.5    0.000
new thyroid               14.7    2.4    23.4    3.8    -41.5    0.000
Pima indians diabetes     164.8   10.8   238.8   16.3   -108.9   0.000
sick euthyroid            71.7    6.6    111.4   12.1   -65.8    0.000
Table 4: Number of nodes for unpruned decision trees.

 

                          C4.5           C4.5X
Name                      mean    s      mean    s      t        p
breast cancer Wisconsin   19.2    5.0    33.1    8.6    -34.9    0.000
Cleveland heart disease   44.6    8.3    68.3    12.8   -43.6    0.000
credit rating             51.2    14.8   78.4    24.2   -25.8    0.000
discordant results        24.9    5.6    32.5    8.8    -21.1    0.000
echocardiogram            10.4    3.0    14.8    4.8    -21.0    0.000
glass type                36.6    5.5    61.0    9.5    -48.5    0.000
hepatitis                 13.7    4.8    19.8    6.6    -30.7    0.000
Hungarian heart disease   26.8    11.4   41.2    17.3   -22.1    0.000
hypothyroid               23.6    2.9    37.1    5.6    -46.7    0.000
iris                      8.2     1.9    14.8    3.9    -30.3    0.000
new thyroid               14.1    2.7    22.5    4.3    -36.9    0.000
Pima indians diabetes     112.0   16.4   163.9   24.0   -62.5    0.000
sick euthyroid            46.5    5.8    72.6    8.7    -76.7    0.000
Table 5: Number of nodes for pruned decision trees.
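The size of the increase in tree complexity can be checked directly from the mean node counts. The following sketch computes the percentage increase for the unpruned trees using the means reported in Table 4; the variable names are illustrative only.

```python
# Mean node counts for unpruned trees, copied from Table 4: (C4.5, C4.5X).
table4_means = {
    "breast cancer Wisconsin": (38.1, 64.0),
    "Cleveland heart disease": (66.7, 100.2),
    "credit rating": (117.6, 177.9),
    "discordant results": (64.0, 85.2),
    "echocardiogram": (15.4, 22.1),
    "glass type": (43.0, 69.7),
    "hepatitis": (24.5, 34.8),
    "Hungarian heart disease": (62.1, 94.8),
    "hypothyroid": (29.4, 47.5),
    "iris": (9.0, 16.0),
    "new thyroid": (14.7, 23.4),
    "Pima indians diabetes": (164.8, 238.8),
    "sick euthyroid": (71.7, 111.4),
}

# Percentage increase in mean node count caused by post-processing.
increase = {
    name: 100.0 * (c45x - c45) / c45
    for name, (c45, c45x) in table4_means.items()
}

smallest = min(increase, key=increase.get)
largest = max(increase, key=increase.get)
```

On these figures the increase ranges from about 33% (discordant results) to about 78% (iris), with most data sets falling between roughly 40% and 60%. Because each modification replaces one leaf with a split and two leaves, it adds exactly two nodes, so an increase of d nodes corresponds to d/2 modified leaves.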



Geoff Webb
Mon Sep 9 12:13:30 EST 1996