This paper presents a systematic procedure for adding complexity to inferred decision trees without altering their performance on the training data. This procedure has been demonstrated to lead to increases in predictive accuracy for a range of learning tasks when applied to both pruned and unpruned trees inferred by C4.5. For only one of the thirteen learning tasks examined did the procedure lead to a statistically significant loss in accuracy, and in that case the magnitude of the difference in mean accuracy was extremely small. On the face of it, this provides strong experimental evidence against the Occam thesis.
This post-processing technique was developed by rejecting the Occam thesis and attending instead to the similarity assumption: that similar objects have a high probability of belonging to the same class.
The procedure developed was constrained by the need to ensure that the revised decision tree performed identically to the original decision tree with respect to the training data. This constraint arose from the desire to obtain experimental evidence against the Occam thesis. It is possible that if this constraint is removed, the basic techniques outlined in this paper could result in even greater improvements in predictive accuracy than those reported herein.
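To make concrete how complexity can be added to a tree without disturbing its training-set behavior, the sketch below splits a leaf along a cut that no training example reaching that leaf crosses: the occupied side retains the original label, while the empty side may be relabelled, for instance after the class of the nearest training examples. This is an illustrative sketch only, not the post-processor evaluated in this paper; the Leaf and Node classes, refine_leaf, and the particular threshold test are assumptions introduced for exposition.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str

@dataclass
class Node:
    feature: int       # index of the attribute tested
    threshold: float   # test: x[feature] <= threshold follows the left branch
    left: Union["Node", "Leaf"]
    right: Union["Node", "Leaf"]

def classify(tree, x):
    """Follow the tests from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.label

def refine_leaf(leaf, feature, threshold, examples_at_leaf, new_label):
    """Replace `leaf` with a test whose cut lies beyond every training example
    that reaches the leaf. The occupied side keeps the original label, so all
    training examples are classified exactly as before; only the empty side,
    which no training example reaches, receives the (possibly different)
    new_label."""
    values = [x[feature] for x in examples_at_leaf]
    if values and threshold >= max(values):
        return Node(feature, threshold, left=Leaf(leaf.label), right=Leaf(new_label))
    return leaf  # refuse the split if it would alter a training classification

# Example: a leaf labelled "yes" reached by two training examples.
examples = [(0.2, 1.0), (0.4, 3.0)]
tree = refine_leaf(Leaf("yes"), feature=0, threshold=0.5,
                   examples_at_leaf=examples, new_label="no")
assert all(classify(tree, x) == "yes" for x in examples)  # training behaviour preserved
print(classify(tree, (0.9, 2.0)))  # an unseen object in the empty region is now classified "no"
```

Because the added test partitions only a region of the instance space containing no training examples, resubstitution accuracy is unchanged, while predictions for unseen objects falling in that region can differ from those of the original tree.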
This research has considered only one version of Occam's razor, one that favors minimization of syntactic complexity in the expectation that doing so will tend to increase predictive accuracy. Other interpretations of Occam's razor are also possible, such as the injunction that one should minimize semantic complexity. While others [Bunge, 1963] have provided philosophical objections to such formulations of Occam's razor, this paper has not sought to investigate them.
The version of Occam's razor examined in this research has been used widely in machine learning with apparent success. The objections to this principle substantiated by this research raise the question: if the principle is so flawed, why has it enjoyed such apparent success? Webb [1994] suggests that this apparent success is due to the manner in which syntactic complexity is usually associated with other relevant qualities of inferred classifiers, such as generality or prior probability. If this thesis is accepted, then one of the key challenges facing machine learning is to understand these deeper qualities and to employ that understanding to place machine learning on a sounder theoretical footing. This paper offers a small contribution in this direction by demonstrating that minimization of surface syntactic complexity does not, of itself, in general maximize the predictive accuracy of inferred classifiers.
It is nonetheless important to realize that, the thrust of this paper notwithstanding, Occam's razor will often be a useful learning bias to employ, because there will frequently be good pragmatic reasons for preferring a simple hypothesis. A simple hypothesis will in general be easier to understand, communicate, and employ. A preference for simple hypotheses cannot be justified in terms of expected predictive accuracy, but it may be justified on these pragmatic grounds.