In the fourteenth century, William of Occam stated that ``plurality should not be assumed without necessity''. This principle has since become known as Occam's razor. It was originally intended as a basis for determining one's ontology. However, in modern times it has been widely reinterpreted and adopted as an epistemological principle--a means of selecting between alternative theories as well as ontologies. Modern reinterpretations of Occam's razor are widely employed in classification learning, but the utility of the principle has been subject to widespread theoretical and experimental attack. This paper adds to that debate by providing further experimental evidence against the utility of the modern interpretation of Occam's razor. The evidence takes the form of a systematic procedure for adding non-redundant complexity to classifiers in a manner that is demonstrated to frequently improve predictive accuracy.
The modern interpretation of Occam's razor has been characterized as ``of two hypotheses H and H′, both of which explain E, the simpler is to be preferred'' [Good, 1977]. However, this does not specify what aspect of a theory should be measured for simplicity. Syntactic, semantic, epistemological, and pragmatic simplicity are all alternative criteria that can be and have been employed [Bunge, 1963]. In practice, the common use of Occam's razor in machine learning seeks to minimize surface syntactic complexity. It is this interpretation that this paper addresses.
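To make this concrete, one measure of surface syntactic complexity commonly applied to decision trees is simply the number of nodes the tree contains. The following is a minimal sketch of such a count; the DecisionNode representation is hypothetical and is not drawn from any particular learning system.

\begin{verbatim}
class DecisionNode:
    """A hypothetical decision-tree node: a leaf has no children,
    while an internal node has one child per outcome of its test."""
    def __init__(self, children=None):
        self.children = children or []

def node_count(tree):
    """One common syntactic complexity measure: total node count."""
    return 1 + sum(node_count(child) for child in tree.children)
\end{verbatim}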
Occam's razor is presumably applied in the expectation that its application will, in general, lead to some particular form of advantage. There is no widely accepted articulation of precisely how Occam's razor should be applied or of what advantages are to be expected from its application in classification learning. However, the literature does contain two statements that seem to capture at least one widely adopted approach to the principle. Blumer, Ehrenfeucht, Haussler, and Warmuth [1987] suggest that to wield Occam's razor is to adopt the goal of discovering ``the simplest hypothesis that is consistent with the sample data'' with the expectation that the simplest hypothesis will ``perform well on further observations taken from the same source''. Quinlan [1986] states
``Given a choice between two decision trees, each of which is correct over the training set, it seems sensible to prefer the simpler one on the grounds that it is more likely to capture structure inherent in the problem. The simpler tree would therefore be expected to classify correctly more objects outside the training set.''

While these statements would not necessarily be accepted by all proponents of Occam's razor, they capture the form of Occam's razor that this paper seeks to address--a learning bias toward classifiers that minimize surface syntactic complexity in the expectation of maximizing predictive accuracy.
Both of the above statements of Occam's razor restrict themselves to classifiers that correctly classify all objects in a training set. Many modern machine learning systems incorporate learning biases that tolerate small levels of misclassification of the training data [for example, Clark and Niblett, 1989; Michalski, 1984; Quinlan, 1986; Quinlan, 1990]. In this context, and extending the scope of the definition beyond decision trees to classifiers in general, it seems reasonable to modify Quinlan's [1986] statement (above) to
Given a choice between two plausible classifiers that perform identically on the training set, the simpler classifier is expected to classify correctly more objects outside the training set.

This will be referred to as the Occam thesis.
The concept of identical performance on a training set could be defined in many different ways. It might be tempting to opt for a definition that requires identical error rates when two classifiers are applied to the training set. A less strict interpretation might allow two classifiers to have differing error rates so long as the difference is within some statistical confidence limit. However, to maximize the applicability of its results, this paper will adopt a very strict interpretation of identical performance--that for every object o in the training set, both classifiers provide the same classification for o.
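Under this strict interpretation, identical performance is a simple universally quantified condition, as the following sketch illustrates. Treating each classifier as a callable mapping an object to a class is an assumed interface, not that of any particular system.

\begin{verbatim}
def identical_performance(classifier_a, classifier_b, training_set):
    """True iff both classifiers assign the same class to every
    object in the training set (the strict interpretation
    adopted in this paper)."""
    return all(classifier_a(o) == classifier_b(o)
               for o in training_set)
\end{verbatim}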
It should be noted that the Occam thesis does not claim that, for any two classifiers with equal empirical support, the less complex will always have greater predictive accuracy on previously unseen objects. Rather, it claims that, more frequently than not, the less complex will have higher predictive accuracy.
This paper first examines some arguments for and against the Occam thesis. It then presents new empirical evidence against the thesis, acquired using a learning algorithm that post-processes decision trees learnt by C4.5. The post-processor was developed by rejecting the Occam thesis and instead attending to the assumption that similarity is predictive of class. It systematically adds complexity to decision trees without altering their performance on the training data, and this is demonstrated to increase predictive accuracy on previously unseen objects for a range of `real-world' learning tasks. This evidence is taken to be incompatible with the Occam thesis.
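To convey the flavour of such post-processing, the following sketch grows a single leaf of a decision tree into a subtree in a way that cannot alter behaviour on the training data: every branch reached by a training object retains the leaf's original class, while branches reached by no training object are free to predict a different class. This is an illustrative reconstruction only; the object representation and the guess_class heuristic (standing in for the similarity-based choice described above) are assumptions, not the actual post-processor.

\begin{verbatim}
def expand_leaf(leaf_class, attribute, attribute_values,
                training_subset, guess_class):
    """Split a leaf on `attribute`, returning a mapping from each
    attribute value to the class predicted by the new leaf for
    that branch. Classifications of training objects reaching
    the original leaf are unchanged."""
    # Values of `attribute` actually observed among the training
    # objects that reach this leaf.
    covered = {o[attribute] for o in training_subset}
    branches = {}
    for value in attribute_values:
        if value in covered:
            # Training objects reach this branch, so it must keep
            # the leaf's class to leave training performance
            # unaltered.
            branches[value] = leaf_class
        else:
            # No training object reaches this branch; the added
            # complexity may encode a different prediction for
            # unseen objects.
            branches[value] = guess_class(value)
    return branches
\end{verbatim}

Because the empty branches are never exercised by the training data, any classes assigned to them leave the tree's training-set behaviour, and hence its empirical support, strictly unchanged.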