Let us now return to the question of why windowing seems to work better with rule learning algorithms. We believe that its way of skewing the example distribution affects divide-and-conquer decision tree learners and separate-and-conquer rule learners differently. The difference lies in how a condition is selected. Typically, a separate-and-conquer rule learner selects the test that maximizes the number of covered positive examples while minimizing the number of negative examples that pass the test; it usually pays no attention to the examples that do not pass the test. A divide-and-conquer decision tree learner, on the other hand, tries to optimize for all outcomes of the test. For example, the decision tree learner C4.5 [53] selects a condition by minimizing the average entropy over all of its outcomes, while the original version of the rule learner CN2 [13] selects the test that minimizes the entropy over the examples that pass it.
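The two criteria can be contrasted in a short sketch. The helper names (`entropy`, `tree_score`, `rule_score`) and the example counts are our own illustration, not code from either system:

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos/neg examples."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def tree_score(pass_pos, pass_neg, fail_pos, fail_neg):
    """C4.5-style criterion: weighted average entropy over *all*
    outcomes of a binary test (lower is better)."""
    n_pass, n_fail = pass_pos + pass_neg, fail_pos + fail_neg
    n = n_pass + n_fail
    return (n_pass / n) * entropy(pass_pos, pass_neg) + \
           (n_fail / n) * entropy(fail_pos, fail_neg)

def rule_score(pass_pos, pass_neg):
    """CN2-style criterion: entropy of the passing examples only."""
    return entropy(pass_pos, pass_neg)

# A test that passes 8 positive / 1 negative example and fails 2 / 9.
base_tree = tree_score(8, 1, 2, 9)
base_rule = rule_score(8, 1)

# Adding five negatives that *fail* the test shifts the tree criterion...
new_tree = tree_score(8, 1, 2, 14)
# ...but leaves the rule criterion untouched.
new_rule = rule_score(8, 1)
```

Because `rule_score` looks only at the covered examples, examples added outside the test's coverage cannot change it, whereas `tree_score` averages over every outcome and therefore shifts with every added example.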
The consequence of this difference for the windowing process is that additional examples in the window of a decision tree learner will strongly influence the selection of tests at all levels of the tree. In particular, every example has an equal influence on the selection of the root attribute, so windowing's way of skewing the example distribution may cause significant changes in the learned tree. Separate-and-conquer rule learning algorithms, on the other hand, evaluate a rule using only the examples that it covers. As no examples are added for a rule that has already been learned correctly, the evaluation of this rule will not change when more examples are added to the window because of incorrect rules. The evaluation of some of its individual tests may change, because an incomplete rule can cover some of the uncovered positive examples or covered negative examples that were added in the last windowing iteration, but in general we believe this influence is far less significant than for decision tree learners, where each added example is certain to influence the evaluation of the tests at the root of the tree.
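The stability argument can be made concrete with a toy coverage check. Everything here (the `covers` and `evaluate` helpers, the shape/size attributes) is a hypothetical illustration, not the paper's code:

```python
def covers(rule, example):
    """A rule is a conjunction of attribute=value conditions (a dict)."""
    return all(example.get(attr) == val for attr, val in rule.items())

def evaluate(rule, window):
    """Count covered positive and negative examples -- the only
    statistics a separate-and-conquer learner uses for this rule."""
    pos = sum(1 for ex, label in window if label and covers(rule, ex))
    neg = sum(1 for ex, label in window if not label and covers(rule, ex))
    return pos, neg

# A rule that was already learned correctly in an earlier iteration.
rule = {"shape": "circle"}

window = [({"shape": "circle", "size": "big"}, True),
          ({"shape": "circle", "size": "small"}, True),
          ({"shape": "square", "size": "big"}, False)]

before = evaluate(rule, window)   # (2, 0)

# Windowing adds examples misclassified by *other* rules; none of them
# match shape=circle, so this rule's evaluation cannot change.
window += [({"shape": "square", "size": "small"}, True),
           ({"shape": "triangle", "size": "big"}, False)]

after = evaluate(rule, window)    # still (2, 0)
```

A decision tree learner, by contrast, would recompute its root-node statistics over the enlarged window, so both added examples would enter the evaluation of every candidate root test.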
More evidence for the sensitivity of ID3 to changes in the example distribution comes from experiments with C4.5 [53], where it turned out that modifying windowing so that the class distribution in the initial window is as uniform as possible produces better results. As discussed above, we believe this sensitivity is caused by the need of decision tree algorithms to optimize the class distribution in all successor nodes of an interior node. Different class distributions, on the other hand, have only a minor effect on separate-and-conquer learners, because they learn one rule at a time: adding uncovered positive examples to the current window will not alter the evaluation of rules that do not cover the new examples, but it may cause the selection of a new root node in decision tree learning. From these deliberations we hypothesize that, under windowing, rule learning algorithms are more stable than decision tree learning algorithms. This, in turn, can lead to better run-times, because the newly added examples do not interact with the parts of the theory that have already been learned well.
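A minimal sketch of the uniform-initial-window idea, assuming a simple stratified draw; the function `uniform_initial_window` is our own illustration, not the actual C4.5 implementation:

```python
import random
from collections import defaultdict

def uniform_initial_window(examples, size, rng=None):
    """Draw an initial window whose class distribution is as uniform as
    possible, by taking an equal share from every class (stratified
    sampling). Hypothetical illustration of the C4.5 modification."""
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for example, label in examples:
        by_class[label].append((example, label))
    per_class = size // len(by_class)
    window = []
    for members in by_class.values():
        rng.shuffle(members)
        window.extend(members[:per_class])
    return window

# A skewed data set: 90 examples of class 0, only 10 of class 1.
data = [({"x": i}, 0) for i in range(90)] + \
       [({"x": i}, 1) for i in range(10)]
window = uniform_initial_window(data, size=10)
# The window now holds 5 examples of each class despite the 9:1 skew.
```

For a decision tree learner this balanced start matters, because the class counts of the whole window feed directly into the entropy of every candidate root test; for a separate-and-conquer learner it matters less, since each rule is evaluated on the examples it covers.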