The first choice we have to make is to find a criterion for determining which rules should be added to the final theory and which rules should be further improved. A straightforward idea is to use some sort of significance test to determine whether a rule that has been learned from the current window is significant on the entire set of training examples. We have experimented with a variety of criteria known from the literature, but found them insufficient for our purposes. For example, it turned out that, at higher training set sizes, CN2's likelihood ratio significance test [13], which accepts a rule when the distribution of positive and negative examples covered by the rule differs significantly from the class distribution in the entire set, will deem any rule learned from the window significant.
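To make this failure mode concrete, the following Python sketch computes the likelihood ratio statistic described by Clark and Niblett [13]; the function name and the example figures are our own illustration, not code from CN2:

```python
import math

def likelihood_ratio(covered, prior):
    """Likelihood ratio statistic [13]: 2 * sum_i f_i * log(f_i / e_i),
    where f_i is the number of covered examples of class i and e_i is
    the count expected under the prior class distribution."""
    n = sum(covered)
    lrs = 0.0
    for f, p in zip(covered, prior):
        e = p * n  # expected count if the rule covered a random sample
        if f > 0:
            lrs += 2.0 * f * math.log(f / e)
    return lrs

# A rule covering 70% positives in a domain with a 50/50 class distribution:
prior = [0.5, 0.5]
for n in (20, 200, 2000):
    print(n, likelihood_ratio([0.7 * n, 0.3 * n], prior))
```

The statistic grows linearly with the number of covered examples, so for any fixed covered class distribution that deviates from the prior it will eventually exceed a fixed chi-square cutoff (3.84 at the 95% level with one degree of freedom), which is why the test accepts essentially every rule at higher training set sizes.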
Eventually, we settled on the following criterion: for each rule r learned from the current window, we compute two accuracy estimates, AccWin(r), which is determined using only examples from the current window, and AccTot(r), which is estimated on the entire training set. The criterion we use for detecting good rules consists of two parts:
$$Acc_{Tot}(r) > DA + SE(DA)$$
$$Acc_{Win}(r) - Acc_{Tot}(r) \leq \gamma \cdot \left( Acc_{Win}(r) - DA - SE(DA) \right)$$
where $DA$ is the default accuracy, i.e., the relative frequency of the majority class, and $SE(Acc) = \sqrt{Acc\,(1-Acc)/n}$ is the standard error of classification [7]. $\gamma \in [0,1]$ is a user-settable parameter that can be used to adjust the width of this range.
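A minimal Python sketch of how this two-part criterion can be evaluated, assuming the inequalities and the standard-error formula above; the helper names standard_error and accept_rule are illustrative, not part of the original algorithm description:

```python
import math

def standard_error(acc, n):
    """Standard error of classification [7] of an accuracy estimate
    obtained from n examples."""
    return math.sqrt(acc * (1.0 - acc) / n)

def accept_rule(acc_win, acc_tot, default_acc, n_total, gamma):
    """Decide whether rule r should be moved into the final theory.

    Part 1: AccTot(r) must exceed the default accuracy by more than
            its standard error (significance on the whole training set).
    Part 2: AccWin(r) may be more optimistic than AccTot(r) by at most
            a gamma-fraction of its margin over DA + SE(DA).
    """
    threshold = default_acc + standard_error(default_acc, n_total)
    significant = acc_tot > threshold                                 # part 1
    corresponds = acc_win - acc_tot <= gamma * (acc_win - threshold)  # part 2
    return significant and corresponds
```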
The parameter $\gamma$ determines the degree to which the estimates AccWin(r) and AccTot(r) have to correspond. A setting of $\gamma = 0$ requires that AccWin(r) = AccTot(r), which in general will only happen if r is consistent or has been learned from the entire training set. This is the recommended setting in noise-free domains. In noisy domains, values of $\gamma > 0$ have to be used, because the rules returned by the learning algorithm will typically be inconsistent on the training set. Note, however, that setting $\gamma$ too low in a noisy domain will not lead to over-fitting and a decrease in predictive accuracy: over-fitting is caused by overly optimistic estimates of a rule's accuracy, and the chance of accepting such rules decreases with the value of $\gamma$. Clearly, if the chance of a rule being accepted decreases, the run-time of the algorithm increases, because windowing has to perform more iterations. The extreme case, $\gamma = 0$, causes the window to grow until it contains all examples; the rule is then accepted by the second criterion because the examples used for estimating AccWin(r) and AccTot(r) are basically the same. Typical settings in noisy domains are intermediate values of $\gamma$. A setting of $\gamma = 1$ will move all rules that have survived the first criterion into the final rule set. Inasmuch as the learner is sufficiently noise-tolerant and the initial example set is sufficiently large, the algorithm implements learning from a random subsample if $\gamma = 1$.
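Continuing the sketch above with hypothetical numbers, the extreme settings of $\gamma$ behave as discussed:

```python
# The window estimate is more optimistic than the estimate on all examples:
acc_win, acc_tot, da, n_total = 0.95, 0.80, 0.60, 10000
# DA + SE(DA) = 0.60 + sqrt(0.6 * 0.4 / 10000), roughly 0.605

accept_rule(acc_win, acc_tot, da, n_total, gamma=0.0)  # False: the estimates must agree
accept_rule(acc_win, acc_tot, da, n_total, gamma=0.3)  # False: 0.15 > 0.3 * 0.345
accept_rule(acc_win, acc_tot, da, n_total, gamma=1.0)  # True: part 1 alone decides
```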