Consistency-Check

Next: Completeness-Check Up: Towards Noise-Tolerant Windowing Previous: Towards Noise-Tolerant Windowing

Consistency-Check

The first choice we have to make is to find a criterion for determining which rules should be added to the final theory, and which rules should be further improved. A straight-forward idea is to use some sort of significance test in order to determine whether a rule that has been learned from the current window is significant on the entire set of training examples. We have experimented with a variety of criteria known from the literature, but found that they are insufficient for our purposes. For example, it turned out that, at higher training set sizes, CN2's likelihood ratio significance test [13], which accepts a rule when the distribution of positive and negative examples covered by the rule differs significantly from the class distribution in the entire set, will deem any rule learned from the window as significant.

Eventually, we have settled for the following criterion: For each rule r learned from the current window we compute two accuracy estimates, AccWin(r) which is determined using only examples from the current window and AccTot(r) which is estimated on the entire training set. The criterion we use for detecting good rules consists of two parts:

1.

The AccWin(r) estimate has to be significantly above the default accuracy of the domain. This is ensured by requiring that

AccWin(r) - SE(AccWin(r)) > DA

(3)

where DA is the default accuracy, and $SE(p) = \sqrt{\frac{p(1-p)}{n}}$ is the standard error of classification [7].

2.

The accuracy of the learned rule on the window has to be within a certain interval of its overall accuracy:¹⁴

$\begin{displaymath}\vert AccWin(r) - AccTot(r)\vert \leq \alpha \times SE(AccTot(r)) \end{displaymath}$

(4)

$\alpha \geq 0$ is a user-settable parameter that can be used to adjust the width of this range.

The purpose of the first heuristic is to avoid rules with a bad classification performance (in particular it weeds out many rules that have been derived from too few training examples), while the second criterion aims at making sure that the accuracy estimates that were computed on the current window (and were used in the heuristics of the learning algorithm) are not too optimistic compared to the true accuracy of the rule, which is approximated by the accuracy measured on the entire training set.

The parameter $\alpha$ determines the degree to which the estimates AccWin(r) and AccTot(r) have to correspond. A setting of $\alpha = 0$ requires that AccWin(r) = AccTot(r), which in general will only be happen if r is consistent or has been learned from the entire training set. This is the recommended setting in noise-free domains. In noisy domains, values of $\alpha > 0$ have to be used because the rules returned from the learning algorithm will typically be inconsistent on the training set. Note, however, that a setting of $\alpha = 0$ in a noisy domain will not lead to over-fitting and a decrease in predictive accuracy because over-fitting is caused by too optimistic estimates of a rule's accuracy. The chance of accepting such rules decreases with the value of $\alpha$ . Clearly, if the chance of a rule being accepted decreases, the run-time of algorithm increases because windowing has to perform more iterations. The extreme case, $\alpha = 0$ , causes the window to grow until it contains all examples. Then the rule is accepted by the second criterion because the examples used for estimating AccWin(r)and AccTot(r) are basically the same. Typical settings in noisy domains are $\alpha = 0.5$ or $\alpha = 1.0$ . $\alpha = \infty$ will move all rules that have survived the first criterion into the final rule set. In as much as the learner is sufficiently noise-tolerant and the initial example size is sufficiently large, the algorithm implements learning from a random subsample if $\alpha = \infty$ .

Next: Completeness-Check Up: Towards Noise-Tolerant Windowing Previous: Towards Noise-Tolerant Windowing