... iteration.1
Quinlan does not explicitly specify how this case should be handled, but we consider this a sensible way of handling it.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... examples.2
Such linear growth was already observed by [50] in some early experiments with windowing.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... DOS3
Dull Old Separate-and-conquer.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... illegal.4
Note, however, that with the specified features not all distinct positions are mapped onto distinct feature vectors, as has been pointed out by [49].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... data.5
This is even worse in the usual representation of this problem [46], where the background knowledge provides a less_than predicate instead of the between relation. In this representation, which is used in all our experiments, rules 6 and 7 have to be reformulated using the less_than predicate, which is only possible if each of them is replaced by two new rules.
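To see why two rules are needed, read between(X,Y,Z) as Y lying strictly between X and Z in either direction (our reading; the formulation below is illustrative and not copied from [46]). The relation then unfolds into a disjunction of two conjunctions of less_than literals,
\[
  \mathit{between}(X,Y,Z) \;\equiv\;
  \bigl(\mathit{less\_than}(X,Y) \wedge \mathit{less\_than}(Y,Z)\bigr)
  \;\vee\;
  \bigl(\mathit{less\_than}(Z,Y) \wedge \mathit{less\_than}(Y,X)\bigr),
\]
so a single rule containing a between literal corresponds to two rules, one for each disjunct.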
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... outcomes,6
Actually, C4.5 maximizes information gain, which is computed by subtracting the average entropy from an information value that is constant for all considered tests.
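For reference, a sketch of this well-known computation (stated in its textbook form, not copied from this paper): if a test $T$ splits the example set $S$ into subsets $S_1, \ldots, S_k$, then
\[
  \mathit{gain}(S,T) \;=\; \mathit{info}(S) \;-\; \sum_{i=1}^{k} \frac{|S_i|}{|S|}\,\mathit{info}(S_i),
  \qquad
  \mathit{info}(S) \;=\; -\sum_{c} p_c \log_2 p_c,
\]
where $p_c$ is the relative frequency of class $c$ in $S$. The term $\mathit{info}(S)$ is the information value that is constant for all tests considered at a node, so maximizing the gain amounts to minimizing the weighted average entropy.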
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... results.7
This equal-frequency sub-sampling method was first described by [7] for dynamically drawing a subsample at each node of a decision tree, from which the best split for that node is determined. In C4.5 it is used to seed the initial window.
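For illustration, the following sketch shows one way such an equal-frequency subsample could be drawn (our own Python code; the function name, its arguments, and the handling of rare classes are assumptions, not Quinlan's implementation):
\begin{verbatim}
import random
from collections import defaultdict

def equal_frequency_sample(examples, size, label_of, rng=None):
    # Draw at most `size` examples such that all classes are
    # represented as evenly as possible.
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for ex in examples:
        by_class[label_of(ex)].append(ex)
    per_class = max(1, size // len(by_class))
    sample = []
    for exs in by_class.values():
        # a rare class simply contributes all of its examples
        sample.extend(rng.sample(exs, min(per_class, len(exs))))
    rng.shuffle(sample)
    return sample
\end{verbatim}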
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... algorithm.8
Measured in CPU seconds on a 110 MHz microSPARC running compiled Allegro Common Lisp code under SUN Unix 4.1.3.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...).9
We did not compute this measure for the experiments in noisy domains (section 5), because the way we perform the completeness check (section 5.2.2) invalidates it.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... measure.10
The measure could not be computed for the Shuttle domain, because this domain contains numerical attributes, which cannot be handled in a straightforward fashion.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... domain11
Note that this example set is only a subset of the Tic-Tac-Toe data set that has been studied by [59]. We did not have access to the full data set.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...[23].12
The sorting causes the slightly super-linear slope in the run-time curves of DOS.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... algorithms,13
Think, e.g., of Ken Thompson's 3 CD-ROMs of chess endgames, to which every competitive chess program has an interface. A compression of these endgames into a simple set of rules would certainly be appreciated by the computer chess industry [27].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... accuracy:14
Note that (4) differs from the version of [29] by using $SE(Acc_{Tot}(r))$ instead of $SE(Acc_{Win}(r))$ on the right-hand side. We have found this version to work somewhat more reliably.
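As a reminder (and an assumption on our part, since the precise definition belongs to the main text), the standard error of an estimated rule accuracy is commonly obtained from the binomial estimate
\[
  SE(Acc(r)) \;=\; \sqrt{\frac{Acc(r)\,\bigl(1 - Acc(r)\bigr)}{n(r)}},
\]
where $n(r)$ is the number of examples from which $Acc(r)$ was estimated: all examples for $Acc_{Tot}(r)$, but only the window for $Acc_{Win}(r)$.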
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... WIN-RIP-NT.15
NT, of course, stands for Noise-Tolerant. Special thanks to the reviewer who suggested this notation.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...$\alpha$.16
In a new domain, we would advise trying $\alpha = 1.0$ and $\alpha = 0.5$, in that order.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... space.17
In representation languages that extend flat feature vectors, such as first-order logic, the complexity of a learning problem also depends crucially on the average cost of matching an instance with a rule. [54] demonstrate a technique for reducing these potentially exponential costs via sub-sampling.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.