The Post-processor

Next: Evaluation Up: New Experimental Evidence Against Previous: Theoretical Basis for the

The Post-processor

While the above process could be applied to both continuous and discrete attributes, the current implementation addresses only continuous attributes.

The post-processor operates by examining each leaf l of the tree in turn. For each l, each attribute a is considered in turn. For each a, all possible thresholds below and above the region of the instance space occupied by objects at l are explored. First, the minimum (min) and maximum (max) are determined for values of a that are possible for objects that can reach l. If l lies below the <= branch of a split on a then the threshold for that split provides an upper limit (max) on values for a at l. If it lies below a > branch, the threshold provides a lower limit (min). Where the node does not lie below a <= branch, . Where the node does not lie below a > branch, . Only objects from the training set that have values of a within the range min..max are considered in the following operations.

For each value observed in the training set for the attribute within the allowable range but outside the actual range of values of a for objects at l, the evidence is evaluated in support of reclassifying the region above or below that threshold. The level of support for a given threshold is evaluated using a Laplacian accuracy estimate [Niblett and Bratko, 1986]. Because each leaf relates to a binary classification (an object belongs to the class in question or does not), the binary form of Laplace is used. For threshold t on attribute a at leaf l, the evidence in support of labeling the partition below t with class n is the maximum value for an ancestor node x of l for the formula

where T is the number of objects at x for which min < a <= t; and P is the number of those objects which belong to class n.

The evidence in support of labeling a partition above a threshold is calculated identically with the exception that the objects for which t < a <= max are instead considered.

If the maximum evidence for a new labeling exceeds the evidence for the current labeling of the region, a new branch is added for the appropriate threshold creating a new leaf node labeled with the appropriate class.

In addition to evidence in favor of the current labeling gathered as above, further evidence in support of the current labeling of a region is calculated using the Laplace accuracy estimate considering the objects at the leaf, where T is the number of objects at the leaf and P is the number of those objects that belong to the class with which the node is labeled.

This approach ensures that all new partitions define true regions. That is, for any attribute a and value v it is not possible to partition on a<= v unless it is possible for both objects from the domain with values of a greater than v and objects with values less than or equal to v to reach the node being partitioned (even though no objects from the training set will fall within the new partition). In particular, this ensures that the new cuts are not simple duplications of existing cuts at ancestors to the current node. Thus, every modification adds non-redundant complexity to the tree.

This algorithm is presented in Figure 2. It has been implemented as a modification to C4.5 release 6, called C4.5X. The source code for these modifications is available as an on-line appendix to this paper.

Figure 2: C4.5X post-processing algorithm

In C4.5X, where multiple sets of values equally satisfy the specified constraints and maximize the Laplace function, values of and that are deeper in the tree are selected over those closer to the root and, at a single node, preference for values of and depends upon the order of attributes in the definition of the data and preference for values of and is dependent upon data order. These selection strategies are a side effect of the implementation of the system. There is no reason to believe that the experimental results would differ in general if other strategies were used to select between competing constraints.

By default, C4.5 develops two decision trees each time that it is run, an unpruned and a pruned (simplified) decision tree. C4.5X produces post-processed versions of both of these trees.

Next: Evaluation Up: New Experimental Evidence Against Previous: Theoretical Basis for the

Geoff Webb
Mon Sep 9 12:13:30 EST 1996