While the above process could be applied to both continuous and discrete attributes, the current implementation addresses only continuous attributes.
The post-processor operates by examining each leaf l of the tree in turn.
For each l, each attribute a is considered in turn. For each a, all
possible thresholds below and above the region of the instance space occupied
by objects at l are explored. First, the minimum (min) and maximum
(max) are determined for values of a that are possible for objects that
can reach l. If l lies below the
<= branch of a split on a then the threshold for that split
provides an upper limit (max) on values for a at l. If it lies below a >
branch, the threshold provides a lower limit (min). Where the node does not
lie below a <= branch, . Where the node does not lie
below a > branch,
. Only objects from the training set that
have values of a within the range min..max are considered in the following
operations.
For each value observed in the training set for the attribute within the allowable range but outside the actual range of values of a for objects at l, the evidence is evaluated in support of reclassifying the region above or below that threshold. The level of support for a given threshold is evaluated using a Laplacian accuracy estimate [Niblett and Bratko, 1986]. Because each leaf relates to a binary classification (an object belongs to the class in question or does not), the binary form of Laplace is used. For threshold t on attribute a at leaf l, the evidence in support of labeling the partition below t with class n is the maximum value for an ancestor node x of l for the formula
where T is the number of objects at x for which min < a <= t; and P is the number of those objects which belong to class n.
The evidence in support of labeling a partition above a threshold is calculated identically with the exception that the objects for which t < a <= max are instead considered.
If the maximum evidence for a new labeling exceeds the evidence for the current labeling of the region, a new branch is added for the appropriate threshold creating a new leaf node labeled with the appropriate class.
In addition to evidence in favor of the current labeling gathered as above, further evidence in support of the current labeling of a region is calculated using the Laplace accuracy estimate considering the objects at the leaf, where T is the number of objects at the leaf and P is the number of those objects that belong to the class with which the node is labeled.
This approach ensures that all new partitions define true regions. That is, for any attribute a and value v it is not possible to partition on a<= v unless it is possible for both objects from the domain with values of a greater than v and objects with values less than or equal to v to reach the node being partitioned (even though no objects from the training set will fall within the new partition). In particular, this ensures that the new cuts are not simple duplications of existing cuts at ancestors to the current node. Thus, every modification adds non-redundant complexity to the tree.
This algorithm is presented in Figure 2. It has been implemented as a modification to C4.5 release 6, called C4.5X. The source code for these modifications is available as an on-line appendix to this paper.
Figure 2: C4.5X post-processing algorithm
In C4.5X, where
multiple sets of values equally satisfy the specified constraints and
maximize the Laplace function, values of and
that are
deeper in the tree are selected over those closer to the root and, at
a single node, preference for values of
and
depends
upon the order of attributes in the definition of the
data and preference for values of
and
is dependent upon data order. These selection strategies
are a side effect of the implementation of the system. There is no
reason to believe that the experimental results would differ in general
if other strategies were used to select between competing constraints.
By default, C4.5 develops two decision trees each time that it is run, an unpruned and a pruned (simplified) decision tree. C4.5X produces post-processed versions of both of these trees.