A Closer Look at External Validation Criteria

Ideally, clustering quality as measured by the objective function should be well correlated with clustering utility as determined by a performance task: the higher the quality of a clustering as judged by the objective function, the greater the performance improvement (e.g., reduction of error rate), and the lower the quality, the less that performance improves. However, several authors [Fisher et al., 1992; Nevins, 1995; Devaney & Ram, 1993] have pointed out that PU scores do not seem well correlated with error rates. More precisely, hierarchical clusterings (constructed by hierarchical sorting) in which the top-level partition has a low PU score lead to roughly the same error rates as hierarchies in which the top-level partition has a high PU score, when variable-value predictions are made at leaves (singleton clusters). Apparently, even with poor partitions at each level as measured by PU, test observations are classified to the same or similar observations at the leaves of a hierarchical clustering. Pattern-completion error rate under these circumstances seems insufficient to discriminate between what we might otherwise consider good and poor clusterings.
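For concreteness, the pattern-completion measure just described can be sketched in code. This is a minimal illustration, not the experimental implementation; the node structure, the scoring function `score', and the leaf method `prediction' are hypothetical stand-ins.

    def classify_to_leaf(node, obs, score):
        """Descend the hierarchy, choosing at each level the child cluster
        that the scoring function prefers for this (partial) observation."""
        while node.children:
            node = max(node.children, key=lambda child: score(child, obs))
        return node

    def pattern_completion_error(root, test_set, masked_var, score):
        """Mask one variable per test observation, classify to a leaf, and
        predict the leaf's value for the masked variable."""
        errors = 0
        for obs in test_set:
            partial = {v: x for v, x in obs.items() if v != masked_var}
            leaf = classify_to_leaf(root, partial, score)
            if leaf.prediction(masked_var) != obs[masked_var]:
                errors += 1
        return errors / len(test_set)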

Our work on simplification in Section 4 suggests that, in addition to error rate, we might judge competing hierarchical clusterings based on simplicity or some similarly-intended criterion. Both error rate and simplicity are used to judge classifiers in supervised contexts. We have seen that holdout can be used to substantially `simplify' a hierarchical clustering. The question we now ask is whether hierarchical clusterings that have been optimized relative to PU can be simplified more substantially than unoptimized clusterings, with no degradation in pattern-completion error rate.

To answer this question we repeated the validation experiments of Section 4.2 under a second experimental condition: hierarchical clusterings were constructed from similarity orderings of the observations using hierarchical sorting. We saw in Section 3.4 that similarity orderings tend to result in clusterings judged poor by the PU function. We do not optimize these hierarchies using hierarchical optimization. Table 7 shows accuracies, numbers of leaves, and average frontier sizes for unoptimized hierarchies constructed from similarity orderings, both when they have been subjected to holdout-based validation and when they have not. These results are given under the heading `Unoptimized'. For convenience, we copy the results of Table 6 under the heading `Optimized'.


  

Table 7: Characteristics of unoptimized and optimized clusterings before and after validation. Averages and standard deviations over 20 trials.


As in the optimized case, identifying and exploiting variable frontiers in unoptimized clusterings appears to simplify a clustering substantially with no degradation in error rate. Of most interest here, however, is that optimized clusterings are simplified to a substantially greater extent than unoptimized clusterings with no degradation in error rate.

Thus far, we have focused on the criteria of error rate and simplicity, but in many applications our real interest in simplicity stems from a broader interest in minimizing the expected cost of exploiting a clustering during classification: we expect that simpler clusterings have lower expected classification costs. We can view the various distinctions between unvalidated/validated and unoptimized/optimized clusterings in terms of expected classification cost. Table 8 shows some additional data obtained from our experiments with validation. In particular, the table shows:

Leaves (L)
The mean number of leaves (over 20 trials) before and after validation (assuming the coarse pruning strategy described in Section 4.2) in both the optimized and unoptimized cases (copied from Table 7).
EPL
The mean total path length (over 20 trials). The total path length of an unvalidated tree, where each leaf corresponds to a single observation, is the sum of depths of leaves in the tree. In the case of a validated tree, where a leaf may cover multiple observations, the contribution of the leaf to the total path length is the depth of the leaf times the number of observations at that leaf.
Depth (D)
The average depth of a leaf in the tree, computed as $D = EPL/L$.
Breadth (B)
The average branching factor of the tree. Given that $L \approx B^D$, we have $B \approx L^{1/D}$ (equivalently, $B \approx m^{(\log_m L)/D}$ for any base $m$).

Cost (C)
The expected cost of classifying an observation from the root to a leaf, in terms of the number of nodes (clusters) examined during classification. At each level we examine each cluster and select the best, so cost is the product of the number of levels and the number of clusters per level: $C = D \times B$. (A short sketch after this list illustrates these derivations.)
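The derived quantities follow mechanically from the two measured ones. The sketch below, a hypothetical illustration assuming only the mean leaf count $L$ and mean total path length $EPL$, computes $D$, $B$, and $C$ as defined above.

    def cost_characteristics(leaves, epl):
        """Derive average depth (D), average branching factor (B), and
        expected classification cost (C) from a tree's leaf count (L)
        and total path length (EPL)."""
        depth = epl / leaves               # D = EPL / L
        breadth = leaves ** (1.0 / depth)  # B ~ L^(1/D), since L ~ B^D
        cost = depth * breadth             # C = D * B
        return depth, breadth, cost

    # Example: a balanced binary tree over 64 observations has 64 leaves,
    # each at depth 6, so EPL = 6 * 64 = 384.
    print(cost_characteristics(64, 384))   # (6.0, 2.0, 12.0)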


  

Table 8: Cost characteristics of unoptimized and optimized clusterings before and after validation. Averages and standard deviations over 20 trials. The derived characteristics (Depth, Breadth, and Cost) are computed from the mean values of `Leaves' and EPL.


Table 8 illustrates that the expected cost of classification is less for optimized clusterings than for unoptimized clusterings in both the unvalidated and validated cases. However, these results should be taken with a grain of salt, and not simply because they are estimated values. In particular, we have expressed cost in terms of the expected number of nodes that would need to be examined during classification. An implicit assumption is that the cost of examination is constant across nodes. In fact, the cost per examination is roughly constant (per domain) across nodes in our implementation and many others: at each node, all variables are examined. Consider that by this measure of cost, the least-cost (unvalidated) clustering is one that splits the observations in thirds at each node, thus forming a balanced ternary tree, regardless of the form of structure in the data.
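To see why ternary splits minimize this cost measure: with $L$ fixed and $L \approx B^D$, we have $D \approx \ln L / \ln B$, so $C \approx B \ln L / \ln B$, which is minimized at $B = e \approx 2.72$; among integer branching factors, $B = 3$ is cheapest. A quick numerical check (using a hypothetical $L$):

    import math

    L = 1000  # hypothetical number of observations (leaves)
    for B in range(2, 7):
        D = math.log(L) / math.log(B)  # depth implied by L ~ B^D
        print(B, round(B * D, 2))      # cost C = D * B
    # B=2: 19.93, B=3: 18.86, B=4: 19.93, B=5: 21.46, B=6: 23.13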

Of course, if such a tree does not reasonably capture structure in the data, then we might expect this to be reflected in error rate and/or post-validation simplicity. Nonetheless, there are probably better measures of cost available. In particular, Gennari [1989] observed that when classifying an observation, evaluating the objective function over a proper subset of variables is often sufficient to categorize the observation relative to the same cluster that would have been selected if evaluation had occurred over all variables. Under ideal circumstances, when clusters of a partition are well separated (decoupled), testing very few `critical' variables may be sufficient to advance classification.

Gennari implemented a focusing algorithm that sequentially evaluated the objective function over the variables, one additional variable at a time from most to least `critical', until a categorization with respect to one of the clusters could be made unambiguously. Using Gennari's procedure, examination cost is not constant across nodes. Carnes and Fisher [Fisher, Xu, Carnes, Reich, Fenves, Chen, Shiavi, Biswas & Weinberg, 1993] adapted Gennari's procedure to good effect in a diagnosis task, where the intent was to minimize the number of probes necessary to diagnose a fault. While Gennari offers a principled focusing strategy that can be used in conjunction with an objective function, the general idea of focusing on selected features during classification can be traced back to UNIMEM [Lebowitz, 1982; 1987] and CYRUS [Kolodner, 1983].
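A sketch of focusing in this spirit (not Gennari's exact procedure): variables are ranked most to least `critical', and scoring is extended one variable at a time until one cluster leads the runner-up decisively. The ranking, the partial scoring function `score_partial', and the fixed-margin stopping test below are hypothetical simplifications.

    def focus_classify(children, obs, ranked_vars, score_partial, margin=0.0):
        """Select a child cluster by scoring over a growing prefix of the
        variables, stopping once the best cluster's lead over the runner-up
        exceeds `margin` (a crude stand-in for an unambiguity test)."""
        used = []
        for var in ranked_vars:
            used.append(var)
            scores = sorted((score_partial(c, obs, used) for c in children),
                            reverse=True)
            if len(scores) < 2 or scores[0] - scores[1] > margin:
                break  # categorization is unambiguous; stop adding variables
        return max(children, key=lambda c: score_partial(c, obs, used))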

The results of Table 8 illustrate the form of an expected classification-cost analysis, but we might also have measured cost directly as time using a test set. In fact, comparisons between the time requirements of sorting in the random and similarity ordering conditions of Tables 4 and 5 suggest cost differences between good and poor clusterings in terms of time as well. Regardless of the form of analysis, however, it seems desirable to express branching factor and cost in terms of the number of variables that need to be tested assuming a focusing strategy such as Gennari's. It is likely that this will make better distinctions between clusterings.


