Several authors [Fisher, 1987a; Cheeseman et al., 1988; Anderson & Matessa, 1991] motivate clustering as a means of improving performance on a task akin to pattern completion, where the error rate over completed patterns can be used to `externally' judge the utility of a clustering. Given a probabilistic categorization tree of the type we have assumed, a new observation with an unknown value for a variable can be classified down the hierarchy using a small variation on the hierarchical sorting procedure described earlier. Classification is terminated at a selected node (cluster) along the classification path, and the variable value of highest probability at that cluster is predicted as the unknown variable value of the new observation. Naively, classification might always terminate at a leaf (i.e., an observation), and the leaf's value along the specified variable would be predicted as the variable value of the new observation. However, a variable may be better predicted at some internal node along the classification path; to identify where, we use a simple resampling strategy known as holdout [Weiss & Kulikowski, 1991]. The resulting ideal-prediction frontiers for each variable also suggest a pruning strategy for hierarchical clusterings.
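As a concrete illustration, the sketch below (a minimal Python rendering, not the implementation used in our experiments; the `ClusterNode` structure and the simple frequency-based `match_score` sorting criterion are simplifying assumptions) classifies an observation down the hierarchy with one variable masked and predicts that variable's most probable value at a chosen cluster on the classification path.

```python
from dataclasses import dataclass, field

@dataclass
class ClusterNode:
    counts: dict                      # counts[variable][value] -> frequency within this cluster
    size: int = 0                     # number of observations summarized by this cluster
    children: list = field(default_factory=list)

    def most_probable_value(self, variable):
        # The most frequent (hence most probable) value of `variable` in this cluster.
        return max(self.counts[variable], key=self.counts[variable].get)

def match_score(cluster, observation, masked_variable):
    # Crude match between an observation and a cluster summary over the
    # unmasked variables; stands in for the probabilistic sorting criterion.
    score = 0.0
    for var, val in observation.items():
        if var == masked_variable:
            continue
        score += cluster.counts.get(var, {}).get(val, 0) / max(cluster.size, 1)
    return score

def classification_path(root, observation, masked_variable):
    """Sort the observation down the hierarchy, ignoring the masked variable,
    and return the sequence of clusters visited (root to leaf)."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: match_score(c, observation, masked_variable))
        path.append(node)
    return path

def predict(path, stop_index, masked_variable):
    # Terminate classification at the selected cluster and predict the
    # masked variable's most probable value there.
    return path[stop_index].most_probable_value(masked_variable)
```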
Given a hierarchical clustering and a validation set of observations, the validation set is used to identify an appropriate frontier of clusters for prediction of each variable. Figure 4 illustrates that the preferred frontiers of any two variables may differ, and clusters within a frontier may be at different depths. For each variable, the objects of the validation set are each classified through the hierarchical clustering with that variable's value `masked' for purposes of classification; at each cluster encountered during classification, the observation's value for the variable is compared to the most probable value for the variable at the cluster; if they are the same, then the observation's value would have been correctly predicted at that cluster. A count of all such correct predictions is maintained for each variable at each cluster. Following classification of all observations in the validation set for all variables, a preferred frontier for each variable is identified that maximizes the number of correct counts for that variable. This is a simple, bottom-up procedure that ensures that the number of correct counts at a node on the variable's frontier is greater than or equal to the sum of correct counts for the variable over each set of mutually exclusive, collectively exhaustive descendants of the node.
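The sketch below illustrates this counting and bottom-up frontier selection for one variable, reusing the classification helpers from the previous sketch; the per-cluster `correct` counter and the function names are illustrative assumptions rather than the paper's code.

```python
def tally_correct(root, validation_set, variable, all_clusters):
    """Count, per cluster, how often `variable` would have been correctly
    predicted for validation observations sorted with that variable masked."""
    for cluster in all_clusters:
        cluster.correct = 0
    for obs in validation_set:
        for cluster in classification_path(root, obs, variable):
            if cluster.most_probable_value(variable) == obs[variable]:
                cluster.correct += 1

def best_frontier(node):
    """Return (frontier, total_correct): a set of mutually exclusive,
    collectively exhaustive clusters maximizing correct predictions."""
    if not node.children:
        return [node], node.correct
    child_results = [best_frontier(c) for c in node.children]
    below_frontier = [n for frontier, _ in child_results for n in frontier]
    below_correct = sum(total for _, total in child_results)
    # Keep this node on the frontier only if predicting here is at least as
    # accurate as the best frontier among its descendants.
    if node.correct >= below_correct:
        return [node], node.correct
    return below_frontier, below_correct
```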
Figure 4: Frontiers for three variables in a hypothetical clustering. From Fisher (1995). Figure reproduced with permission from Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Copyright © 1995 American Association for Artificial Intelligence.
Variable-specific frontiers enable a number of pruning strategies. For example, a node that lies below the frontier of every variable offers no apparent advantage in terms of pattern-completion error rate; such a node probably reflects no meaningful structure, and it (and its descendants) may be pruned. However, if an analyst is focusing attention on a subset of the variables, then frontiers might be exploited more flexibly for pruning.
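A sketch of the first pruning rule follows, assuming `frontiers` maps each variable to the set of clusters on its preferred frontier (e.g., as returned by `best_frontier` above); the top-down traversal is one illustrative realization, not the only possible one.

```python
def prune_below_all_frontiers(node, frontiers, remaining=None):
    """Drop subtrees that lie below every variable's preferred frontier.
    `frontiers` maps each variable to the set of clusters on its frontier."""
    if remaining is None:
        remaining = set(frontiers)        # variables whose frontier has not yet been crossed
    # A variable whose frontier passes through this cluster never needs
    # anything deeper than this node for prediction.
    remaining = {v for v in remaining if node not in frontiers[v]}
    if not remaining:
        node.children = []                # everything beneath lies below every frontier
        return
    for child in node.children:
        prune_below_all_frontiers(child, frontiers, remaining)
```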