To test the validation procedure's promise for simplifying hierarchical clusterings, each of the data sets used in the optimization experiments of Section 3.4 was randomly divided into three subsets: 40% for training, 40% for validation, and 20% for test. A hierarchical clustering is first constructed by sorting the training set in randomized order. This hierarchy is then optimized using iterative hierarchical redistribution. Actually, because of cost considerations, a hierarchy is constructed several levels at a time. The hierarchy is initially constructed to height 4, where the deepest level is the set of training observations. This hierarchy is optimized using hierarchical redistribution. Clusters at the bottommost level (i.e., 4) are removed as children of level 3 clusters, and the subset of training observations covered by each cluster of level 3 is hierarchically sorted to a height 4 tree and optimized. The roots of these subordinate clusterings are then substituted for each cluster at depth 3 in the original tree. The process is repeated on clusters at level 3 of the subordinate trees and subsequent trees thereafter until no further decomposition is possible. The final hierarchy, which is not of constant-bounded height, decomposes the entire training set to singleton clusters, each containing a single training observation. The validation set is then used to identify variable frontiers within the entire hierarchy.
During testing of a validated clustering, each variable of each test observation is masked in turn; when classification reaches a cluster on the frontier of the masked variable, the most probable value is predicted as the value of the observation; the proportion of correct predictions for each variable over the test set is recorded. For comparative purposes, we also use the test set to evaluate predictions stemming from the unvalidated tree, where all variable predictions are made at the leaves (singleton clusters) of this tree.
Table 6: Characteristics of optimized clusterings
before and after validation.
Average and standard deviations over 20 trials.
Table 6 shows results from 20 experimental trials using optimized, unvalidated and validated clusterings generated as just described from random orderings. The first row of each domain lists the average number of leaves (over the 20 experimental trials) for the unvalidated and validated trees. The unvalidated clusterings decompose the training data to single-observation leaves -- the number of leaves equals the number of training observations. In the validated clustering, we assume that clusters are pruned if they lie below the frontiers of all variables. Thus, a leaf in a validated clustering is a cluster (in the original clustering) that is on the frontier of at least one variable, and none of its descendent clusters (in the original clustering) are on the frontier of any variable. For example, if we assume that the tree of Figure 4 covers data described only in terms of variables , , and , then the number of leaves in this validated clustering would be 7.
Prediction accuracies in the second row of each domain entry are the mean proportion of correct predictions over all variables over 20 trials. Predictions were generated at leaves (singleton clusters) in the unvalidated hierarchical clusterings and at appropriate variable frontiers in the validated clusterings. In all cases, validation/pruning substantially reduces clustering size and it does not diminish accuracy.
The number of leaves in the validated case, as we have described it, assumes a very coarse pruning strategy; it will not necessarily discriminate a clustering with uniformly deep frontiers from one with a single or very few deep frontiers. We have suggested that more flexible pruning or `attention' strategies might be possible when an analyst is focusing on one or a few variables. We will not specify such strategies, but the statistic given in row 3 of each domain entry suggests that clusterings can be rendered in considerably simpler forms when an analyst's attention is selective. Row 3 is the average number of frontier clusters per variable. This is an average over all variables and all experimental trials. In the validated tree of Figure 4 the average frontier size is .
Intuitively, a frontier cluster of a variable is a `leaf' as far as prediction of that variable is concerned. The `frontier size' for unvalidated clusterings is simply given by the number of leaves, since this is where all variable predictions are made in the unvalidated case. Our results suggest that when attention is selective, a partial clustering that captures the structure involving selected variables can be presented to an analyst in very simplified form.
JAIR, 4