Next: References Up: No Title Previous: Other Issues

Concluding Remarks

We have partitioned the search through the space of hierarchical clusterings into three phases. These phases, together with an opinion of their desirable characteristics from a data analysis standpoint, are (1) inexpensive generation of an initial clustering that suggests the form of structure in data (or its absence), (2) iterative optimization (perhaps in background) for clusterings of better quality, and (3) retrospective simplification of generated clusterings. We have evaluated three iterative optimization strategies that operate independent of objective function. All of these, to varying degrees, are inspired by previous research, but hierarchical redistribution appears novel as an iterative optimization technique for clustering; it also appears to do quite well.

Another novel aspect of this work is the use of resampling as a means of validating clusters and of simplifying hierarchical clusterings. The experiments of Section 5 indicate that optimized clusterings provide greater data compression than do unoptimized clusterings. This is not surprising, given that PU compresses data in some reasonable manner; whether it does so `optimally' though is another issue.

We have made several recommendations for further research.

We have suggested experiments with alternative objective functions, including Bayesian and MML measures, and some that are inspired by variable-selection measures of decision tree induction.
There may be cost and quality benefits to applying optimization strategies at intermittent points during hierarchical sorting.
The holdout method of identifying variable frontiers and pruning suggests a strategy akin to n-fold-cross validation that clusters over all the data, while still identifying variable frontiers and facilitating pruning.
Analyses of classification cost for purposes of external validation are probably best expressed in terms of the expected number of variables using a focusing method such as Gennari's.

In sum, this paper has proposed criteria for internal and external validation, and has made experimental comparisons between various approaches along these dimensions. Ideally, as researchers explore other objective functions, search control strategies, and pruning techniques, the same kind of experimental comparisons (particularly along external criteria such as error rate, simplicity, and classification cost) that are de rigueur in comparisons of supervised systems, will become more prominent in unsupervised contexts.

Next: References Up: No Title Previous: Other Issues

JAIR, 4
Douglas H. Fisher
Sat Mar 30 11:37:23 CST 1996