Michalski and Stepp's [Michalski & Stepp, 1983a] CLUSTER/2 seeks the optimal K-partitioning of data. The first step selects K random `seed' observations from the data. These seeds are `attractors' around which the K clusters are grown from the remaining data. Since seed selection can greatly impact clustering quality, CLUSTER/2 selects K new seeds that are `centroids' of the K initial clusters. Clustering is repeated with these new seeds. This process iterates until there is no further improvement in the quality of generated clusterings.
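In outline, this seed-refinement loop can be sketched as follows (a minimal Python sketch of the control flow only; grow_clusters, centroid, and quality are hypothetical placeholders for CLUSTER/2's actual cluster-growing and evaluation steps, which are not reproduced here).

```python
import random

def refine_seeds(data, k, grow_clusters, centroid, quality, max_iters=100):
    """Sketch of a CLUSTER/2-style seed-refinement loop (control flow only).

    grow_clusters, centroid, and quality are hypothetical stand-ins for
    CLUSTER/2's actual cluster construction and evaluation criteria.
    """
    seeds = random.sample(data, k)               # K random seed observations
    best_clusters, best_score = None, float("-inf")
    for _ in range(max_iters):
        clusters = grow_clusters(data, seeds)    # grow K clusters around the seeds
        score = quality(clusters)
        if score <= best_score:                  # stop: no further improvement
            break
        best_clusters, best_score = clusters, score
        seeds = [centroid(c) for c in clusters]  # new seeds: cluster centroids
    return best_clusters
```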
Ordering effects in sorting are related to the effects that arise from differing seed selections in fixed-K clustering: the initial observations in an ordering establish initial clusters that `attract' the remaining observations. In general, sorting performs better when the initial observations come from diverse areas of the observation-description space, since this facilitates the establishment of initial clusters that reflect these different areas. Fisher, Xu, and Zard [1992] showed that ordering data so that consecutive observations were dissimilar by Euclidean distance led to good clusterings. Biswas et al. [1994] adapted this technique in their ITERATE system with similar results. In both cases, sorting used the PU score described previously.
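The exact ordering procedure of Fisher, Xu, and Zard is not reproduced here; one simple greedy scheme in the same spirit picks, at each step, the remaining observation that is farthest (by Euclidean distance) from the observation just placed. A minimal sketch:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_order(observations):
    """Greedily order observations so that consecutive items are far apart.

    Illustration only: each next observation is the remaining one farthest
    (by Euclidean distance) from the one just placed; the exact procedure
    of Fisher, Xu, and Zard (1992) may differ.
    """
    remaining = list(observations)
    ordered = [remaining.pop(0)]      # start from an arbitrary observation
    while remaining:
        last = ordered[-1]
        nxt = max(remaining, key=lambda o: euclidean(o, last))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered
```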
This procedure presumes that observations that appear dissimilar by Euclidean distance tend to be placed in different clusters by the objective function. Taking the lead from CLUSTER/2, a measure-independent idea first sorts using a random data ordering, then extracts a biased `dissimilarity' ordering from the hierarchical clustering, and sorts again. The function of Table 1 outlines the reordering procedure. It recursively extracts lists of observations, one per cluster, from the most probable (i.e., largest) cluster to the least probable, and then merges (i.e., interleaves) these lists before exiting each recursive call: at each step, an element from the most probable cluster is placed first, followed by an element of the second most probable, and so forth. Whatever measure guides clustering, observations in differing clusters have been judged dissimilar by that measure. Thus, this measure-independent procedure returns a measure-dependent dissimilarity ordering by placing observations from different clusters back-to-back.
Table 1: A procedure for creating a `dissimilarity' ordering of data.
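The table's pseudocode is not reproduced here; the following is a minimal sketch of the procedure as described above, assuming each node of the hierarchical clustering exposes a children list and each leaf holds a single observation (both attribute names are hypothetical).

```python
def reorder(cluster):
    """Extract a `dissimilarity' ordering from a hierarchical clustering.

    Sketch of the procedure described in the text, assuming each node has a
    (possibly empty) .children list and each leaf carries a single
    .observation; both attribute names are hypothetical.
    """
    if not cluster.children:                     # leaf: a single observation
        return [cluster.observation]
    # recursively order each child, then list the children from the most
    # probable (largest) cluster to the least probable
    lists = sorted((reorder(c) for c in cluster.children), key=len, reverse=True)
    # interleave: take one observation from each child list in turn, so that
    # consecutive observations come from different clusters
    merged = []
    for i in range(max(len(lst) for lst in lists)):
        for lst in lists:
            if i < len(lst):
                merged.append(lst[i])
    return merged
```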
Following initial sorting, we extract a dissimilarity ordering, recluster, and iterate until there is no further improvement in clustering quality.
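This reorder-and-recluster loop can be sketched as follows; sort, quality, and reorder are hypothetical placeholders for the hierarchical sorting procedure, the PU-based quality score, and the Table 1 reordering, respectively.

```python
import random

def iterate_reorderings(data, sort, quality, reorder):
    """Recluster on successive dissimilarity orderings until quality stops improving.

    sort, quality, and reorder are hypothetical stand-ins; only the control
    loop described in the text is sketched.
    """
    ordering = random.sample(data, len(data))   # initial random ordering
    best_tree, best_score = None, float("-inf")
    while True:
        tree = sort(ordering)                   # cluster observations in this order
        score = quality(tree)
        if score <= best_score:                 # no further improvement: stop
            return best_tree
        best_tree, best_score = tree, score
        ordering = reorder(tree)                # extract the dissimilarity ordering
```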