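We assume that an observation is a vector of nominal
values, $V_{ij}$, along distinct variables, $A_i$.
A measure of
category utility [Gluck & Corter,
1985; Corter & Gluck, 1992],
$$CU(C_k) = P(C_k) \sum_i \sum_j \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right],$$
and/or variants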
have been used extensively by a system known as COBWEB
[Fisher, 1987a]
and many related systems [Gennari,
Langley & Fisher, 1989; McKusick &
Thompson, 1990; Iba & Gennari,
1991; McKusick &
Langley, 1991; Reich & Fenves,
1991; Biswas, Weinberg & Li,
1994; De Alte Da Veiga, 1994;
Kilander, 1994; Ketterlin, Gancarski & Korczak,
1995].
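This measure rewards clusters, $C_k$, that increase the
predictability of variable values within $C_k$
(i.e., $P(A_i = V_{ij} \mid C_k)$, the probability that variable $A_i$ takes value $V_{ij}$ in $C_k$)
relative to
their predictability in the population as a whole (i.e.,
$P(A_i = V_{ij})$).
By favoring clusters that increase predictability (i.e.,
$P(A_i = V_{ij} \mid C_k) > P(A_i = V_{ij})$), we also necessarily
favor clusters that increase variable value predictiveness
(i.e.,
$P(C_k \mid A_i = V_{ij}) > P(C_k)$).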
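The step from predictability to predictiveness follows from Bayes' rule. A short derivation, writing $C_k$ for a cluster and $A_i = V_{ij}$ for a variable taking a particular value, and assuming $P(C_k) > 0$ and $P(A_i = V_{ij}) > 0$:

```latex
\begin{align*}
P(C_k \mid A_i = V_{ij})
  &= \frac{P(A_i = V_{ij} \mid C_k)\, P(C_k)}{P(A_i = V_{ij})}
  && \text{(Bayes' rule)} \\[4pt]
\frac{P(C_k \mid A_i = V_{ij})}{P(C_k)}
  &= \frac{P(A_i = V_{ij} \mid C_k)}{P(A_i = V_{ij})}
  && \text{(divide both sides by } P(C_k)\text{)}
\end{align*}
% Both ratios exceed 1 together, so
% P(A_i = V_{ij} \mid C_k) > P(A_i = V_{ij})
%   \iff  P(C_k \mid A_i = V_{ij}) > P(C_k).
```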
Clusters for which many variable values are predictable are cohesive. Increases in predictability stem from the shared variable values of observations within a cluster. A cluster is well-separated or decoupled from other clusters if many variable values are predictive of the cluster. High predictiveness stems from the differences in the variable values shared by members of one cluster from those shared by observations of another cluster. A general principle of clustering is to increase the similarity of observations within clusters (i.e., cohesion) and to decrease the similarity of observations across clusters (i.e., coupling).
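As an illustration of the two notions, the following minimal sketch estimates predictability and predictiveness from simple counts. The function names and the toy observations are hypothetical, chosen only for this example, and each cluster is assumed to be a subset of the population.

```python
def predictability(cluster, variable, value):
    """Estimate P(variable = value | cluster) as a relative frequency."""
    return sum(1 for obs in cluster if obs[variable] == value) / len(cluster)


def predictiveness(cluster, population, variable, value):
    """Estimate P(cluster | variable = value) as a relative frequency."""
    with_value = sum(1 for obs in population if obs[variable] == value)
    in_cluster = sum(1 for obs in cluster if obs[variable] == value)
    # If the value never occurs in the population, report 0.0 by convention.
    return in_cluster / with_value if with_value else 0.0


# Hypothetical toy data: each observation is a tuple (color, shape).
c1 = [("red", "round"), ("red", "round"), ("red", "square")]
c2 = [("blue", "square"), ("blue", "square"), ("blue", "round")]
population = c1 + c2

# color = red is fully predictable within c1 and fully predictive of c1.
print(predictability(c1, 0, "red"), predictiveness(c1, population, 0, "red"))
# shape = round is less predictable within c1 and less predictive of c1.
print(predictability(c1, 1, "round"), predictiveness(c1, population, 1, "round"))
```

In this toy data the color variable makes the two clusters both cohesive and well separated, whereas the shape variable contributes less on both counts.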
Category utility is similar in form to the Gini Index, which has
been used in supervised systems that construct decision trees
[Mingers, 1989b; Weiss &
Kulikowski, 1991].
The Gini Index is typically intended to address the issue
of how well the values of a variable, $A_i$, predict a priori
known class labels in a supervised context. The summation over
Gini Indices reflected in CU addresses the extent to which a
cluster predicts the values of all the variables.
CU rewards clusters,
$C_k$, that most reduce
a collective impurity over all variables.
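One way to make the connection concrete is sketched below. It assumes the standard Gini impurity, $1 - \sum_j p_j^2$, and the usual form of category utility; the symbol $G$ is introduced here only for illustration.

```latex
\begin{align*}
G(A_i) &= 1 - \sum_j P(A_i = V_{ij})^2,
\qquad
G(A_i \mid C_k) = 1 - \sum_j P(A_i = V_{ij} \mid C_k)^2 \\[4pt]
CU(C_k)
  &= P(C_k) \sum_i \sum_j
     \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right] \\
  &= P(C_k) \sum_i \left[ G(A_i) - G(A_i \mid C_k) \right]
\end{align*}
```

Under these assumptions, CU is the cluster's probability times the total reduction in impurity, summed over variables, that results from knowing an observation belongs to $C_k$.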
In Fisher's [1987a] COBWEB system,
CU is used to measure the quality
of a partition of data,
$$PU(\{C_1, C_2, \ldots, C_N\}) = \sum_k CU(C_k) / N,$$
or the average category utility of clusters in the partition.
Sections 3.5 and 5.2 note some
nonoptimalities with this measure of partition quality,
and suggest some alternatives. Nonetheless, we use it here:
the measure is commonly used, this gives us an opportunity to note
its problems, and none of the techniques that we describe
is tied to it.
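The following minimal sketch computes the partition quality measure as the average category utility of the clusters in a partition. It assumes the usual form of category utility, $P(C_k) \sum_i \sum_j [P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2]$, with probabilities estimated as relative frequencies. The function names and toy data are hypothetical and are not taken from COBWEB.

```python
from collections import Counter


def value_probs(observations, n_vars):
    """For each variable i, map each observed value to its relative frequency."""
    total = len(observations)
    return [
        {v: c / total for v, c in Counter(obs[i] for obs in observations).items()}
        for i in range(n_vars)
    ]


def category_utility(cluster, population):
    """CU of one cluster: P(C_k) times the summed gain in squared probabilities."""
    n_vars = len(population[0])
    within = value_probs(cluster, n_vars)
    overall = value_probs(population, n_vars)
    gain = sum(
        sum(p * p for p in within[i].values())
        - sum(p * p for p in overall[i].values())
        for i in range(n_vars)
    )
    return len(cluster) / len(population) * gain


def partition_utility(partition):
    """Average CU over the clusters of a partition (a list of clusters)."""
    population = [obs for cluster in partition for obs in cluster]
    return sum(category_utility(c, population) for c in partition) / len(partition)


# Hypothetical toy data: each observation is a tuple (color, shape).
good = [
    [("red", "round"), ("red", "round"), ("red", "square")],
    [("blue", "square"), ("blue", "square"), ("blue", "round")],
]
# The same six observations, split so that neither variable separates the clusters well.
poor = [
    [("red", "round"), ("blue", "square"), ("red", "square")],
    [("red", "round"), ("blue", "square"), ("blue", "round")],
]

print(partition_utility(good))  # about 0.278 (5/18)
print(partition_utility(poor))  # about 0.056 (1/18)
```

The partition that groups observations by color scores higher because color becomes fully predictable within each cluster. A partition consisting of a single cluster scores zero under this sketch, since the within-cluster and population frequencies coincide.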
JAIR, 4