
2.2 The Subgroup Discovery Algorithm

The goal of the subgroup discovery algorithm SD, outlined in Figure 1, is to search for rules that maximize $q_{g} =
\frac{TP}{FP + g}$, where TP is the number of true positives, FP is the number of false positives, and g is a generalization parameter. High-quality rules cover many target class examples and few non-target class examples. The number of tolerated non-target class cases, relative to the number of covered target class cases, is determined by the parameter g. For low g ($g \leq 1$), induced rules have high specificity (a low false alarm rate), since covering even a single non-target class example is made relatively expensive. On the other hand, selecting a high g value (g > 10 for small domains) yields more general rules, which also cover a larger number of non-target class instances.
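To make the effect of g concrete, the following minimal sketch (with example counts invented purely for illustration) scores a specific and a general rule under a low and a high g value:

```python
def q_g(tp, fp, g):
    """Rule quality q_g = TP / (FP + g)."""
    return tp / (fp + g)

specific = (20, 0)     # covers 20 target-class examples, 0 non-target ones (invented counts)
general = (80, 10)     # covers 80 target-class examples, 10 non-target ones (invented counts)

for g in (0.5, 20):
    qs, qn = q_g(*specific, g), q_g(*general, g)
    print(f"g={g}: specific={qs:.2f}, general={qn:.2f}")
# g=0.5 prefers the specific rule (40.00 vs 7.62); g=20 prefers the general one (1.00 vs 2.67)
```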

[Figure 1 shows the pseudocode of Algorithm SD as a numbered sequence of steps (1)-(12); the listing ends with (11) $Beam \leftarrow New\_Beam$ and (12) end while. The individual steps are referenced in the description below.]
Figure 1: Heuristic beam search rule construction algorithm for subgroup discovery.

Varying the value of g enables the expert to guide subgroup discovery in the TP/FP space, in which FP (plotted on the X-axis) needs to be minimized and TP (plotted on the Y-axis) needs to be maximized. The TP/FP space is similar to the ROC (Receiver Operating Characteristic) space (Provost & Fawcett, 2001). The comparison of the ROC and TP/FP spaces and the $q_g$ heuristic are analyzed in detail in Sections 2.4 and 4, respectively.
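As a small illustration (all counts below are invented), a rule's position in TP/FP space maps to the corresponding ROC point simply by rescaling with the number of target examples P and non-target examples N:

```python
P, N = 100, 400                              # number of target / non-target examples (invented)
rules = {"R1": (60, 20), "R2": (90, 120)}    # rule name -> (TP, FP), invented counts

for name, (tp, fp) in rules.items():
    # ROC coordinates: false positive rate FP/N on X, true positive rate TP/P on Y
    print(f"{name}: TP/FP point (FP={fp}, TP={tp}) -> ROC point ({fp / N:.2f}, {tp / P:.2f})")
```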

Algorithm SD takes as its input the complete training set E and the feature set L, where features $l \in L$ are logical conditions constructed from the attribute values describing the examples in E. For discrete (categorical) attributes, features have the form $Attribute = value$ or $Attribute \neq value$; for numerical attributes, they have the form $Attribute > value$ or $Attribute < value$. To formalize feature construction, let values $v_{ix}$ ($x = 1..k_{ip}$) denote the $k_{ip}$ different values of attribute $A_i$ that appear in the positive examples and $w_{iy}$ ($y = 1..k_{in}$) the $k_{in}$ different values of $A_i$ appearing in the negative examples. The set of features L is constructed from these values.
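Purely as an illustration of the idea, the sketch below enumerates candidate features of the forms above from the values observed in positive and negative examples; the choice of midpoint thresholds for numerical attributes is an assumption of the sketch, not a statement of the exact construction used by the algorithm:

```python
from itertools import product

def candidate_features(attribute, pos_values, neg_values, numeric):
    """Sketch: build candidate features for one attribute from the values v_ix
    seen in positive examples and w_iy seen in negative examples."""
    features = []
    if not numeric:
        features += [(attribute, "=", v) for v in pos_values]    # Attribute = v_ix
        features += [(attribute, "!=", w) for w in neg_values]   # Attribute != w_iy
    else:
        # assumed here: thresholds halfway between a positive and a negative value
        for v, w in product(pos_values, neg_values):
            if v == w:
                continue
            mid = (v + w) / 2
            features.append((attribute, ">", mid) if v > w else (attribute, "<", mid))
    return features

print(candidate_features("colour", {"red"}, {"blue"}, numeric=False))
print(candidate_features("age", {30, 50}, {40}, numeric=True))
```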

There is no theoretical upper bound on the user-defined g parameter, but in practice it should not exceed the number of training examples. For instance, the suggested g values in the Data Mining Server range between 0.1 and 100, for analysing data sets of up to 250 examples. The choice of g should be adjusted both to the size of the data set and to the proportion of positive examples in it.

Algorithm SD has two additional parameters which are typically not adjusted by the user. The first is $min\_support$ (default value $\sqrt{P}/|E|$, where P is the number of target class examples in E), which indirectly defines the minimal number of target class examples that must be covered by every subgroup. The second is $beam\_width$ (default value 20), which defines the number of solutions kept in each iteration. The output of the algorithm is a set S of $beam\_width$ different rules with the highest $q_g$ values. The rules have the form of conjunctions of features from L.

The algorithm initializes all the rules in Beam and $New\_beam$ with empty rule conditions, and sets their quality values $q_g(i)$ to zero (step 1). Rule initialization is followed by a loop (steps 2-12) that terminates when, for all rules in the beam, it is no longer possible to further improve their quality. Rules can be improved only by conjunctively adding features from L. After the first iteration a rule condition consists of a single feature, after the second iteration of up to two features, and so forth. The search is systematic in the sense that in each iteration, for every rule in the beam (step 3), all features from L (step 4) are tested. For every new rule, constructed by conjunctively adding a feature to the rule body (step 5), the quality $q_g$ is computed (step 6). If the support of the new rule is greater than $min\_support$ and its quality $q_g$ is greater than the quality of the worst rule in $New\_beam$, that worst rule is replaced by the new rule. The rules in $New\_beam$ are then reordered according to their quality $q_g$. At the end of each iteration, $New\_beam$ is copied into Beam (step 11). When the algorithm terminates, the first rule in Beam is the rule with the maximum $q_g$.
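The following Python sketch mirrors the loop structure just described (step numbers appear in the comments). It is an illustrative re-implementation under assumed data structures, not the authors' code, and it omits the relevance and duplicate checks of steps 7-8:

```python
import math
from typing import Callable, Dict, List

Example = Dict[str, float]               # assumed representation: attribute -> value
Feature = Callable[[Example], bool]      # a feature is a Boolean test on an example

def sd(pos: List[Example], neg: List[Example], features: List[Feature],
       g: float = 1.0, beam_width: int = 20, min_support: float = None):
    """Sketch of the SD beam search; rules are tuples (conditions, TP set, FP set, q_g)."""
    if min_support is None:                        # default sqrt(P)/|E| from the text
        min_support = math.sqrt(len(pos)) / (len(pos) + len(neg))
    q = lambda tp, fp: len(tp) / (len(fp) + g)     # q_g = TP / (FP + g)

    # (1) beam_width copies of the empty rule, quality initialized to zero
    empty = ((), frozenset(range(len(pos))), frozenset(range(len(neg))), 0.0)
    beam = [empty] * beam_width
    improved = True
    while improved:                                # (2) loop until no rule improves
        improved = False
        new_beam = list(beam)
        for conds, tp, fp, _ in beam:              # (3) every rule in the beam ...
            for f in features:                     # (4) ... extended by every feature
                ntp = frozenset(i for i in tp if f(pos[i]))   # (5) coverage of the new rule
                nfp = frozenset(j for j in fp if f(neg[j]))
                nq = q(ntp, nfp)                   # (6) its quality q_g
                worst = min(range(beam_width), key=lambda k: new_beam[k][3])
                if (len(ntp) >= min_support * (len(pos) + len(neg))
                        and nq > new_beam[worst][3]):
                    new_beam[worst] = (conds + (f,), ntp, nfp, nq)  # replace the worst rule
                    improved = True
        beam = sorted(new_beam, key=lambda r: r[3], reverse=True)   # (11) Beam <- New_Beam
    return beam                                    # rules ordered by q_g, best first
```

A feature here is simply any Boolean test on an example, e.g. `lambda e: e["age"] > 41.5`, so the returned rules are conjunctions of such tests together with their coverage sets and quality.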

A necessary condition (checked in step 7) for a rule to be included in $New\_beam$ is that it must be relevant. The new rule is irrelevant if there exists a rule R in $New\_beam$ such that the true positives of the new rule are a subset of the true positives of R and the false positives of the new rule are a superset of the false positives of R. A detailed analysis of relevance, presented by Lavrač et al. (1998), is beyond the scope of this paper. After the new rule is included in $New\_beam$, it may happen that some of the existing rules in $New\_beam$ become irrelevant with respect to this new rule. Such rules are eliminated from $New\_beam$ during its reordering (in step 8). The testing of relevance ensures that $New\_beam$ contains only distinct, relevant rules.
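Using the rule representation from the sketch above, this relevance condition reduces to a single subset/superset test; again, this is an illustrative sketch rather than the original implementation:

```python
def irrelevant(new_tp, new_fp, new_beam):
    """True if some rule R in the beam covers a superset of the new rule's true
    positives with a subset of its false positives (sketch of the relevance test)."""
    return any(new_tp <= tp_r and new_fp >= fp_r for _, tp_r, fp_r, _ in new_beam)
```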

In Algorithm SD, the rule quality measure $q_g$ serves two purposes: first, rule evaluation, and second, the evaluation of features and feature conjunctions with high potential for constructing high-quality rules in subsequent iterations. The analysis of this quality measure in Section 4 shows that for the first purpose, a measure assigning different costs to false positives and false negatives could perform equally well, but for the purpose of guiding the search the $q_g$ measure is advantageous.

