Results of Expert-Guided Subgroup Detection and Selection

The process of expert-guided subgroup discovery was performed as follows. For each of the data stages A, B and C, the DMS algorithm was run for values of g in the range 0.5 to 100, with the number of selected output rules fixed at 3. The rules induced in this iterative process were shown to the expert for selection and interpretation. The inspection of 15-20 rules for each data stage triggered further experiments. The medical expert involved in this study made two concrete suggestions: to limit the number of features in the rule body, and to avoid generating rules whose features involve expensive and/or unreliable laboratory tests. Consequently, we performed further experiments in which we intentionally limited the feature space and the number of iterations in the main loop of the SD algorithm (steps 2-12 of Algorithm SD).
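The effect of sweeping g can be illustrated with a minimal sketch. It assumes the quality measure q_g = TP/(FP+g) quoted in the caption of Figure 4 and simply ranks a fixed list of candidate rules by it. The rule names and TP/FP counts are hypothetical and are not taken from the study, and the actual algorithm searches the rule space rather than ranking a given list.

```python
def quality(tp, fp, g):
    """Quality measure q_g = TP / (FP + g), as given in the Figure 4 caption."""
    return tp / (fp + g)


def top_rules(candidates, g, k=3):
    """Return the names of the k candidates with the highest q_g.

    candidates: list of (name, tp, fp) tuples (hypothetical data).
    """
    ranked = sorted(candidates,
                    key=lambda rule: quality(rule[1], rule[2], g),
                    reverse=True)
    return [name for name, _, _ in ranked[:k]]


# Hypothetical candidate rules: (name, true positives, false positives).
candidates = [
    ("specific", 10, 0),
    ("medium", 40, 10),
    ("general", 90, 60),
    ("weak", 20, 30),
]

for g in (0.5, 10, 100):
    print(g, top_rules(candidates, g))
```

With these counts, a small g ranks the rule with no false positives first, while a large g ranks the rule covering the most positive cases first, which is why a range of g values yields a varied set of rules for the expert to inspect.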

In this iterative process, the expert selected five interesting CHD risk groups. Table 1 shows the induced subgroups, together with the values of g and the rule significance. In the subgroup discovery terminology proposed in this paper, the features appearing in the conditions of rules describing the subgroups are called the principal factors. The described iterative process was successful for data at stages B and C, but it turned out that anamnestic data on its own (stage A data) is not informative enough for inducing subgroups, i.e., the subgroups induced from it failed to fulfil the expert's criteria of interestingness. Only after engineering the domain, by separating male and female patients, were interesting subgroups discovered. See Section 3.7 for more details on the expert's involvement in this subgroup discovery process.

	Expert Selected Subgroups			g	Sig
A1	CHD	$\leftarrow$	positive family history AND	14	95%
			age over 46 years
A2	CHD	$\leftarrow$	body mass index over 25 kgm^-2 AND	8	99%
			age over 63 years
B1	CHD	$\leftarrow$	total cholesterol over 6.1 mmolL^-1 AND	10	99.9%
			age over 53 years AND
			body mass index below 30 kgm^-2
B2	CHD	$\leftarrow$	total cholesterol over 5.6 mmolL^-1 AND	12	99.9%
			fibrinogen over 3.7 gL^-1 AND
			body mass index below 30 kgm^-2
C1	CHD	$\leftarrow$	left ventricular hypertrophy	10	99.9%

Table 1: Induced subgroups in the form of rules. Rule conditions are conjunctions of principal factors. Subgroup A1 is for male patients, subgroup A2 for female patients, while subgroups B1, B2, and C1 are for male and female patients. The subgroups are induced from different attribute subsets with corresponding g parameter values given in column g. The last column Sig contains information about the significance of the rules computed by the $\chi ^2$ test.
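As an illustration of how a significance value such as those in column Sig could be obtained, the sketch below applies the standard Pearson $\chi ^2$ test (one degree of freedom, no continuity correction) to a 2x2 table of covered/uncovered versus positive/negative examples. The exact formulation used for Table 1 is not given in this section, so this is an assumption, and the function name and the example counts are hypothetical.

```python
import math


def chi2_significance(tp, fp, pos, neg):
    """Pearson chi-square test for a rule on a 2x2 contingency table.

    tp, fp: positive and negative examples covered by the rule.
    pos, neg: total positive and negative examples.
    Assumes the rule covers at least one example and not all of them.
    Returns (chi2, confidence), where confidence = 1 - p-value.
    """
    observed = [[tp, fp], [pos - tp, neg - fp]]
    total = pos + neg
    row_sums = [tp + fp, total - tp - fp]
    col_sums = [pos, neg]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, 1.0 - p_value


# Hypothetical rule: covers 40 of 100 positives and 10 of 100 negatives.
chi2, confidence = chi2_significance(40, 10, 100, 100)
print(round(chi2, 2), confidence > 0.999)
```

A rule whose covered examples have the same class distribution as the whole data set gets $\chi ^2 = 0$ under this test, that is, zero significance.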

Separately for each data stage, we have investigated which of the induced rules are the best in terms of the ROC space, i.e., which of them define the ROC convex hull. At stage B, for instance, seven rules are on the convex hull shown in Figures 4 and 5 for the TP/FP and the ROC space, respectively. Two of these rules, X1 and X2, indicated in the figures, are listed in Table 2. Notice that the expert-selected subgroups B1 and B2 are significant, but are not among those lying on the convex hull. The reasons for selecting exactly those two rules at stage B are their simplicity (consisting of three features only), their generality (covering relatively many positive cases) and the fact that the features used are, from the medical point of view, inexpensive laboratory tests.
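Determining which rules lie on the convex hull can be sketched as follows. Each rule is treated as a point (FP, TP), and the upper convex hull between the trivial points (0, 0) and (N, P) is computed with a monotone-chain scan. The function names and the example points are hypothetical; this is one common way of computing such a hull, not necessarily the procedure used in the study.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points, pos, neg):
    """Return the upper convex hull of rule points in the TP/FP space.

    points: iterable of (fp, tp) pairs, one per rule.
    pos, neg: total numbers of positive and negative examples.
    The trivial rules (0, 0) and (neg, pos) are always included.
    """
    pts = sorted(set(points) | {(0, 0), (neg, pos)})
    hull = []
    for p in pts:
        # Drop the last vertex while it lies on or below the line to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


# Hypothetical rules as (FP, TP) with 100 positives and 100 negatives.
rules = [(5, 30), (20, 40), (30, 80), (60, 70)]
print(roc_convex_hull(rules, pos=100, neg=100))
```

In this example the points (20, 40) and (60, 70) fall below the hull. Rules in that position can still be statistically significant, as is the case for B1 and B2.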

	Best Induced Subgroups			g	Sig
X1	CHD	$\leftarrow$	age over 61 years AND	4	99.9%
			triglycerides below 1.85 mmolL^-1 AND
			high density lipoprotein below 1.25 mmolL^-1
X2	CHD	$\leftarrow$	body mass index over 25 AND	16	99.9%
			high density lipoprotein below 1.25 mmolL^-1 AND
			uric acid below 360 mmolL^-1 AND
			glucose below 7 mmolL^-1 AND
			fibrinogen over 3.7 gL^-1

Table 2: Two of the best subgroups induced for stage B. Their positions in the TP/FP and the ROC space are marked in Figures 4 and 5, respectively.

$\epsfbox{jair2002_TPFP.eps}$

Figure 4: The TP/FP space presenting the convex hull of subgroups induced using the quality measure $q_g = TP/(FP+g)$ at data stage B. Labels B1 and B2 denote positions of subgroups selected by the medical expert, and X1 and X2 two of the seven subgroups forming the TP/FP convex hull.

$\epsfbox{jair2002_ROC.eps}$

Figure 5: The same subgroups as in Figure 4 shown in the ROC space instead of the TP/FP space. The equivalence of these two spaces can be easily noticed. In the ROC space a thin line connecting points (0,0) and (100,100) represents rule positions with significance equal to zero.
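The relation between the two spaces can be made concrete with a small sketch, assuming that the ROC axes are the percentages of covered negative and positive examples. The function names and counts are hypothetical.

```python
def to_roc(tp, fp, pos, neg):
    """Map absolute counts (FP, TP) to ROC percentages (FPr, TPr)."""
    return 100.0 * fp / neg, 100.0 * tp / pos


def lift_over_diagonal(tp, fp, pos, neg):
    """Vertical distance of a rule from the (0,0)-(100,100) diagonal."""
    fpr, tpr = to_roc(tp, fp, pos, neg)
    return tpr - fpr


# Hypothetical data set with 100 positive and 200 negative examples.
print(to_roc(40, 10, 100, 200))              # a rule above the diagonal
print(lift_over_diagonal(25, 50, 100, 200))  # a rule on the diagonal
```

Because each axis is only rescaled by a constant, the same rules form the convex hull in both spaces. A rule on the diagonal covers the same fraction of positives as of negatives, so it does not change the class distribution.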
