
5.1 Subgroup Discovery

The need for user interactivity in subgroup discovery is addressed by Wrobel et al. (1996), who describe a system developed in the KESO European research project (Knowledge Extraction for Statistical Offices), and in the systems EXPLORA (Klösgen, 1996) and MIDOS (Wrobel, 1997, 2001). EXPLORA treats the learning task as a single-relation problem, i.e., all the data are assumed to be available in one table (relation), whereas MIDOS extends this task to multi-relation databases, which relates it to a number of other learning tasks (De Raedt & Dehaspe, 1997; Mannila & Toivonen, 1996; Wrobel & Dzeroski, 1995), mostly in the field of Inductive Logic Programming (Dzeroski & Lavrac, 2001; Lavrac & Dzeroski, 1994).

The features of EXPLORA and MIDOS most relevant to this paper concern the use of heuristics for subgroup discovery; their measures of interestingness and search heuristics are outlined in separate sections below. Related to our approach to rule subset selection, presented in Section 2.3, is Gebhardt's (1991) work on subgroup suppression.

Note that some approaches to association rule induction can also be used for subgroup discovery. For instance, the APRIORI-C algorithm (Jovanoski & Lavrac, 2001), which adapts association rule induction to classification rule induction, outputs classification rules with guaranteed support and confidence with respect to a target class. If a rule also satisfies a user-defined significance threshold, the induced APRIORI-C rule is an independent `chunk' of knowledge about the target class, which can be viewed as a subgroup description with guaranteed significance, support and confidence. Similarly, the confirmation rule concept, introduced by Gamberger and Lavrac (2000) and used as the basis for the subgroup discovery algorithm in this paper, employs a minimal support requirement that every rule must satisfy in order to be included in the induced confirmation rule set.
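To make these rule quality requirements concrete, the following sketch shows how the support and confidence of a single rule can be evaluated with respect to a target class; the rule representation, example encoding and threshold values are illustrative assumptions, not part of APRIORI-C or of the confirmation rule algorithm.

    # Illustrative sketch: a rule is assumed to be a predicate over examples,
    # and each example is a dict carrying a 'class' label. Threshold values are
    # hypothetical, not those used by APRIORI-C.

    def rule_quality(rule, examples, target_class):
        """Return (support, confidence) of a rule with respect to target_class."""
        covered = [e for e in examples if rule(e)]                      # rule body holds
        true_pos = [e for e in covered if e["class"] == target_class]   # body and target class hold
        support = len(true_pos) / len(examples)                         # fraction of all examples
        confidence = len(true_pos) / len(covered) if covered else 0.0   # precision on covered examples
        return support, confidence

    def accept(rule, examples, target_class, min_support=0.05, min_confidence=0.8):
        """Keep a rule only if it meets the minimal support and confidence requirements."""
        support, confidence = rule_quality(rule, examples, target_class)
        return support >= min_support and confidence >= min_confidence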

Both of the above-mentioned approaches to subgroup discovery exploit information about class membership. One of the main reasons why these approaches are of interest for subgroup discovery is that, unlike classical classification rule induction algorithms such as CN2 (Clark & Niblett, 1989) and AQ (Michalski, Mozetic, Hong, & Lavrac, 1986), they do not use the covering algorithm. In covering algorithms, only the first few induced rules may be of interest as subgroup descriptors with sufficient coverage. Subsequent rules are induced from biased example subsets, e.g., subsets including only the positive examples not covered by previously induced rules. This bias constrains the population for subgroup discovery in a way that is unnatural for the subgroup discovery process, which is, in general, aimed at discovering interesting properties of subgroups of the entire population.
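For illustration, the sketch below shows where this bias arises in a standard covering loop: after each rule is accepted, the positive examples it covers are removed, so every subsequent rule is learned from an increasingly unrepresentative subset of the population. The helper find_best_rule is a hypothetical placeholder for any heuristic rule search, not a component of any of the cited systems.

    # Illustrative sketch of a standard covering loop (not any specific system).
    # find_best_rule is a hypothetical placeholder for a heuristic rule search
    # that returns a rule (a predicate over examples) or None.

    def covering(positives, negatives, find_best_rule):
        rules = []
        remaining = list(positives)
        while remaining:
            rule = find_best_rule(remaining, negatives)
            if rule is None:
                break
            rules.append(rule)
            # Covered positives are removed, so later rules describe only the
            # leftover examples rather than subgroups of the entire population.
            remaining = [e for e in remaining if not rule(e)]
        return rules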

Recent approaches to subgroup discovery aim to overcome this inappropriate bias of the standard covering algorithm. The subgroup discovery algorithms CN2-SD (Lavrac, Flach, Kavsek, & Todorovski, 2002) and RSD (Lavrac, Zelezný, & Flach, 2002) use the so-called weighted covering algorithm, similar to the one implemented in Algorithm DMS described in this paper.
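As a rough sketch of the weighted covering idea, the loop below decreases the weights of covered examples instead of removing them, so that all examples remain available to every rule while already covered ones contribute less to the search heuristic. The additive weight 1/(k+1), with k the number of previously induced rules covering an example, is one scheme used in the literature; it is shown here only for illustration and is not necessarily the weighting used by Algorithm DMS.

    # Illustrative sketch of a weighted covering loop. Covered positives are
    # down-weighted rather than removed; find_best_rule is a hypothetical
    # placeholder for a heuristic rule search that takes the weights into account.

    def weighted_covering(positives, negatives, find_best_rule, num_rules=10):
        counts = {id(e): 0 for e in positives}      # rules covering each positive so far
        weights = {id(e): 1.0 for e in positives}   # current weight of each positive
        rules = []
        for _ in range(num_rules):
            rule = find_best_rule(positives, negatives, weights)
            if rule is None:
                break
            rules.append(rule)
            for e in positives:
                if rule(e):
                    counts[id(e)] += 1
                    weights[id(e)] = 1.0 / (counts[id(e)] + 1)   # covered examples count less
        return rules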

Instance weights play an important role in boosting (Freund & Schapire, 1996) and alternating decision trees (Schapire & Singer, 1999). Instance weights have also been used in variants of the covering algorithm implemented in rule learning approaches such as SLIPPER (Cohen, 1999), RL (Lee, Buchanan, & Aronis, 1998) and DAIRY (Hsu, Soderland, & Etzioni, 1998). A variant of the weighted covering algorithm has also been used in the context of confirmation rule subset selection (Gamberger & Lavrac, 2000), which serves as the basis for the rule subset selection algorithm RSS described in this paper.

