This paper addresses the problem of subgroup discovery, which can be defined as follows: given a population of individuals and a property of those individuals that we are interested in, find population subgroups that are statistically "most interesting", e.g., are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest (Klösgen, 1996; Wrobel, 1997, 2001). Its main contribution is a new methodology supporting the process of expert-guided subgroup discovery. Specifically, we introduce a novel parametrized definition of rule quality used in a heuristic beam search algorithm, a rule subset selection algorithm incorporating example weights, the detection of statistically significant properties of selected subgroups, and a novel subgroup visualization method. An in-depth analysis of the proposed quality measure is provided as well. The proposed methodology has been applied to the medical problem of detecting and describing patient groups with high risk for atherosclerotic coronary heart disease (CHD).
The paper is organized as follows. Section 2 describes the algorithms for subgroup detection and selection, which are the main ingredients of the expert-guided subgroup discovery methodology. Section 3 presents the coronary heart disease risk group detection problem, the discovered patient risk groups, and their statistical characterization, visualization, medical interpretation, and evaluation, including a discussion of the expert's role in the subgroup discovery process. Section 4 provides an in-depth analysis of the proposed rule quality measure for subgroup discovery, including an experimental comparison with a selected cost-based quality measure. Finally, Section 5 discusses related work.