This paper presents a novel subgroup discovery algorithm integrated into the end-to-end knowledge discovery process. The discussion and empirical results point out the importance of effective expert-guided subgroup discovery in the TP/FP space. Its main advantages are the possibility to induce knowledge at different levels of generalization, achieved by tuning the g parameter of the subgroup discovery algorithm, and the rule quality measure in which g is used, which ensures the induction of high-quality rules even in the heuristic subgroup discovery process. The paper argues that the expert's involvement in the induction process is necessary for the successful generation of actionable knowledge.
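The effect of the generalization parameter can be illustrated with a minimal sketch. It assumes that the rule quality measure has the form TP / (FP + g), where TP and FP are the numbers of target-class and non-target-class examples covered by a rule; this form is an assumption made here for illustration and is not restated in this section. The function names and the two candidate rules are hypothetical.

```python
def rule_quality(tp: int, fp: int, g: float) -> float:
    """Assumed quality measure: covered positives over covered negatives plus g."""
    if g <= 0:
        raise ValueError("g must be positive")
    return tp / (fp + g)


# Hypothetical candidate rules given as (name, TP, FP).
CANDIDATES = [
    ("specific", 20, 0),   # covers few positives and no negatives
    ("general", 60, 10),   # covers many positives and some negatives
]


def best_rule(g: float) -> str:
    """Return the name of the candidate with the highest assumed quality."""
    return max(CANDIDATES, key=lambda r: rule_quality(r[1], r[2], g))[0]


if __name__ == "__main__":
    for g in (1, 20):
        print(g, best_rule(g))
```

Under this assumption, a small g penalizes every covered negative heavily and favors the specific rule, whereas a large g tolerates some false positives and favors the more general rule.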
The proposed expert-guided subgroup discovery process consists of the following steps: problem understanding, data understanding and preparation, subgroup discovery, subgroup subset selection, statistical characterization of subgroups, subgroup visualization, and subgroup interpretation and evaluation. The main steps, described in detail in this paper, are subgroup discovery and the selection of a subset of diverse subgroups, followed by the statistical characterization of subgroups, which adds supporting factors to the induced subgroup descriptions. Supporting factors represent redundant information about subgroups but, in our opinion, their function in pattern description is extremely important, because they help experts to obtain a more complete characterization and a better understanding of subgroups. Moreover, they increase the expert's confidence that the pattern is appropriate for the problem that the expert is trying to solve. In addition, subgroup visualization helps in understanding the relationships among patterns and gives visual insight into their sensitivity and false alarm rate.
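The two quantities mentioned for visualization can be computed as in the following sketch, which uses the standard definitions: sensitivity is the fraction of target-class examples covered by a subgroup, and the false alarm rate is the fraction of non-target-class examples covered by it. The function name and the input encoding (two parallel Boolean sequences) are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def tp_fp_rates(covered: Sequence[bool], positive: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, false alarm rate) of one subgroup.

    covered[i]  -- True if example i satisfies the subgroup description
    positive[i] -- True if example i belongs to the target class
    """
    if len(covered) != len(positive):
        raise ValueError("covered and positive must have the same length")
    n_pos = sum(1 for p in positive if p)
    n_neg = len(positive) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    # Count covered target-class (TP) and covered non-target-class (FP) examples.
    tp = sum(1 for c, p in zip(covered, positive) if c and p)
    fp = sum(1 for c, p in zip(covered, positive) if c and not p)
    return tp / n_pos, fp / n_neg


if __name__ == "__main__":
    covered = [True, True, False, True, False, False]
    positive = [True, True, True, False, False, False]
    print(tp_fp_rates(covered, positive))
```

Each subgroup then corresponds to one point in the TP/FP space, with the false alarm rate on one axis and sensitivity on the other.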
The presented approach to descriptive induction uses expert knowledge at every step. Our intention was not to build a system that would replace experts, but rather to provide a methodology that helps experts in the knowledge discovery process. In our view, the possibility of guiding the induction process is an advantage of this approach.