Subgroup Search and Evaluation in the ROC and TP/FP Space

Next: The Descriptive Induction Process Up: Subgroup Discovery: Rule Induction Previous: Rule Subset Selection

2.4 Subgroup Search and Evaluation in the ROC and TP/FP Space

The goal of this section is to clarify the relation between the ROC space which is usually used for evaluating classifier performance, and the TP/FP space which is being searched by the q_g heuristic in the SD algorithm.

Evaluation of induced subgroups in the ROC space (ROC: Receiver Operating Characteristic, Provost & Fawcett, 2001) shows their performance in terms of TPr and FPr, where TPr is the sensitivity of a classifier measuring the fraction of positive cases that are classified as positive, and FPr is the false alarm measuring the fraction of incorrectly classified negative cases: $TPr = \frac{TP}{TP + FN} = \frac{TP}{Pos}$ , and $FPr = \frac{FP}{TN + FP} = \frac{FP}{Neg}$ . A point in the ROC space shows classifier performance in terms of false alarm rate FPr (plotted on the X-axis) that should be as low as possible, and sensitivity TPr (plotted on the Y-axis) that should be as high as possible (see Figure 5 in Section 3.2).

The ROC space is appropriate for measuring the success of subgroup discovery, since subgroups whose TPr/FPr tradeoff is close to the diagonal can be discarded as uninteresting. Conversely, interesting rules/subgroups are those sufficiently distant from the diagonal. Those rules which are most distant from the diagonal define the points in the ROC space from which a convex hull is constructed. The area under the ROC curve defined by subgroups with the best TPr/FPr tradeoff can be used as a quality measure for comparing the success of different learners or subgroup miners. In subgroup construction, the data analyst can try to achieve the desired TPr/FPr tradeoff by building rules using different data mining algorithms, by different parameter settings of a selected data mining algorithm or by applying a cost-sensitive data mining algorithm that takes into the account different misclassification costs.

The q_g measure in the SD algorithm that needs to be maximized, tries to find subgroups that are as far as possible from the diagonal of the ROC space in the direcion of the left upper corner (with TPr equal to 100% and FPr equal to 0%). Note, however, that the actual computation, as implemented in Algorithm SD, is not performed in terms of TPr and FPr, as assumed in the ROC analysis, but rather in terms of TP and FP in the so-called TP/FP space. The reason is the improved computational efficiency of computing the q_g value which is used as a search heuristic for comparing the quality of rules for a given, fixed domain. For a fixed domain, the TP/FP space is as appropriate as the ROC space: the ROC space is namely equivalent to the normalized TP/FP space where Pos and Neg are normalization constants for Y and X axes, respectively. The TP/FP space and the ROC space are illustrated in Section 3.2 by Figures 4 and 5, respectively.

Next: The Descriptive Induction Process Up: Subgroup Discovery: Rule Induction Previous: Rule Subset Selection