The next step in the proposed descriptive induction process starts from the discovered subgroups. In this step, statistical differences in distributions are computed for two populations, the target and the reference population. The target population consists of the true positive cases (CHD patients included in the analyzed subgroup), whereas the reference population consists of all available non-target class examples (all the healthy subjects).
Statistical differences in distributions between these two populations are tested for all the descriptors (attributes) using the χ² test at the 95% confidence level (p = 0.05). For this purpose, numerical attributes have been partitioned into up to 30 intervals so that every interval contains at least 5 instances. Among the attributes with significantly different distributions there are always those that form the features describing the subgroups (the principal factors), but usually there are also other attributes with significantly different value distributions. These attributes are called supporting attributes, and the features formed of their values that are characteristic for the discovered subgroups are called supporting factors.
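The test for a single numerical attribute can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text: that the test is a chi-square test on binned counts, that the intervals are obtained by equal-frequency partitioning of the pooled values, and that the minimum of 5 instances refers to the pooled count per interval. The function names (`chi2_sf`, `bin_cuts`, `chi2_two_populations`) are hypothetical and do not come from the original work.

```python
import math
from bisect import bisect_right


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution.

    Uses the series expansion of the regularized lower incomplete gamma
    function; adequate for a sketch, not tuned for extreme tails.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15 and n < 10000:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def histogram(values, cuts):
    """Count values per interval; interval i holds cuts[i-1] <= v < cuts[i]."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect_right(cuts, v)] += 1
    return counts


def bin_cuts(pooled, max_bins=30, min_count=5):
    """Assumed scheme: equal-frequency cut points on the pooled values,
    lowering the number of intervals until each holds >= min_count values."""
    ordered = sorted(pooled)
    n = len(ordered)
    for k in range(min(max_bins, n // min_count), 1, -1):
        cuts = sorted({ordered[(i * n) // k] for i in range(1, k)})
        if min(histogram(ordered, cuts)) >= min_count:
            return cuts
    return []  # no valid partition: a single interval, nothing to test


def chi2_two_populations(target, reference, alpha=0.05,
                         max_bins=30, min_count=5):
    """Chi-square test on a 2 x B table (target vs. reference, B intervals).

    Returns (statistic, degrees of freedom, p-value, significant?).
    """
    cuts = bin_cuts(list(target) + list(reference), max_bins, min_count)
    t, r = histogram(target, cuts), histogram(reference, cuts)
    nt, nr = sum(t), sum(r)
    n = nt + nr
    df = len(t) - 1
    if df == 0:
        return 0.0, 0, 1.0, False
    stat = 0.0
    for ti, ri in zip(t, r):
        col = ti + ri
        for observed, row_total in ((ti, nt), (ri, nr)):
            expected = row_total * col / n
            stat += (observed - expected) ** 2 / expected
    p = chi2_sf(stat, df)
    return stat, df, p, p < alpha
```

Under these assumptions, an attribute would be reported as a supporting attribute for a subgroup when `chi2_two_populations(target_values, reference_values)` returns `True` in its last position, where `target_values` are the attribute values of the true positive cases and `reference_values` those of the healthy subjects.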
Supporting factors are very important for achieving pattern descriptions that are reasonably complete and acceptable for medical practice, as medical experts dislike short rules and prefer rules that include as much supportive evidence as possible (Kononenko, 1993).
In this work, the role of statistical analysis is to detect meaningful supporting factors, whereas the decision whether they will be used to support the user's confidence in the subgroup description is left to the expert. In the CHD application, the expert decided whether the proposed factors are indeed interesting, how reliable they are, and how easily they can be measured in practice. In Table 3, expert-selected supporting factors are listed next to the individual CHD risk groups, each described by a list of principal factors.