Statistical Characterization of Subgroups

The next step in the proposed descriptive induction process starts from the discovered subgroups. In this step, statistical differences in distributions are computed for two populations, the target and the reference population. The target population consists of the true positive cases (CHD patients included in the analyzed subgroup), whereas the reference population consists of all available non-target class examples (all the healthy subjects).

Statistical differences in distributions for all the descriptors (attributes) between these two populations are tested using the $\chi^2$ test at the 95% confidence level (p = 0.05). For this purpose, numerical attributes have been partitioned into up to 30 intervals so that every interval contains at least 5 instances. Among the attributes with significantly different distributions there are always those that form the features describing the subgroups (the principal factors), but usually there are also other attributes with significantly different value distributions. These attributes are called supporting attributes, and the features formed from their values that are characteristic for the discovered subgroups are called supporting factors.
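The test described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names (chi2_sf, bin_edges, chi2_attribute) are hypothetical, the equal-frequency binning strategy is an assumption (the text only fixes the limits of 30 intervals and 5 instances), and the 5-instance minimum is assumed to apply to the pooled target and reference values.

```python
import bisect
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution.

    Uses the power series of the regularized lower incomplete gamma
    function; adequate for a 0.05 threshold, but very small p-values
    lose precision and may be returned as 0.0.
    """
    if x <= 0:
        return 1.0
    a, half = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-half + a * math.log(half) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def bin_edges(values, max_bins=30, min_count=5):
    """Cut points giving at most max_bins intervals with >= min_count values each.

    Assumption: equal-frequency binning over the pooled values; a cut is
    never placed between two equal values.
    """
    ordered = sorted(values)
    n = len(ordered)
    per_bin = max(min_count, math.ceil(n / max_bins))
    edges = []
    i = per_bin
    while True:
        # Move the cut forward past ties so equal values share an interval.
        while i <= n - min_count and ordered[i] == ordered[i - 1]:
            i += 1
        if i > n - min_count:  # the last interval must keep min_count values
            break
        edges.append((ordered[i - 1] + ordered[i]) / 2.0)
        i += per_bin
    return edges


def chi2_attribute(target, reference, numeric=False):
    """Chi-square test of one attribute: target vs. reference population.

    Both populations are assumed to be non-empty.
    Returns (statistic, degrees of freedom, p-value).
    """
    if numeric:
        edges = bin_edges(list(target) + list(reference))
        target = [bisect.bisect_left(edges, v) for v in target]
        reference = [bisect.bisect_left(edges, v) for v in reference]
    t_counts, r_counts = Counter(target), Counter(reference)
    n_t, n_r = len(target), len(reference)
    n = n_t + n_r
    stat = 0.0
    categories = set(t_counts) | set(r_counts)
    for c in categories:
        col = t_counts[c] + r_counts[c]
        for observed, row in ((t_counts[c], n_t), (r_counts[c], n_r)):
            expected = row * col / n
            stat += (observed - expected) ** 2 / expected
    df = len(categories) - 1
    p = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p
```

Under these assumptions, an attribute would be reported as a candidate supporting attribute when the returned p-value is below 0.05, for example via chi2_attribute(subgroup_values, healthy_values, numeric=True), where both argument names are placeholders for the attribute values in the two populations.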

Supporting factors are very important for achieving pattern descriptions that are reasonably complete and acceptable for medical practice, as medical experts dislike short rules and prefer rules that include as much supportive evidence as possible (Kononenko, 1993).

In this work, the role of statistical analysis is to detect meaningful supporting factors, whereas the decision whether they will be used to support the user's confidence in the subgroup description is left to the expert. In the CHD application, the expert has decided whether the proposed factors are indeed interesting, how reliable they are, and how easily they can be measured in practice. In Table 3, the expert-selected supporting factors are listed next to the individual CHD risk groups, each described by a list of principal factors.

Subgroup	Principal Factors	Supporting Factors
A1	positive family history	psychosocial stress
	age over 46 years	cigarette smoking
		hypertension
		overweight
A2	body mass index over 25 kg m^-2	positive family history
	age over 63 years	hypertension
		slightly increased LDL cholesterol
		normal but decreased HDL cholesterol
B1	total cholesterol over 6.1 mmol L^-1	increased triglycerides value
	age over 53 years
	body mass index below 30 kg m^-2
B2	total cholesterol over 5.6 mmol L^-1	positive family history
	fibrinogen over 3.7 mmol L^-1
	body mass index below 30 kg m^-2
C1	left ventricular hypertrophy	positive family history
		hypertension
		diabetes mellitus

Table 3: Induced subgroup descriptions (principal factors) and their statistical characterizations (supporting factors).
