Evaluation of the Architectural Components

In this section, we evaluate the architectural components of our algorithm using degradation (ablation) studies. We perform experiments without each component in turn, and then with none of them, to observe the impact on the system's performance. Such studies have been useful in developing practical methods for other kinds of anaphora resolution as well (see, for example, [24]). Specifically, an experiment was performed testing each of the following variations.

Table 10 shows the results for each variation when run over the unambiguous but uncorrected CMU training data. For comparison, the first row shows the results for the system as normally configured. As with the previous evaluations, accuracy is the percentage of the key's answers that the system reproduces correctly, while precision is the percentage of the system's answers that are correct.

Table 10: Evaluation of the Variations on CMU Unambiguous/Uncorrected Data

Variation	Cor	Inc	Mis	Ext	Nul	Act	Poss	Acc	Prec
system as is	1283	44	112	37	574	1938	2013	0.923	0.958
all CFs 1.0	1261	77	101	50	561	1949	2000	0.911	0.935
all CFs 0.0	1202	118	119	49	562	1931	2001	0.882	0.914
-critics	1228	104	107	354	667	2353	2106	0.900	0.805
-dist. factors	1265	52	122	50	591	1958	2030	0.914	0.948
-merge	1277	46	116	54	577	1954	2016	0.920	0.949
combo	1270	53	116	67	594	1984	2033	0.917	0.940

Legend
Cor(rect):	System and key agree on non-null value
Inc(orrect):	System and key differ on non-null value
Mis(sing):	System has null value for non-null key
Ext(ra):	System has non-null value for null key
Nul(l):	Both system and key give null answer

Poss(ible):	Correct + Incorrect + Missing + Null
Act(ual):	Correct + Incorrect + Extra + Null
Base(line)Acc(uracy):	Baseline accuracy (input used as is)
Acc(uracy):	% Key values matched correctly ((Correct + Null)/Possible)
Prec(ision):	% System answers matching the key ((Correct + Null)/Actual)
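
The legend's formulas can be checked directly against the rows of Table 10. The following is a minimal sketch, not the evaluation code used in the study; the class name `Counts` and its field names are hypothetical, chosen only to mirror the column abbreviations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Raw tallies for one row of Table 10 (field names are illustrative)."""
    cor: int  # system and key agree on a non-null value
    inc: int  # system and key differ on a non-null value
    mis: int  # system gives null where the key is non-null
    ext: int  # system gives non-null where the key is null
    nul: int  # both system and key give null

    @property
    def possible(self) -> int:
        # Poss = Correct + Incorrect + Missing + Null
        return self.cor + self.inc + self.mis + self.nul

    @property
    def actual(self) -> int:
        # Act = Correct + Incorrect + Extra + Null
        return self.cor + self.inc + self.ext + self.nul

    @property
    def accuracy(self) -> float:
        # Acc = (Correct + Null) / Possible
        return (self.cor + self.nul) / self.possible

    @property
    def precision(self) -> float:
        # Prec = (Correct + Null) / Actual
        return (self.cor + self.nul) / self.actual


# Counts copied from the "system as is" and "-critics" rows of Table 10.
full_system = Counts(cor=1283, inc=44, mis=112, ext=37, nul=574)
no_critics = Counts(cor=1228, inc=104, mis=107, ext=354, nul=667)

for name, row in [("system as is", full_system), ("-critics", no_critics)]:
    print(f"{name}: Poss={row.possible} Act={row.actual} "
          f"Acc={row.accuracy:.3f} Prec={row.precision:.3f}")
```

Because Extra answers enter only the Actual total, a variation that produces many extraneous answers lowers precision much more than accuracy, which is the pattern seen in the -critics row.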

Only two of the differences are statistically significant ($p \le 0.05$), namely, the precision of the system's performance when the critics are not used, and the accuracy of the system's performance when all of the certainty factors are 0. The significance analysis was performed using paired t-tests comparing the results for each variation with the results for the system as normally configured.
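
For readers unfamiliar with the test, the paired t statistic can be sketched as follows. This is an illustration under stated assumptions, not the analysis script used here: the text does not say what the paired units are, so the sketch assumes one score per test unit (for example, per dialog) for each configuration, and the function name `paired_t` is hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t(baseline_scores, variant_scores):
    """Return (t, degrees_of_freedom) for a paired t-test.

    Assumption: baseline_scores[i] and variant_scores[i] are scores for
    the same unit i under the normal system and under one variation.
    """
    if len(baseline_scores) != len(variant_scores):
        raise ValueError("paired samples must have the same length")

    # The test operates on the per-unit differences, not the raw scores.
    diffs = [b - v for b, v in zip(baseline_scores, variant_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")

    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are equal; t is undefined")

    # t = mean difference divided by its standard error
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

The resulting statistic is compared against a t distribution with n - 1 degrees of freedom to obtain the p-value; a difference is reported as significant when $p \le 0.05$.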

The drop in precision when the critics are not used (from 0.958 to 0.805) is due to extraneous alternatives that the critics would have weeded out: the number of Extra answers rises from 37 to 354. The drop in accuracy when the certainty factors are all 0 (from 0.923 to 0.882) shows that the certainty factors contribute to choosing the correct answer. Experimenting with statistical methods to derive them would likely lead to further improvement.

The remaining figures are all only slightly lower than those for the full system, and are all much higher than the baseline accuracies.

The unimportance of the distance factors (variation 5) is consistent with the findings presented in Section 8.1 that the last-mentioned time is an acceptable antecedent in the vast majority of cases. Otherwise, we might have expected to see an improvement in variation 5, since the distance factors penalize going further back on the focus list.