In this section, we evaluate the architectural components of our algorithm using degradation (ablation) studies. We run the system without each component in turn, and then without any of them, to observe the impact on performance. Such studies have been useful in developing practical methods for other kinds of anaphora resolution as well (see, for example, [24]). Specifically, one experiment was performed for each of the following variations.
Recall that all rules are applied to each utterance, and each rule that matches produces a Partial-Augmented-ILT (which is assigned the certainty factor of the rule). All maximal mergings of the Partial-Augmented-ILTs are then formed, to create a set of Augmented-ILTs. Then, the final interpretation of the utterance is chosen from among the set of Augmented-ILTs. The certainty factor of each Augmented-ILT is the sum of the certainty factors of the Partial-Augmented-ILTs composing it. Thus, setting the certainty factors to 1 implements the scheme in which the more partial results are merged into an interpretation, the higher the overall certainty factor of that interpretation. In other words, this scheme favors the Augmented-ILT resulting from the greatest number of rule applications.
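The selection scheme recalled above can be sketched in a few lines. This is an illustration under stated assumptions, not the system's implementation: the `Partial` record, the `compatible` test (two partial results may merge when they do not assign different values to the same field), and the sample rules `r1` to `r3` with their certainty factors are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Partial:
    """Hypothetical stand-in for a Partial-Augmented-ILT."""
    rule: str
    fields: tuple  # (field, value) pairs filled in by the rule
    cf: float      # certainty factor inherited from the rule


def compatible(a, b):
    # Assumption: two partial results can merge unless they disagree on a shared field.
    da, db = dict(a.fields), dict(b.fields)
    return all(da[k] == db[k] for k in da.keys() & db.keys())


def maximal_mergings(partials):
    # Enumerate every mutually compatible subset, then keep those not contained in a larger one.
    consistent = [
        frozenset(group)
        for n in range(1, len(partials) + 1)
        for group in combinations(partials, n)
        if all(compatible(a, b) for a, b in combinations(group, 2))
    ]
    return [s for s in consistent if not any(s < t for t in consistent)]


def score(merging):
    # The certainty factor of a merged interpretation is the sum over its parts.
    return sum(p.cf for p in merging)


def best(partials):
    return max(maximal_mergings(partials), key=score)


def make_partials(cfs):
    # Three hypothetical rule results; r3 conflicts with both r1 and r2.
    return [
        Partial("r1", (("month", "June"),), cfs[0]),
        Partial("r2", (("day", 3),), cfs[1]),
        Partial("r3", (("month", "July"), ("day", 4)), cfs[2]),
    ]


weighted = best(make_partials([0.3, 0.3, 0.9]))  # made-up certainty factors
uniform = best(make_partials([1.0, 1.0, 1.0]))   # all certainty factors set to 1
print(sorted(p.rule for p in weighted), sorted(p.rule for p in uniform))
```

With the made-up weights, the single high-certainty result `r3` wins; with every certainty factor set to 1, the merging of `r1` and `r2` wins because it combines more rule applications.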
With all rule certainty factors set to 0, selection is essentially random among the Augmented-ILTs that make sense according to the critics. If the critics did not exist, setting the rule certainty factors to 0 would result in purely random selection. With the critics, any Augmented-ILTs to which the critics apply are excluded from consideration, because the critics lower their certainty factors to negative numbers.
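The interaction between zero certainty factors and the critics can be illustrated as follows. This is a sketch only: the candidate representation, the `day_out_of_range` critic, and the `PENALTY` value are assumptions made for the example; the text states only that a critic lowers the certainty factor to a negative number.

```python
import random

PENALTY = -1.0  # assumed magnitude; only its sign matters here


def day_out_of_range(candidate):
    """Hypothetical critic: objects to a day that cannot occur in the given month."""
    days_in_month = {2: 29, 4: 30, 6: 30, 9: 30, 11: 30}
    return candidate["day"] > days_in_month.get(candidate["month"], 31)


def select(candidates, critics, rng):
    # Every rule certainty factor is 0, so a candidate scores 0 unless a critic objects.
    scored = [
        (c, PENALTY if any(critic(c) for critic in critics) else 0.0)
        for c in candidates
    ]
    top = max(cf for _, cf in scored)
    # Ties at the top score are broken at random.
    return rng.choice([c for c, cf in scored if cf == top])


candidates = [
    {"month": 2, "day": 30},  # rejected by the critic
    {"month": 3, "day": 30},
    {"month": 4, "day": 30},
]
choice = select(candidates, [day_out_of_range], random.Random(0))
print(choice)
```

The February candidate can never be chosen, while the other two are equally likely; with an empty list of critics, all three would be equally likely.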
In this variation, the Partial-Augmented-ILTs are not merged prior to selection of the final Augmented-ILT. The effect is that the result of a single rule is chosen as the final interpretation.
When the distance factors are not used, the certainty factors for rules that access the focus list are not adjusted based on how far back the chosen focus list item is.
When all of the variations are combined, neither the critics nor the distance factors are used, no merging of partial results is performed, and the rules are all given the same certainty factor (namely, 1).
Table 10 shows the results for each variation when run over the unambiguous but uncorrected CMU training data. For comparison, the first row shows the results for the system as normally configured. As with the previous evaluations, accuracy is the percentage of the correct answers the system produces, while precision is the percentage of the system's answers that are correct.
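The two measures differ only in their denominators, as the following sketch shows. It assumes that "the correct answers" means the answers in the answer key; the function name and the counts are hypothetical and are not taken from Table 10.

```python
def accuracy_and_precision(n_correct, n_key, n_produced):
    """Return (accuracy, precision) as percentages.

    n_correct:  answers produced by the system that are correct
    n_key:      correct answers in the answer key (assumed denominator for accuracy)
    n_produced: all answers the system produced
    """
    accuracy = 100.0 * n_correct / n_key
    precision = 100.0 * n_correct / n_produced
    return accuracy, precision


# Made-up counts, for illustration only.
acc, prec = accuracy_and_precision(n_correct=80, n_key=100, n_produced=90)
print(f"accuracy {acc:.1f}%, precision {prec:.1f}%")
```

A system that produces extraneous answers lowers its precision without changing its accuracy, because only the number of produced answers grows.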
Only two of the differences are statistically significant, namely, the precision of the system's performance when the critics are not used, and the accuracy of the system's performance when all of the certainty factors are 0. The significance analysis was performed using paired t-tests comparing the results for each variation with the results for the system as normally configured.
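For readers unfamiliar with the test, a paired t statistic can be computed as sketched below. The unit of pairing (here, one score per dialog) and the scores themselves are assumptions made for illustration; they are not the data behind Table 10. The sketch stops at the statistic and its degrees of freedom; obtaining a p-value requires a t-distribution table or a statistics library.

```python
import math
import statistics


def paired_t(full_system, variation):
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(full_system) != len(variation) or len(full_system) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(full_system, variation)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-dialog scores for the full system and for one variation.
full = [0.83, 0.79, 0.88, 0.81, 0.85]
ablated = [0.80, 0.78, 0.84, 0.80, 0.79]
t, df = paired_t(full, ablated)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

Pairing the scores removes variation between dialogs from the comparison, so the test looks only at how each dialog's score changes when a component is removed.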
The performance difference when the critics are not used is due to extraneous alternatives that the critics would have weeded out. The drop in accuracy when the certainty factors are all 0 shows that the certainty factors have some effect. Experimenting with statistical methods to derive them would likely lead to further improvement.
The remaining figures are all only slightly lower than those for the full system, and are all much higher than the baseline accuracies.
Interestingly, the unimportance of the distance factors (variation 5) is consistent with the findings presented in Section 8.1 that the last mentioned time is an acceptable antecedent in the vast majority of cases. Otherwise, we might have expected to see an improvement in variation 5, since the distance factors penalize going further back on the focus list.