Detailed results for the test sets are presented in this section, starting with results for the CMU data (see Table 2). Accuracy measures the extent to which the system produces the correct answer, while precision measures the extent to which the system's answers are correct (see the formulas in Table 2). For each component of the extracted temporal structure, the system's correct and incorrect answers were counted. Since null values occur quite often, these counts exclude cases in which the system's answer, the correct answer, or both answers are null. Those cases were counted separately. Note that each test set contains three complete dialogs with an average of 72 utterances per dialog.
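For concreteness, one standard scoring scheme consistent with these descriptions is sketched below; the exact formulas are the ones given in Table 2, so this is only an assumed reading, with cor, inc, mis, and ext denoting the numbers of correct, incorrect, missing, and extraneous answers (the latter two presumably being what the Mis and Ext columns report):

$$\mathit{accuracy} = \frac{\mathit{cor}}{\mathit{cor} + \mathit{inc} + \mathit{mis}}, \qquad \mathit{precision} = \frac{\mathit{cor}}{\mathit{cor} + \mathit{inc} + \mathit{ext}}.$$

Under this reading, accuracy penalizes the system for answers it fails to produce, whereas precision considers only the answers it does produce.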
These results show that the system achieves an overall accuracy of 81%, significantly better than the baseline accuracy (defined below) of 43%, along with a high precision of 92%. For some individual fields, however, the results are lower than they could be, for several reasons. For example, system development inevitably focused more on some fields than others; an obvious area for improvement is the processing of the time of day fields. Note also that the values in the Mis column are higher than those in the Ext column, reflecting the conservative coding convention, mentioned in Section 3, for filling in unspecified end points.
The accuracy and precision figures for the hour & minute and time of day fields are very high because a large proportion of them are null. We include null correct answers in our figures because such answers often reflect valid decisions not to fill in explicit values from previous Temporal Units.
Table 3 contains the results for the system on the NMSU data. It shows that the system performs respectably, with 69% accuracy and 88% precision, on the more complex set of data. The precision is still comparable, but the accuracy is lower, since more of the entries are left unspecified (that is, the figures in the Mis column in Table 3 are higher than in Table 2). Furthermore, the baseline accuracy (29%) is nearly 15 percentage points lower than that for the CMU data (43%), supporting the claim that this data set is more challenging.
The baseline accuracies for the test data sets are shown in Table 4. These values were derived by disabling all the rules and evaluating the input itself (after performing normalization, so that the evaluation software could be applied). Since null values are the most frequent for all fields, this is equivalent to using a naive algorithm that selects the most frequent value for each field. Note that in Tables 2 and 3, the baseline accuracies for the end month, date, and day of week fields are quite low because the coding convention calls for filling in these fields, even though they are not usually explicitly specified. In this case, an alternative baseline would have been to use the corresponding starting field. This has not been calculated, but the results can be approximated by using the baseline figures for the starting fields.
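As an illustration, the sketch below implements the naive most-frequent-value algorithm that this baseline is equivalent to. It is written in Python with hypothetical record and field structures, and is not the evaluation software used in these experiments:

    from collections import Counter

    def most_frequent_value_baseline(train_records, test_records, fields):
        """Predict, for every field, the most frequent value in the training data.

        train_records and test_records are lists of dicts mapping field names to
        annotated values, with None standing for null (a hypothetical format).
        Returns the per-field predictions and their accuracy on the test records.
        """
        predictions = {}
        for field in fields:
            counts = Counter(rec.get(field) for rec in train_records)
            predictions[field] = counts.most_common(1)[0][0]  # most often None (null)

        accuracy = {}
        for field in fields:
            hits = sum(1 for rec in test_records if rec.get(field) == predictions[field])
            accuracy[field] = hits / len(test_records)
        return predictions, accuracy

Because null is the most frequent value for every field in this data, the per-field predictions all come out null, which is why scoring the raw (normalized) input with the rules disabled yields the same figures.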
The rightmost column of Table 4 shows that there is a small amount of error in the input representation. This figure is 1 minus the precision of the input representation (after normalization). Note, however, that this is a close but not exact measure of the error in the input, because there are a few cases in which the normalization process introduces errors and a few in which it corrects them. Recall that the input is ambiguous; the figures in Table 4 are based on the system selecting the first ILT in each case. Since the parser orders the ILTs by a measure of acceptability, this choice is the one most likely to contain the relevant temporal information.
The above results are for the system taking ambiguous semantic representations as input. To help isolate errors due to our model, the system was also evaluated on unambiguous, partially corrected input for all the seen data (the test sets were retained as unseen test data). The input is only partially corrected because some errors are not feasible to correct manually, given the complexity of the input representation.
The overall results are shown in Table 5, which also includes the results presented earlier in Tables 2 and 3 to facilitate comparison. In the CMU data set, there are twelve dialogs in the training data and three dialogs in a held-out test set; the average length of each dialog is approximately 65 utterances. In the NMSU data set, there are four training dialogs and three test dialogs.
In both data sets, performance on the seen data improves noticeably when moving from ambiguous to unambiguous input, especially for the NMSU data. This indicates that semantic ambiguity and input errors contribute significantly to the system's errors.
Some challenging characteristics of the seen NMSU data are its extensive semantic ambiguity, numbers mistaken by the input parser for dates (for example, phone numbers treated as dates), and the occurrence of subdialogs.
Most of the system's errors on the unambiguous data are due to parser errors, errors in applying the rules, mistaking anaphoric references for deictic references (and vice versa), and choosing the wrong anaphoric relation. As will be shown in Section 8.1, our approach handles focus effectively, so few errors can be attributed to the wrong entities being in focus.