Results

As mentioned in Section 3, the main results are based on comparisons against human annotation of the held-out test data, using straight field-by-field comparisons of the Temporal Unit representations introduced in Section 3. To be considered correct, information must not only be right but must also be in the right place. Thus, for example, ``Monday'' correctly resolved to Monday 19 August, but incorrectly treated as a starting rather than an ending time, contributes 3 errors of omission and 3 errors of commission (and receives no credit for the correct date).

Detailed results for the test sets are presented in this section, starting with results for the CMU data (see Table 2). Accuracy measures the extent to which the system produces the correct answer, while precision measures the extent to which the system's answers are correct (see the formulas in Table 2). For each component of the extracted temporal structure, the system's correct and incorrect answers were counted. Since null values occur quite often, these counts exclude cases in which the system's answer, the correct answer, or both answers are null. Those cases were counted separately. Note that each test set contains three complete dialogs with an average of 72 utterances per dialog.
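
To make the scoring concrete, the following sketch shows how a single field of a Temporal Unit could be tallied into the categories used in the tables below (Correct, Incorrect, Missing, Extra, Null, as defined in the legend of Table 2). This is an illustrative reconstruction in Python, not the evaluation software used in the study; the function name and the use of None for empty fields are assumptions.

def classify_field(system_value, key_value):
    """Tally one Temporal Unit field against the human-annotated key.
    Returns the scoring category used in the evaluation tables.
    (Illustrative sketch; the actual evaluation software may differ.)"""
    if system_value is None and key_value is None:
        return "Nul"   # both system and key give a null answer
    if system_value is None:
        return "Mis"   # system has a null value for a non-null key
    if key_value is None:
        return "Ext"   # system has a non-null value for a null key
    if system_value == key_value:
        return "Cor"   # system and key agree on a non-null value
    return "Inc"       # system and key differ on non-null values

# The "Monday" example above: the date is resolved correctly but placed in
# the start fields instead of the end fields, so each affected field counts
# as Ext on the start side and Mis on the end side -- three errors of
# commission plus three of omission, with no credit for the correct date.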


 
Table 2: Evaluation of System on CMU Test Data
Label Cor Inc Mis Ext Nul Poss Act BaseAcc Acc Prec
start                    
Month 49 3 7 3 0 59 55 0.338 0.831 0.891
Date 48 4 7 3 0 59 55 0.403 0.814 0.873
DayofWeek 46 6 7 3 0 59 55 0.242 0.780 0.836
HourMin 18 0 7 0 37 62 55 0.859 0.887 1.000
TimeDay 9 0 18 0 35 62 44 0.615 0.710 1.000
end                    
Month 48 3 7 1 3 61 55 0.077 0.836 0.927
Date 47 5 6 3 1 59 56 0.048 0.814 0.857
DayofWeek 45 7 6 3 1 59 56 0.077 0.780 0.821
HourMin 9 0 9 0 44 62 53 0.862 0.855 1.000
TimeDay 4 0 13 1 44 61 49 0.738 0.787 0.980
Overall 323 28 87 17 165 534 604 0.428 0.809 0.916

Legend
Cor(rect): System and key agree on non-null value
Inc(orrect): System and key differ on non-null value
Mis(sing): System has null value for non-null key
Ext(ra): System has non-null value for null key
Nul(l): Both system and key give null answer
Poss(ible): Correct + Incorrect + Missing + Null
Act(ual): Correct + Incorrect + Extra + Null
Base(line)Acc(uracy): Baseline accuracy (input used as is)
Acc(uracy): % Key values matched correctly ((Correct + Null)/Possible)
Prec(ision): % System answers matching the key ((Correct + Null)/Actual)

These results show that the system achieves an overall accuracy of 81%, which is significantly better than the baseline accuracy (defined below) of 43%. In addition, the results show a high precision of 92%. For some individual fields, however, there is room for improvement, and several factors contribute. For example, our system development inevitably focused more on some fields than on others; an obvious area for improvement is the system's processing of the time of day fields. Also, note that the values in the Mis column are higher than those in the Ext column. This reflects the conservative coding convention, mentioned in Section 3, for filling in unspecified end points.
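
For reference, the derived columns in Tables 2 and 3 follow directly from the five raw counts via the formulas in the legend. The small check below (an illustrative Python sketch, not the original evaluation code) reproduces the start Month row of Table 2.

def derived_scores(cor, inc, mis, ext, nul):
    """Compute Poss, Act, Acc, and Prec from the five raw counts,
    following the formulas given in the legend of Table 2."""
    possible = cor + inc + mis + nul           # Poss
    actual = cor + inc + ext + nul             # Act
    accuracy = (cor + nul) / possible          # Acc  = (Correct + Null) / Possible
    precision = (cor + nul) / actual           # Prec = (Correct + Null) / Actual
    return possible, actual, accuracy, precision

# Start Month row of Table 2: Cor=49, Inc=3, Mis=7, Ext=3, Nul=0
print(derived_scores(49, 3, 7, 3, 0))   # (59, 55, 0.830..., 0.890...)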

The accuracy and precision figures for the hour and minute and time of day fields are very high because a large proportion of the answers are null. We include null correct answers in our figures because such answers often reflect valid decisions not to fill in explicit values from previous Temporal Units.

Table 3 contains the results for the system on the NMSU data. It shows that the system performs respectably, with 69% accuracy and 88% precision, on this more complex set of data. The precision is still comparable, but the accuracy is lower, since more of the entries are left unspecified (that is, the figures in the Mis column in Table 3 are higher than in Table 2). Furthermore, the baseline accuracy (29%) is almost 15 percentage points lower than that for the CMU data (43%), supporting the claim that this data set is more challenging.


 
Table 3: Evaluation of System on NMSU Test Data
Label Cor Inc Mis Ext Nul Poss Act BaseAcc Acc Prec
start                    
Month 55 0 23 5 3 63 81 0.060 0.716 0.921
Date 49 6 23 5 3 63 81 0.060 0.642 0.825
DayofWeek 52 3 23 5 3 63 81 0.085 0.679 0.873
HourMin 34 3 7 6 36 79 80 0.852 0.875 0.886
TimeDay 18 8 31 2 27 55 84 0.354 0.536 0.818
end                    
Month 55 0 23 5 3 63 81 0.060 0.716 0.921
Date 49 6 23 5 3 63 81 0.060 0.642 0.825
DayofWeek 52 3 23 5 3 63 81 0.060 0.679 0.873
HourMin 28 2 13 1 42 73 85 0.795 0.824 0.959
TimeDay 9 2 32 5 38 54 81 0.482 0.580 0.870
Overall 401 33 221 44 161 639 816 0.286 0.689 0.879
 

The baseline accuracies for the test data sets are shown in Table 4. These values were derived by disabling all the rules and evaluating the input itself (after performing normalization, so that the evaluation software could be applied). Since null values are the most frequent for all fields, this is equivalent to using a naive algorithm that selects the most frequent value for each field. Note that in Tables 2 and 3, the baseline accuracies for the end month, date, and day of week fields are quite low because the coding convention calls for filling in these fields, even though they are not usually explicitly specified. In this case, an alternative baseline would have been to use the corresponding starting field. This has not been calculated, but the results can be approximated by using the baseline figures for the starting fields.

The rightmost column of Table 4 shows that there is a small amount of error in the input representation. This figure is 1 minus the precision of the input representation (after normalization). Note, however, that this is a close but not exact measure of the error in the input, because there are a few cases in which the normalization process introduces errors and a few in which it corrects them. Recall that the input is ambiguous; the figures in Table 4 are based on the system selecting the first ILT in each case. Since the parser orders the ILTs based on a measure of acceptability, this choice is likely to contain the relevant temporal information.


 
Table 4: Baseline Figures for Both Test Sets
Set Cor Inc Mis Ext Nul Act Poss Acc Input Error
cmu 84 6 360 10 190 290 640 0.428 0.055
nmsu 65 3 587 4 171 243 826 0.286 0.029
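
As a worked check of these definitions, the cmu row of Table 4 can be reproduced from its raw counts; this is an illustrative calculation only, not part of the original evaluation code.

# Checking the cmu row of Table 4 against the definitions above.
cor, inc, mis, ext, nul = 84, 6, 360, 10, 190
possible = cor + inc + mis + nul            # 640
actual = cor + inc + ext + nul              # 290
baseline_accuracy = (cor + nul) / possible  # 274 / 640 = 0.428
input_error = 1 - (cor + nul) / actual      # 1 - 274 / 290 = 0.055 (approx.)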
 

The above results are for the system taking ambiguous semantic representations as input. To help isolate errors due to our model, the system was also evaluated on unambiguous, partially corrected input for all the seen data (the test sets were retained as unseen test data). The input is only partially corrected because some errors are not feasible to correct manually, given the complexity of the input representation.

The overall results are shown in Table 5. The table includes the results presented earlier in Tables 2 and 3, to facilitate comparison. In the CMU data set, there are twelve dialogs in the training data and three dialogs in a held-out test set; the average length of a dialog is approximately 65 utterances. In the NMSU data set, there are four training dialogs and three test dialogs.

 
Table 5: Overall Results
Seen/Unseen Set Input Dialogs Utterances Acc Prec
seen cmu ambiguous, uncorrected 12 659 0.883 0.918
seen cmu unambiguous, partially corrected 12 659 0.914 0.957
unseen cmu ambiguous, uncorrected 3 193 0.809 0.916
seen nmsu ambiguous, uncorrected 4 358 0.679 0.746
seen nmsu unambiguous, partially corrected 4 358 0.779 0.850
unseen nmsu ambiguous, uncorrected 3 236 0.689 0.879

 

In both data sets, there are noticeable gains in performance on the seen data when moving from ambiguous to unambiguous input, especially for the NMSU data. This indicates that semantic ambiguity and input errors contribute significantly to the system's errors.

Some challenging characteristics of the seen NMSU data are extensive semantic ambiguity, numbers mistaken by the input parser for dates (for example, phone numbers treated as dates), and the occurrence of subdialogs.

Most of the system's errors on the unambiguous data are due to parser errors, errors in applying the rules, mistaking anaphoric references for deictic references (and vice versa), and choosing the wrong anaphoric relation. As will be shown in Section 8.1, our approach handles focus effectively, so few errors can be attributed to the wrong entities being in focus.

