Detailed results for the test sets are presented in this section, starting with results for the CMU data (see Table 2). Accuracy measures the extent to which the system produces the correct answer, while precision measures the extent to which the system's answers are correct (see the formulas in Table 2). For each component of the extracted temporal structure, the system's correct and incorrect answers were counted. Since null values occur quite often, these counts exclude cases in which the system's answer, the correct answer, or both answers are null. Those cases were counted separately. Note that each test set contains three complete dialogs with an average of 72 utterances per dialog.
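For concreteness, one standard scoring scheme consistent with these descriptions is sketched below; the exact formulas are the ones given in Table 2, so this is only an assumed reading, with cor, inc, mis, and ext denoting the numbers of correct, incorrect, missing, and extraneous answers (the latter two presumably being what the Mis and Ext columns report):

$$\mathit{accuracy} = \frac{\mathit{cor}}{\mathit{cor} + \mathit{inc} + \mathit{mis}}, \qquad \mathit{precision} = \frac{\mathit{cor}}{\mathit{cor} + \mathit{inc} + \mathit{ext}}.$$

Under this reading, accuracy penalizes the system for answers it fails to produce, whereas precision considers only the answers it does produce.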
These results show that the system achieves an overall accuracy of 81%, significantly better than the baseline accuracy (defined below) of 43%, along with a high precision of 92%. For some individual fields, however, the results are lower than they could be, for several reasons. For example, system development inevitably focused more on some fields than others; an obvious area for improvement is the processing of the time of day fields. Note also that the values in the Mis column are higher than those in the Ext column, reflecting the conservative coding convention, mentioned in Section 3, for filling in unspecified end points.
The accuracy and precision figures for the hour & minute and time of day fields are very high because a large proportion of them are null. We include null correct answers in our figures because such answers often reflect valid decisions not to fill in explicit values from previous Temporal Units.
Table 3 contains the results for the system on the NMSU data. It shows that the system performs respectably, with 69% accuracy and 88% precision, on the more complex set of data. The precision is still comparable, but the accuracy is lower, since more of the entries are left unspecified (that is, the figures in the Mis column in Table 3 are higher than in Table 2). Furthermore, the baseline accuracy (29%) is nearly 15 percentage points lower than that for the CMU data (43%), supporting the claim that this data set is more challenging.
The baseline accuracies for the test data sets are shown in Table 4. These values were derived by disabling all the rules and evaluating the input itself (after performing normalization, so that the evaluation software could be applied). Since null values are the most frequent for all fields, this is equivalent to using a naive algorithm that selects the most frequent value for each field. Note that in Tables 2 and 3, the baseline accuracies for the end month, date, and day of week fields are quite low because the coding convention calls for filling in these fields, even though they are not usually explicitly specified. In this case, an alternative baseline would have been to use the corresponding starting field. This has not been calculated, but the results can be approximated by using the baseline figures for the starting fields.
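As an illustration, the sketch below implements the naive most-frequent-value algorithm that this baseline is equivalent to. It is written in Python with hypothetical record and field structures, and is not the evaluation software used in these experiments:

    from collections import Counter

    def most_frequent_value_baseline(train_records, test_records, fields):
        """Predict, for every field, the most frequent value in the training data.

        train_records and test_records are lists of dicts mapping field names to
        annotated values, with None standing for null (a hypothetical format).
        Returns the per-field predictions and their accuracy on the test records.
        """
        predictions = {}
        for field in fields:
            counts = Counter(rec.get(field) for rec in train_records)
            predictions[field] = counts.most_common(1)[0][0]  # most often None (null)

        accuracy = {}
        for field in fields:
            hits = sum(1 for rec in test_records if rec.get(field) == predictions[field])
            accuracy[field] = hits / len(test_records)
        return predictions, accuracy

Because null is the most frequent value for every field in this data, the per-field predictions all come out null, which is why scoring the raw (normalized) input with the rules disabled yields the same figures.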
The rightmost column of Table 4 shows that there is a small amount of error in the input representation. This figure is 1 minus the precision of the input representation (after normalization). Note, however, that this is a close but not exact measure of the error in the input, because there are a few cases in which the normalization process introduces errors and a few in which it corrects them. Recall that the input is ambiguous; the figures in Table 4 are based on the system selecting the first ILT in each case. Since the parser orders the ILTs by a measure of acceptability, this choice is the one most likely to contain the relevant temporal information.
The above results are for the system taking ambiguous semantic representations as input. To help isolate errors due to our model, the system was also evaluated on unambiguous, partially corrected input for all the seen data (the test sets were retained as unseen test data). The input is only partially corrected because some errors are not feasible to correct manually, given the complexity of the input representation.
The overall results are shown in Table 5, which also includes the results presented earlier in Tables 2 and 3 to facilitate comparison. In the CMU data set, there are twelve dialogs in the training data and three dialogs in a held-out test set; the average length of each dialog is approximately 65 utterances. In the NMSU data set, there are four training dialogs and three test dialogs.
In both data sets, performance on the seen data improves noticeably when moving from ambiguous to unambiguous input, especially for the NMSU data. This indicates that semantic ambiguity and input errors contribute significantly to the system's errors.
Some challenging characteristics of the seen NMSU data are its extensive semantic ambiguity, numbers mistaken by the input parser for dates (for example, phone numbers treated as dates), and the occurrence of subdialogs.
Most of the system's errors on the unambiguous data are due to parser errors, errors in applying the rules, mistaking anaphoric references for deictic references (and vice versa), and choosing the wrong anaphoric relation. As will be shown in Section 8.1, our approach handles focus effectively, so few errors can be attributed to the wrong entities being in focus.