Coverage and Ambiguity of the Relations Defined in the Model

The evaluations presented in this section required detailed, time-consuming manual annotations. The system's annotations would not suffice, because the implementation does not perfectly recognize when a rule is applicable. A sample of four randomly selected dialogs in the CMU training set and the four dialogs in the NMSU training set were annotated.

The counts derived from the manual annotations for this section are defined below. Because this section focuses on the relations, we consider them at the more specific level of the deictic and anaphoric rules presented in Online Appendix 1. In addition, we do not allow trivial extensions of the relations, as we did in the evaluation of the focus model (Section 8.1). The criterion for correctness in this section is the same as for the evaluation of the system: a field-by-field exact match with the manually annotated correct interpretations. There is one exception. The starting and end time of day fields are ignored, since these are known weaknesses of the rules, and they represent a relatively minor proportion of the overall temporal interpretation.

The values for each data set, together with coverage and ambiguity evaluations, are presented in Table 7.

Table 7: Coverage and Ambiguity

CMU Training Set
4 randomly selected dialogs
TimeRefs	TimeRefsC	Interp	CorrI	IncI	DiffI	DiffICorr
78	74	165	142	23	91	85

Coverage (TimeRefsC / TimeRefs) = 95%
Ambiguity (DiffICorr / TimeRefsC) = 1.15
Overall Ambiguity (DiffI / TimeRefs) = 1.17
Rule Redundancy (CorrI / TimeRefsC) = 142/74 = 1.92 %

NMSU Training Set
4 dialogs
TimeRefs	TimeRefsC	Interp	CorrI	IncI	DiffI	DiffICorr
98	83	210	154	56	129	106

Coverage (TimeRefsC / TimeRefs) = 85%
Ambiguity (DiffICorr / TimeRefsC) = 1.28
Overall Ambiguity (DiffI / TimeRefs) = 1.32
Rule Redundancy (CorrI / TimeRefsC) = 154 / 83 = 1.86 %

The ambiguity for both data sets is very low. The Ambiguity figure in Table 7 represents the average number of interpretations per temporal reference, considering only those for which the correct interpretation is possible (i.e., it is DiffICorr / TimeRefsC). The table also shows the ambiguity when all temporal references are included (i.e., DiffI / TimeRefs). As can be seen from the table, the average ambiguity in both data sets is much less than two interpretations per utterance.

The coverage of the relations can be evaluated as (TimeRefsC / TimeRefs), the percentage of temporal references for which at least one rule yields the correct interpretation. While the coverage of the NMSU data set, 85%, is not perfect, it is good, considering that the system was not developed on the NMSU data.

The data also show that there is often more than one way to achieve the correct interpretation. This is another type of redundancy: redundancy of the data with respect to the model. It is calculated in Table 7 as (CorrI / TimeRefsC), that is, the number of correct interpretations over the number of temporal references that have a correct interpretation. For both data sets, there are, on average, roughly two different ways to achieve the correct interpretation.

Table 8 shows the number of times each rule applies in total (column 3) and the number of times each rule is correct (column 2), according to our manual annotations. Column 4 shows the accuracies of the rules, i.e., (column 2 / column 3). The rule labels are the ones used in Online Appendix 1 to identify the rules.

Table 8: Rule Applicability Based on Manual Annotations

CMU Training Set
4 randomly selected dialogs
Rule	Correct	Total	Accuracy
D1	4	4	1.00
D2i	0	0	0.00
D2ii	35	40	0.88
a frame-of-reference deictic relation
D3	1	2	0.50
D4	0	0	0.00
D5	0	0	0.00
D6	2	2	1.00
A1	45	51	0.88
a co-reference anaphoric relation
A2	0	0	0.00
A3i	1	1	1.00
A3ii	35	37	0.95
a frame-of-reference anaphoric rel.
A4	14	18	0.78
a modify anaphoric relation
A5	0	0	0.00
A6i	2	2	1.00
A6ii	1	1	1.00
A7	0	1	0.00
A8	0	0	0.00

NMSU Training Set
4 dialogs
Rule	Correct	Total	Accuracy
D1	4	4	1.00
D2i	0	0	0.00
D2ii	24	36	0.67
a frame-of-reference deictic relation
D3	6	9	0.67
D4	0	1	0.00
D5	0	0	0.00
D6	0	0	0.00
A1	57	68	0.84
a co-reference anaphoric relation
A2	5	5	1.00
A3i	0	0	0.00
A3ii	21	32	0.66
a frame-of-reference anaphoric rel.
A4	27	37	0.73
a modify anaphoric relation
A5	0	1	0.00
A6i	7	9	0.78
A6ii	0	0	0.00
A7	0	0	0.00
A8	0	0	0.00

The same four rules are responsible for the majority of applications in both data sets, the ones labeled D2ii, A1, A3ii, and A4. The first is an instance of the frame of reference deictic relation, the second is an instance of the co-reference anaphoric relation, the third is an instance of the frame of reference anaphoric relation, and the fourth is an instance of the modify anaphoric relation.

How often the system considers and actually uses each rule is shown in Table 9. Specifically, the column labeled Fires shows how often each rule applies, and the column labeled Used shows how often each rule is used to form the final interpretation. To help isolate the accuracies of the rules, these experiments were performed on unambiguous data. Comparing this table with Table 8, we see that the same four rules shown to be the most important by the manual annotations are also responsible for the majority of the system's interpretations. This holds for both the CMU and NMSU data sets.

Table 9: Rule Activation by the System on Unambiguous Data

CMU data set
Name	Used	Fires
D1	16	16
D2i	1	3
D2ii	78	90
a frame-of-reference deictic relation
D3	5	5
D4	9	9
D5	0	1
D6	2	2
A1	95	110
a co-reference anaphoric relation
A2	2	24
A3i	1	1
A3ii	72	86
a frame-of-reference anaphoric rel.
A4	45	80
a modify anaphoric relation
A5	4	5
A6i	10	10
A6ii	0	0
A7	0	0
A8	1	1

NMSU data set
Name	Used	Fires
D1	4	4
D2i	2	2
D2ii	20	31
a frame-of-reference deictic relation
D3	2	3
D4	0	0
D5	0	0
D6	0	0
A1	46	65
a co-reference anaphoric relation
A2	6	12
A3i	0	2
A3ii	18	27
a frame-of-reference anaphoric rel.
A4	24	42
a modify anaphoric relation
A5	3	5
A6i	6	8
A6ii	0	0
A7	0	0
A8	0	0