Next: Hand-labelled Features
Up: Problematic Dialogue Predictor Results
Previous: Problematic Dialogue Predictor Results
Rows 4 and 5 give the results using the AUTO, TASK-INDEPT
feature set described in Figure 9 without and with the
auto-SLU-success feature, respectively. Both results are
significantly above the baseline by a paired t-test; for example, Exchanges
1&2 give an increase of 13.1% (df=866, t=8.6, p=0.001) using
TASK-INDEPT features with auto-SLU-success. Comparing rows
4 and 5, one observes that accuracy for the AUTO, TASK-INDEPT
feature set increases when the auto-SLU-success feature is added, for both
Exchanges 1&2 and the whole dialogue. The 1.9% increase for Exchanges
1&2 shows a trend (df=866, t=1.7, p=0.074), whereas the 2.0% increase
for the whole dialogue is statistically significant by a paired t-test
(df=866, t=3.0, p=0.003).
Although the TASK-INDEPT feature set is a subset of the features used in row 3, it can still perform better because the TASK-INDEPT features are more
general, and because RIPPER uses a greedy algorithm to discover
its rule sets. For Exchanges 1&2, the increase from row 3 to row 5 (both of which use auto-SLU-success) is not significant. Comparing rows 2 and 4, neither of which uses auto-SLU-success, one sees a slight degradation in results for the whole dialogue using TASK-INDEPT features. However, the increase from row 2 to row 5, from 78.1% to 80.3% for Exchanges 1&2, is statistically significant (df=866, t=2.0, p=0.042). This shows that combining auto-SLU-success with the TASK-INDEPT features produces a statistically significant increase in accuracy over a set of automatic features that does not include this feature.
Since the main purpose of these experiments is to determine whether a
dialogue is potentially problematic while it is still in progress, using the whole
dialogue is not useful in a dynamic system. Using Exchanges 1&2
produces accurate results and would enable the system to adapt in
order to complete the dialogue in an appropriate manner.
Table 5: Accuracy % results for predicting problematic dialogues.

Row | Features                                 | Exchange 1 | Exchanges 1&2 | Whole
 1  | Baseline                                 | 67.1       | 67.1          | 67.1
 2  | AUTO (no auto-SLU-success)               | 70.1       | 78.1          | 87.0
 3  | AUTO + auto-SLU-success                  | 69.6       | 79.2          | 84.9
 4  | AUTO, TASK-INDEPT (no auto-SLU-success)  | 70.1       | 78.4          | 83.4
 5  | AUTO, TASK-INDEPT + auto-SLU-success     | 69.2       | 80.3          | 85.4
 6  | AUTO + SLU-success                       | 75.6       | 85.7          | 92.9
 7  | ALL (AUTO + Hand-labelled)               | 77.1       | 86.9          | 91.7
Helen Hastie
2002-05-09