Next: Hand-labelled Features
Up: Problematic Dialogue Predictor Results
Previous: Problematic Dialogue Predictor Results
Rows 4 and 5 give the results using the AUTO, TASK-INDEPT
feature set described in Figure 9 without and with the
auto-SLU-success feature, respectively. Both results are
significantly above the baseline by a paired t-test; for example, Exchanges
1&2 give an increase of 13.1% (df=866, t=8.6, p=0.001) using
TASK-INDEPT features with auto-SLU-success. Comparing rows
4 and 5, one observes that accuracy for the AUTO, TASK-INDEPT
feature set increases when the auto-SLU-success feature is added, for both
Exchanges 1&2 and the whole dialogue. The 1.9% increase for Exchanges
1&2 shows a trend (df=866, t=1.7, p=0.074), whereas the 2.0% increase
for the whole dialogue is statistically significant by a paired t-test
(df=866, t=3.0, p=0.003).
Although the TASK-INDEPT feature set is a subset of the features used in row 3, it can still perform better because the TASK-INDEPT features are more
general, and because RIPPER uses a greedy algorithm to discover
its rule sets. For Exchanges 1&2, the increase from row 3 to row 5 (both of which use auto-SLU-success) is not significant. Comparing rows 2 and 4, neither of which uses auto-SLU-success, one sees a slight degradation in results for the whole dialogue using TASK-INDEPT features. However, the increase from row 2 to row 5, from 78.1% to 80.3% for Exchanges 1&2, is statistically significant (df=866, t=2.0, p=0.042). This shows that combining auto-SLU-success with the TASK-INDEPT features produces a statistically significant increase in accuracy over a set of automatic features that does not include this feature.
Since the main purpose of these experiments is to determine whether a
dialogue is potentially problematic while it is still in progress, using the whole
dialogue is not useful in a dynamic system. Using Exchanges 1&2
produces accurate results and would enable the system to adapt in
order to complete the dialogue in an appropriate manner.
Table 5: Accuracy % results for predicting problematic dialogues.

Row | Features                                 | Exchange 1 | Exchanges 1&2 | Whole
 1  | Baseline                                 | 67.1       | 67.1          | 67.1
 2  | AUTO (no auto-SLU-success)               | 70.1       | 78.1          | 87.0
 3  | AUTO + auto-SLU-success                  | 69.6       | 79.2          | 84.9
 4  | AUTO, TASK-INDEPT (no auto-SLU-success)  | 70.1       | 78.4          | 83.4
 5  | AUTO, TASK-INDEPT + auto-SLU-success     | 69.2       | 80.3          | 85.4
 6  | AUTO + SLU-success                       | 75.6       | 85.7          | 92.9
 7  | ALL (AUTO + Hand-labelled)               | 77.1       | 86.9          | 91.7
Helen Hastie
2002-05-09