Examination of the Rulesets
A subset of the rules learned by the system that uses automatic features for
Exchanges 1&2 is given in Figure 11 (row 3, Table
5). One observation from these hypotheses is the
classifier's preference for the asr-duration feature over the
feature for the number of words recognized (recog-numwords).
One would expect longer utterances to be more difficult, and the
learned rulesets indicate that duration is a better measure of
utterance length than the number of words. Another observation is the
usefulness of the SLU confidence scores and the SLU
salience-coverage in predicting problematic dialogues. These features
seem to provide good general indicators of the system's success in
recognition and understanding. The fact that the main focus of the
rules is detecting ASR and SLU errors and that none of the
DM behaviors are used as predictors also indicates that, in all
likelihood, the DM is performing as well as it can, given the noisy
input that it is getting from ASR and SLU. An alternative
view is that two utterances are not enough to provide meaningful
dialogue features such as counts and percentages of reprompts,
confirmations, etc..
One can see that the top two rules use auto-SLU-success. The
first rule states that if no recognition is predicted for the
second exchange (via auto-SLU-success), then the
dialogue will fail. The second rule is more interesting: it states
that if a misunderstanding has been predicted for the second exchange,
the system label is DIAL-FOR-ME, and the utterance is long, then
the dialogue will fail. In other words, the system frequently
misinterprets long utterances as DIAL-FOR-ME requests, resulting in task
failure.
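Read procedurally, these two rules amount to a small decision procedure. The sketch below restates them in Python under stated assumptions: the feature names and outcome values follow the text, but the length threshold is invented, since the text only characterizes the utterance as "long".

    def predict_problematic(exch2):
        """Hedged restatement of the two top rules described above.
        LONG_UTTERANCE_SECS is an assumed threshold, not from the paper."""
        LONG_UTTERANCE_SECS = 10.0  # assumption

        # Rule 1: no recognition predicted for the second exchange
        # (via auto-SLU-success) -> the dialogue will fail.
        if exch2["auto-SLU-success"] == "no-recognition":
            return "PROBLEMATIC"

        # Rule 2: a predicted misunderstanding on the second exchange,
        # system label DIAL-FOR-ME, and a long utterance -> failure.
        if (exch2["auto-SLU-success"] == "misunderstanding"
                and exch2["sys-label"] == "DIAL-FOR-ME"
                and exch2["asr-duration"] > LONG_UTTERANCE_SECS):
            return "PROBLEMATIC"

        return "NOT-PROBLEMATIC"  # default class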
Figure 12 gives a subset of the ruleset for the
TASK-INDEPT feature set for Exchanges 1&2. One can see a
similarity between this ruleset and the one given in Figure
11. This is because, when all the
automatic features are available, RIPPER tends to pick
out the more general task-independent ones, with the exception of sys-label. Comparing the second rule in each figure, one can
see that RIPPER uses recog-numwords as a substitute for
the task-specific feature sys-label, as sketched below.
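To make the substitution concrete, a hedged sketch: where the full-feature rule can test the task-specific sys-label, the task-independent rule can only test recog-numwords as a proxy. Both thresholds below are illustrative assumptions; the actual learned conditions appear in Figures 11 and 12.

    # Full automatic feature set: RIPPER may test the task-specific label.
    def rule2_full(exch2):
        return (exch2["auto-SLU-success"] == "misunderstanding"
                and exch2["sys-label"] == "DIAL-FOR-ME"
                and exch2["asr-duration"] > 10.0)   # assumed threshold

    # TASK-INDEPT feature set: sys-label is unavailable, so the learner
    # falls back on recog-numwords (threshold again assumed).
    def rule2_task_indept(exch2):
        return (exch2["auto-SLU-success"] == "misunderstanding"
                and exch2["recog-numwords"] > 8
                and exch2["asr-duration"] > 10.0)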