Examination of the Rulesets
A subset of the rules learned by the system that uses automatic features for
Exchanges 1&2 is given in Figure 11 (row 3, Table
5). One observation from these hypotheses is the
classifier's preference for the asr-duration feature over the
feature for the number of words recognized (recog-numwords).
One would expect longer utterances to be more difficult, and the
learned rulesets indicate that duration is a better measure of
utterance length than the number of words. Another observation is the
usefulness of the SLU confidence scores and the SLU
salience-coverage in predicting problematic dialogues. These features
seem to provide good general indicators of the system's success in
recognition and understanding. The fact that the main focus of the
rules is detecting ASR and SLU errors and that none of the
DM behaviors are used as predictors also indicates that, in all
likelihood, the DM is performing as well as it can, given the noisy
input that it is getting from ASR and SLU. An alternative
view is that two utterances are not enough to provide meaningful
dialogue features such as counts and percentages of reprompts,
confirmations, etc..
One can see that the top two rules use auto-SLU-success. The
first rule states that if no recognition is predicted for the
second exchange (via auto-SLU-success), then the
dialogue will fail. The second rule is more interesting: it states
that if a misunderstanding has been predicted for the second exchange,
the system label is DIAL-FOR-ME, and the utterance is long, then
the dialogue will fail. In other words, the system frequently
misinterprets long utterances as DIAL-FOR-ME requests, resulting in task
failure.
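Read procedurally, these two rules amount to a small decision procedure. The sketch below restates them in Python under stated assumptions: the feature names and outcome values follow the text, but the length threshold is invented, since the text only characterizes the utterance as "long".

    def predict_problematic(exch2):
        """Hedged restatement of the two top rules described above.
        LONG_UTTERANCE_SECS is an assumed threshold, not from the paper."""
        LONG_UTTERANCE_SECS = 10.0  # assumption

        # Rule 1: no recognition predicted for the second exchange
        # (via auto-SLU-success) -> the dialogue will fail.
        if exch2["auto-SLU-success"] == "no-recognition":
            return "PROBLEMATIC"

        # Rule 2: a predicted misunderstanding on the second exchange,
        # system label DIAL-FOR-ME, and a long utterance -> failure.
        if (exch2["auto-SLU-success"] == "misunderstanding"
                and exch2["sys-label"] == "DIAL-FOR-ME"
                and exch2["asr-duration"] > LONG_UTTERANCE_SECS):
            return "PROBLEMATIC"

        return "NOT-PROBLEMATIC"  # default class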
Figure 12 gives a subset of the ruleset for the
TASK-INDEPT feature set for Exchanges 1&2. One can see a
similarity between this ruleset and the one given in Figure
11. This is because, when all the
automatic features are available, RIPPER tends to pick
out the more general task-independent ones, with the exception of sys-label. Comparing the second rule in each figure, one can
see that RIPPER uses recog-numwords as a substitute for
the task-specific feature sys-label, as sketched below.
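To make the substitution concrete, a hedged sketch: where the full-feature rule can test the task-specific sys-label, the task-independent rule can only test recog-numwords as a proxy. Both thresholds below are illustrative assumptions; the actual learned conditions appear in Figures 11 and 12.

    # Full automatic feature set: RIPPER may test the task-specific label.
    def rule2_full(exch2):
        return (exch2["auto-SLU-success"] == "misunderstanding"
                and exch2["sys-label"] == "DIAL-FOR-ME"
                and exch2["asr-duration"] > 10.0)   # assumed threshold

    # TASK-INDEPT feature set: sys-label is unavailable, so the learner
    # falls back on recog-numwords (threshold again assumed).
    def rule2_task_indept(exch2):
        return (exch2["auto-SLU-success"] == "misunderstanding"
                and exch2["recog-numwords"] > 8
                and exch2["asr-duration"] > 10.0)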