It is interesting to examine which types of features are most discriminatory in determining whether or not a dialogue is problematic. RIPPER was trained separately on sets of features based on the groups given in Figure 7, namely Acoustic/ASR, SLU, Dialogue, and Hand-labelled (including SLU-success). These results are given in Table 9.
For Exchange 1, only the SLU features, among the automatic feature sets, yield an improvement over the baseline. Interestingly, training the system on the Acoustic/ASR features yields the best result among the automatic feature sets for Exchanges 1&2 and for the whole dialogue. These systems use features such as asr-duration, the number of recognized words, and the type of recognition grammar in their rulesets.
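To make the per-feature-group setup concrete, the following sketch trains one classifier per group and reports cross-validated accuracy. The file name, column names, and group assignments are illustrative rather than taken from our data, and scikit-learn's DecisionTreeClassifier stands in for RIPPER, for which no standard scikit-learn implementation exists.

```python
# Sketch: train one classifier per feature group and compare accuracies.
# All column names and the CSV file are hypothetical placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("dialogues.csv")      # hypothetical: one row per dialogue
y = df["problematic"]                  # binary label: problematic or not

feature_groups = {                     # illustrative group membership
    "Acoustic/ASR":  ["asr-duration", "num-recognized-words", "grammar-type"],
    "SLU":           ["salience-coverage", "context-shift"],
    "Dialogue":      ["sys-turns", "reprompts"],
    "Hand-labelled": ["hlt-SLU-success"],
}

for name, cols in feature_groups.items():
    X = pd.get_dummies(df[cols])       # one-hot encode categorical features
    acc = cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()
    print(f"{name:14s} accuracy: {acc:.3f}")
```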
Finally, we give results for systems trained only on auto-SLU-success and hlt-SLU-success. There is little difference between the two sets of results. For Exchanges 1&2, however, the system trained on hlt-SLU-success achieves an accuracy significantly higher than that of the system trained on auto-SLU-success, by a paired t-test (df=866, t=3.0, p=0.03). On examining the rulesets, one finds that the hlt-SLU-success ruleset uses RPARTIAL-MISMATCH where the auto-SLU-success ruleset does not. The lower accuracy may be due to the fact that the auto-SLU-success predictor has low recall and precision for RPARTIAL-MISMATCH, as seen in Table 2.
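A test of this form can be reproduced along the following lines as a paired t-test over the per-dialogue correctness of the two systems on the same test items; the correctness vectors below are random placeholders, with n=867 pairs matching df=866.

```python
# Sketch: paired t-test comparing two systems on the same test dialogues.
import numpy as np
from scipy.stats import ttest_rel

# 1/0 correctness of each system on the same 867 dialogues (placeholder data)
correct_hlt  = np.random.binomial(1, 0.87, size=867)
correct_auto = np.random.binomial(1, 0.84, size=867)

t, p = ttest_rel(correct_hlt, correct_auto)
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```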