Spoken dialogue systems promise efficient and natural access to a large variety of information sources and services from any phone. Systems that support short utterances to select a particular function (through a prompt such as "Say credit card, collect or person-to-person") are saving companies millions of dollars per year. Deployed systems and research prototypes exist for applications such as personal email and calendars, travel and restaurant information, and personal banking [Baggia et al. 1998, Walker et al. 1998, Seneff et al. 1995, Sanderman et al. 1998, Chu-Carroll & Carpenter 1999], inter alia. Yet there are still many research challenges: current systems are limited in the interaction they support and brittle in many respects.
This paper investigates methods by which spoken dialogue systems can learn, from their previous experience, to support more natural interaction. One way in which current spoken dialogue systems are quite limited is in their strategies for detecting and repairing problems that arise in conversation, such as misunderstandings due to speech recognition error or misinterpretation. If a problem can be detected, the system can either transfer the call to a human customer care agent or modify its dialogue strategy in an attempt to repair the problem. We can train systems to improve their ability to detect problems by exploiting dialogues collected in interactions with human users: the initial segments of these dialogues are used to train a Problematic Dialogue Predictor (PDP) that predicts whether a problem is likely to occur. The output of the PDP can be applied immediately to the system's decision of whether to transfer the call to a human customer care agent, or it could potentially serve as a cue for the system's Dialogue Manager to modify its behavior so as to repair problems, and perhaps even to prevent them.
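As a concrete, if hypothetical, illustration of this decision logic, the Python sketch below shows how a PDP's per-exchange estimate might be consulted. The thresholds, the pdp object, and the dialogue-manager methods are all invented for illustration; none of them are interfaces of any deployed system.

    TRANSFER_THRESHOLD = 0.8  # assumption: above this, give up on automation
    REPAIR_THRESHOLD = 0.5    # assumption: above this, adapt the dialogue strategy

    def handle_exchange(pdp, features, dialogue_manager):
        """Consult the PDP after an exchange and act on its estimate."""
        p_problem = pdp.predict_proba(features)  # estimated P(dialogue is problematic)
        if p_problem >= TRANSFER_THRESHOLD:
            # Route the call to a human customer care agent.
            return dialogue_manager.transfer_to_human()
        if p_problem >= REPAIR_THRESHOLD:
            # Attempt repair, e.g. by confirming or re-prompting more often.
            return dialogue_manager.adopt_repair_strategy()
        return dialogue_manager.continue_dialogue()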
In previous work, we reported initial results for training a PDP using a variety of different feature sets [Langkilde et al. 1999, Walker et al. 2000b]. When analyzing the performance of the fully automatic feature set, we examined which hand-labelled features produced large performance improvements, under the assumption that future work should focus on developing automatic features that approximate the information provided by these hand-labelled features. The analysis singled out the hand-labelled SLU-success feature, which encodes whether the spoken language understanding (SLU) component correctly captured the meaning of each exchange. When this hand-labelled feature was added to the automatic features, it improved the performance of the PDP by almost 7.6%. This finding led us to develop an SLU-success predictor [Walker et al. 2000c] and the new version of the PDP that we report on here. The new version of the PDP takes as input a fully automatic version of the SLU-success feature, which we call auto-SLU-success.
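To make the resulting pipeline concrete, the following minimal Python sketch shows the predicted auto-SLU-success label being appended to the automatic features that the PDP consumes. All feature names and interfaces here are our own assumptions for illustration, not the actual HMIHY code.

    from typing import Dict

    def extract_automatic_features(exchange: Dict) -> Dict:
        """Stand-in for fully automatic features drawn from ASR, SLU, and the dialogue."""
        return {
            "asr_confidence": exchange.get("asr_confidence", 0.0),
            "utterance_length": len(exchange.get("recognized_words", [])),
            "is_reprompt": exchange.get("is_reprompt", False),
        }

    def add_auto_slu_success(features: Dict, slu_success_predictor) -> Dict:
        """Append the per-exchange prediction of SLU success as one more feature."""
        features["auto_slu_success"] = slu_success_predictor.predict(features)
        return features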
We train and test both the auto-SLU-success predictor and the PDP on a corpus of 4692 dialogues collected in an experimental trial of AT&T's How May I Help You(SM) (HMIHY) spoken dialogue system [Gorin et al. 1997, Abella & Gorin 1999, Riccardi & Gorin 2000, Ammicht & Alonso 1999]. In this trial, the HMIHY system was installed at an AT&T customer care center, where it answered calls from live customer traffic and successfully automated a large number of customer requests. An example dialogue that HMIHY completed successfully is shown in Figure 1. The phone numbers, card numbers, and PIN numbers in the sample dialogues are artificial.
Note that the system's utterance in S4 consists of a repair initiation, motivated by the system's ability to detect that the user's utterance U3 was likely to have been misunderstood. The goal of the auto-SLU-success predictor is to improve the system's ability to detect such misunderstandings. The dialogues that have the desired outcome, in which HMIHY successfully automates the customer's call, are referred to as the TASKSUCCESS dialogues. Dialogues in which the HMIHY system did not successfully complete the caller's task are referred to as PROBLEMATIC. These are described in further detail below.
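For the prediction experiments that follow, these outcome categories reduce to a binary target. A trivial sketch of the mapping (our own simplification; the finer-grained problematic subcategories are deferred to the later discussion):

    def binary_label(outcome: str) -> int:
        """Map a dialogue outcome to the PDP's binary target:
        0 for TASKSUCCESS, 1 for any PROBLEMATIC outcome."""
        return 0 if outcome == "TASKSUCCESS" else 1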
This paper reports results from experiments that test whether it is possible to learn to automatically predict that a dialogue will be problematic on the basis of information the system has (1) early in the dialogue and (2) in real time. We train an automatic classifier for predicting problematic dialogues from features that can be automatically extracted from the HMIHY corpus. As described above, one of these features is the output of the auto-SLU-success predictor, the auto-SLU-success feature, which predicts whether or not the current utterance was correctly understood [Walker et al. 2000c]. The results show that it is possible to predict problematic dialogues using fully automatic features with an accuracy ranging from 69.6% to 80.3%, depending on whether the system has seen one exchange or two, and to identify problematic dialogues with an accuracy of up to 87%.
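The training setup can be sketched as follows. The experiments in this paper use the rule learner RIPPER (Section 3); since RIPPER has no standard Python implementation, a scikit-learn decision tree stands in for it in this sketch, and the feature matrices for one and two exchanges are assumed to be prepared elsewhere.

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def evaluate_pdp(X_exchange1, X_exchange12, y):
        """Compare predictive accuracy after one exchange vs. two."""
        clf = DecisionTreeClassifier(max_depth=5)  # stand-in for RIPPER's rule set
        acc_one = cross_val_score(clf, X_exchange1, y, cv=10).mean()
        acc_two = cross_val_score(clf, X_exchange12, y, cv=10).mean()
        return acc_one, acc_two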
Section 2 describes HMIHY and the dialogue corpus on which the experiments are based. Section 3 discusses the machine learning algorithm adopted, namely RIPPER, and describes the experimental design. Section 4 gives a breakdown of the features used in these experiments. Section 5 presents the method for predicting the auto-SLU-success feature and gives accuracy results. Section 6 presents the methods used to train the automatic Problematic Dialogue Predictor with RIPPER and gives the results. We delay our discussion of related work until Section 7, where we can compare it with our approach. Section 8 summarizes the paper and describes future work.