A dialogue consists of a sequence of exchanges, where each exchange consists of one turn by the system followed by one turn by the user. Each dialogue and exchange is encoded using the set of 53 features in Figure 7. Each feature was either automatically logged by one of the system modules, hand-labelled by humans, or derived from raw features. The hand-labelled features are used to produce a TOPLINE, an estimate of how well a classifier with access to perfect information could do. To see whether our results can generalize, we also experiment with a subset of task-independent features, described in detail below.
Features logged by the system are included because they are produced automatically, and thus can be used at runtime to alter the course of the dialogue. The system modules for which logging information was collected were the acoustic processor/automatic speech recognizer (ASR) (Riccardi and Gorin, 2000), the spoken language understanding (SLU) module (Gorin, Riccardi, and Wright, 1997), and the Dialogue Manager (DM) (Abella and Gorin, 1999). Each module and the features obtained from it are described below.
Automatic Speech Recognition: The automatic speech recognizer (ASR) takes as input the caller's speech and produces a potentially errorful transcription of what it believes the caller said. The ASR features for each exchange include the output of the speech recognizer (recog), the number of words in the recognizer output (recog-numwords), the duration in seconds of the input to the recognizer (asr-duration), a flag for touchtone input (dtmf-flag), the input modality expected by the recognizer (rg-modality) (one of: none, speech, touchtone, speech+touchtone, touchtone-card, speech+touchtone-card, touchtone-date, speech+touchtone-date, or none-final-prompt), and the grammar used by the recognizer (rg-grammar) (Riccardi and Gorin, 2000). We also calculate a feature called tempo by dividing the value of the asr-duration feature by the recog-numwords feature.
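The derivation of the tempo feature is simple enough to state as code. The following sketch is illustrative only; the function name and the guard for empty recognizer output are our own assumptions, not part of the system:

```python
def tempo(asr_duration: float, recog_numwords: int) -> float:
    """Seconds of speech per recognized word (asr-duration / recog-numwords).

    Illustrative sketch: the zero-word guard is an assumed convention,
    since the paper does not specify the value for empty output.
    """
    if recog_numwords == 0:
        return 0.0
    return asr_duration / recog_numwords
```

A high value (many seconds per word) suggests slow, hesitant, or pause-filled speech, which is the behaviour the feature is intended to capture.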
The motivation for the ASR features is that any one of them may reflect recognition performance, with a concomitant effect on spoken language understanding. For example, other work has found asr-duration to be correlated with incorrect recognition (Hirschberg, Litman, and Swerts, 1999). The name of the grammar (rg-grammar) could also be a predictor of SLU errors, since it is well known that the larger the grammar, the more likely an ASR error. In addition, the rg-grammar feature encodes expectations about user utterances at that point in the dialogue, which may correlate with differences in the ease with which any one recognizer could correctly understand the user's response. One motivation for the tempo feature is that previous work suggests that users tend to slow down their speech when the system has misunderstood them (Levow, 1998; Shriberg, Wade, and Price, 1992); this strategy actually leads to more errors, since the speech recognizer is not trained on this type of speech. The tempo feature may also indicate hesitations, pauses, or interruptions, which could also lead to ASR errors. On the other hand, touchtone input in combination with speech, as encoded by the feature dtmf-flag, might increase the likelihood of understanding the speech: since the touchtone input is unambiguous, it can constrain spoken language understanding.
Spoken Language Understanding: The goal of the spoken language understanding (SLU) module is to identify which of the 15 possible tasks the user is attempting and extract from the utterance any items of information that are relevant to completing that task, e.g. a phone number is needed for the task dial for me.
Fifteen of the features from the SLU module represent the distribution, over the 15 possible tasks, of the SLU module's confidence in its belief that the user is attempting each task (Gorin, Riccardi, and Wright, 1997). We also include features representing which task has the highest confidence score (top-task) and which has the second highest (nexttop-task), as well as the value of the highest confidence score (top-confidence) and the difference in values between the top and next-to-top confidence scores (diff-confidence).
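Given the per-task confidence distribution, the four summary features can be derived as in the following sketch. The dictionary layout and function name are our assumptions; only the feature definitions come from the text:

```python
def confidence_features(task_scores: dict) -> dict:
    """Derive top-task, nexttop-task, top-confidence, and diff-confidence
    from a mapping of task name -> SLU confidence score (assumed layout)."""
    ranked = sorted(task_scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_task, top_conf), (next_task, next_conf) = ranked[0], ranked[1]
    return {
        "top-task": top_task,
        "nexttop-task": next_task,
        "top-confidence": top_conf,
        "diff-confidence": top_conf - next_conf,
    }
```

A small diff-confidence indicates that two tasks received similar scores, the situation in which the Dialogue Manager may ask the user to choose between them.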
Other features represent other aspects of the SLU processing of the utterance. The inconsistency feature is an intra-utterance measure of semantic diversity, according to a task model of the domain (Abella and Gorin, 1999). Some task classes occur together quite naturally within a single statement or request, e.g. the dial for me task is compatible with the collect call task, but is not compatible with the billing credit task. The salience-coverage feature measures the proportion of the utterance which is covered by the salient grammar fragments. This may include the whole of a phone or card number if it occurs within a fragment. The context-shift feature is an inter-utterance measure of the extent of a shift of context away from the current task focus, caused by the appearance of salient phrases that are incompatible with it, according to a task model of the domain.
In addition, similar to the way we calculated the tempo feature, we normalize the salience-coverage and top-confidence features by dividing them by asr-duration to produce the salpertime and confpertime features. The tempo, confpertime, and salpertime features are used only for predicting auto-SLU-success.
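The two duration-normalized features can be sketched in the same way; the function name and the zero-duration guard are our own conventions:

```python
def per_second_features(salience_coverage: float,
                        top_confidence: float,
                        asr_duration: float) -> dict:
    """Derive salpertime and confpertime by dividing salience-coverage
    and top-confidence by asr-duration (seconds of recognizer input).

    Illustrative sketch; the zero-duration guard is an assumed convention.
    """
    if asr_duration <= 0:
        return {"salpertime": 0.0, "confpertime": 0.0}
    return {
        "salpertime": salience_coverage / asr_duration,
        "confpertime": top_confidence / asr_duration,
    }
```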
The motivation for these SLU features is to make use of information that the SLU module has as a result of processing the output of ASR and the current discourse context. For example, for utterances that follow the first utterance, the SLU module knows what task it believes the caller is trying to complete. The context-shift feature incorporates this knowledge of the discourse history, with the motivation that if it appears that the caller has changed her mind, then the SLU module may have misunderstood an utterance.
Dialogue Manager: The function of the Dialogue Manager is to take as input the output of the SLU module, decide what task the user is trying to accomplish, decide what the system will say next, and update the discourse history (Abella and Gorin, 1999). The Dialogue Manager decides whether it believes there is a single unambiguous task that the user is trying to accomplish, and how to resolve any ambiguity.
Features based on information that the Dialogue Manager logged about its decisions, or features representing the ongoing history of the dialogue, might be useful predictors of SLU errors or task failure. Some of the potentially interesting Dialogue Manager events arise from low SLU confidence levels, which lead the Dialogue Manager to reprompt the user or confirm its understanding. A reprompt might be a variant of the same question that was asked before, or it could include asking the user to choose between two tasks that have been assigned similar confidences by the SLU module. For example, in the dialogue in Figure 2, the system utterance in S3 counts as a reprompt because it is a variant of the question in utterance S2.
The features that we extract from the Dialogue Manager are the task-type label (sys-label), whose set of values includes a value to indicate when the system had insufficient information to decide on a specific task-type; the utterance id within the dialogue (utt-id); the name of the prompt played to the user (prompt); and whether the type of prompt was a reprompt (reprompt), a confirmation (confirm), or a subdialogue prompt (subdial), a superset of the reprompts and confirmation prompts. The sys-label feature is intended to capture the fact that some tasks may be harder than others. The utt-id feature is motivated by the idea that the length of the dialogue may be important, possibly in combination with other features like sys-label. The different prompt features for initial prompts, reprompts, confirmation prompts, and subdialogue prompts are motivated by results indicating that reprompts and confirmation prompts are frustrating for callers, and that callers are likely to hyperarticulate when they have to repeat themselves, which results in ASR errors (Shriberg, Wade, and Price, 1992; Levow, 1998; Walker, Kamm, and Litman, 2000a).
The discourse history features included running tallies for the number of reprompts (num-reprompts), confirmation prompts (num-confirms), and subdialogue prompts (num-subdials) that had been played before the utterance currently being processed, as well as the corresponding running percentages (percent-reprompts, percent-confirms, percent-subdials). The use of running tallies and percentages is based on previous work suggesting that normalized features are more likely to produce generalized predictors (Litman, Walker, and Kearns, 1999). One additional feature, dial-duration, is available for identifying problematic dialogues as a whole, but not for classifying initial segments of the dialogue.
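The running tallies and percentages can be computed from the prompts played so far. The sketch below assumes each prior prompt is represented as a dictionary of boolean flags; that representation, and the function name, are our own:

```python
def discourse_history_features(prior_prompts: list) -> dict:
    """Running tallies and percentages of reprompts, confirmation prompts,
    and subdialogue prompts played before the current utterance.

    prior_prompts: assumed list of dicts with boolean keys 'reprompt',
    'confirm', and 'subdial' (subdial is also true for any reprompt or
    confirmation, since subdialogue prompts are a superset of both).
    """
    n = len(prior_prompts)
    feats = {}
    for flag, tally, pct in (("reprompt", "num-reprompts", "percent-reprompts"),
                             ("confirm", "num-confirms", "percent-confirms"),
                             ("subdial", "num-subdials", "percent-subdials")):
        count = sum(1 for p in prior_prompts if p.get(flag))
        feats[tally] = count
        feats[pct] = count / n if n else 0.0
    return feats
```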
Hand Labelling: As mentioned above, the features obtained via hand-labelling are used to provide a TOPLINE against which to compare the performance of the fully automatic features. The hand-labelled features include human transcripts of each user utterance (tscript), a set of semantic labels that are closely related to the system task-type labels (human-label), age (age) and gender (gender) of the user, the actual modality of the user utterance (user-modality) (one of: nothing, speech, touchtone, speech+touchtone, non-speech), and a cleaned transcript with non-word noise information removed (clean-tscript). From these features, we calculated two derived features. The first was the number of words in the cleaned transcript (cltscript-numwords), again on the assumption that utterance length is strongly correlated with ASR and SLU errors. The second derived feature was based on calculating whether the human-label matches the sys-label from the Dialogue Manager (SLU-success). This feature is described in detail in the next section.
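The two derived hand-labelled features can be sketched as follows. Whitespace tokenization and exact label equality are simplifying assumptions of this sketch; the actual matching procedure for SLU-success is described in the next section:

```python
def derived_hand_features(clean_tscript: str,
                          human_label: str,
                          sys_label: str) -> dict:
    """Derive cltscript-numwords and SLU-success from the hand labels.

    Illustrative sketch: whitespace tokenization and exact string
    equality are our own simplifications.
    """
    return {
        "cltscript-numwords": len(clean_tscript.split()),
        "SLU-success": human_label == sys_label,
    }
```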
In the experiments, the features in Figure 7, excluding the hand-labelled features, are referred to as the AUTOMATIC feature set. The experiments test how well misunderstandings can be identified and whether problematic dialogues can be predicted using the AUTOMATIC features. We compare the performance of the AUTOMATIC feature set to that of the full feature set, including the hand-labelled features, and we compare the performance of the AUTOMATIC feature set with and without the auto-SLU-success feature. Figure 8 gives an example of the encoding of some of the automatic features for the second exchange of the WIZARD dialogue in Figure 3. The prefix "e2-" designates the second exchange. We discuss several of the feature values here to ensure that the reader understands the way in which the features are used. In utterance S2 in Figure 3, the system says Sorry please briefly tell me how I may help you. In Figure 8, this is encoded by several features. The feature e2-prompt gives the name of that prompt, top-reject-rep. The feature e2-reprompt specifies that S2 is a reprompt, a second attempt by the system to elicit a description of the caller's problem. The feature e2-confirm specifies that S2 is not a confirmation prompt. The feature e2-subdial specifies that S2 initiates a subdialogue, and e2-num-subdials encodes that this is the first subdialogue so far, while e2-percent-subdials encodes that, of all the system utterances so far, 50% initiate subdialogues.
As mentioned earlier, we are also interested in generalizing our problematic dialogue predictor to other systems. Thus, we trained RIPPER using only features that are both automatically acquirable at runtime and independent of the HMIHY task. The subset of features from Figure 7 that fit this qualification are in Figure 9. We refer to them as the AUTO, TASK-INDEPT feature set. Examples of features that are not task-independent include rg-grammar, sys-label, prompt, and the hand-labelled features.