
Experimental Variables

To test our hypothesis about the benefits of personalization in the ADAPTIVE PLACE ADVISOR, we controlled two independent variables: the presence of user modeling and the number of times a user interacted with the system. First, because we anticipated that users might improve their interactions with the PLACE ADVISOR over time, we divided subjects into an experimental (modeling) group and a control group. The 13 subjects in the modeling group interacted with a version of the system that updated its user model as described in Section 3.4. The 11 subjects in the control group interacted with a version that did not update the model, but that selected attributes and items from the default distribution described in Section 3.1. Naturally, the users were unaware of their assigned group.

Second, since we predicted the system's interactions would improve over time as it gained experience with each user, we observed its behavior at successive points along this ``learning curve.'' In particular, each subject interacted with the system for around 15 successive sessions. We tried to separate each subject's sessions by several hours, but this was not always possible. In general, however, the subjects did use the system to help them decide where to eat, either that same day or in the near future; we imposed no constraints other than telling them that the system only knew about restaurants in the Bay Area.

To determine each version's efficiency at recommending items, we measured several conversational variables. One was the average number of interactions needed to find a restaurant accepted by the user. We defined an interaction as a cycle that started with the system providing a prompt and ended with the system's recognition of the user's utterance in response, even if that response did not answer the question posed by the prompt. We also measured the time taken for each conversation, which began when a ``start transaction'' button was pushed and ended when the system printed ``Done'' (after the user accepted an item or quit).

We also collected two statistics that should not have depended on whether user modeling was in effect. The first was the number of system rejections, that is, the number of times that the system either did not obtain a recognition result or had too low a confidence in it; in either case the system asked the user to repeat himself. Since this is a measure of recognition quality rather than an effect of personalization, we omitted it from the count of interactions. A second, more serious problem was a speech misrecognition error, in which the system assigned an utterance a different meaning than the user intended.

Effectiveness, and thus the subjective quality of the results, was somewhat more difficult to quantify. We wanted to know each user's degree of satisfaction with the system's behavior. One such indication was the rejection rate: the proportion of attributes about which the system asked but the subject did not care (REJECTs in ATTEMPT-CONSTRAIN situations). A second measure was the hit rate: the percentage of conversations in which the first item presented was acceptable to the user. Finally, we administered a questionnaire to users after the study to obtain more subjective evaluations.
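To make these conversational measures concrete, the following Python sketch shows one way the per-session statistics could be tabulated; the Session record, its field names, and the summarize function are hypothetical names introduced only for illustration, not part of the ADAPTIVE PLACE ADVISOR implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class Session:
    interactions: int           # prompt/recognition cycles, excluding system rejections
    duration_seconds: float     # from the "start transaction" press until the system prints "Done"
    attempted_constraints: int  # ATTEMPT-CONSTRAIN situations in the conversation
    rejected_constraints: int   # user REJECTs in those situations
    first_item_accepted: bool   # True if the first restaurant presented was accepted

def summarize(sessions: List[Session]) -> dict:
    """Aggregate the efficiency and effectiveness measures over a set of sessions."""
    n = len(sessions)
    total_attempts = sum(s.attempted_constraints for s in sessions)
    return {
        "mean_interactions": sum(s.interactions for s in sessions) / n,
        "mean_duration_s": sum(s.duration_seconds for s in sessions) / n,
        "rejection_rate": (sum(s.rejected_constraints for s in sessions) / total_attempts
                           if total_attempts else 0.0),
        "hit_rate": sum(s.first_item_accepted for s in sessions) / n,
    }

Given lists of such records for the modeling and control groups, calling summarize on each would yield the group-level averages of the kind compared in the next section.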