To test our hypothesis about the benefits of personalization in the
ADAPTIVE PLACE ADVISOR, we controlled two independent variables:
the presence of user modeling and the number of times a user
interacted with the system. First, because we anticipated that users
might improve their interactions with the PLACE ADVISOR over
time, we divided subjects into an experimental (modeling) group and a
control group. The 13 subjects in the modeling group interacted with a
version of the system that updated its user model
as described in Section 3.4.
The 11 subjects in the control group interacted with a
version that did not update the model, but that selected attributes
and items from the default distribution described in
Section 3.1. Naturally, the users were unaware of their
assigned group. Second, since we predicted the system's interactions
would improve over time, as it gained experience with each user, we
observed its behavior at successive points along this "learning
curve." In particular, each subject interacted with the system for
around 15 successive sessions. We tried to separate each subject's
sessions by several hours, but this was not always possible. However,
in general the subjects did use the system to help them decide where
to eat, either that same day or in the near future; we imposed no
constraints beyond telling them that the system only knew about
restaurants in the Bay Area.
To determine each version's efficiency at recommending items, we
measured several conversational variables. One was the average
number of interactions needed to find a restaurant accepted by
the user. We defined an interaction as a cycle that started with the
system providing a prompt and ended with the system's recognition of
the user's utterance in response, even if that response did not
answer the question posed by the prompt.
We also measured the time taken for each conversation: timing began
when the user pushed a "start transaction" button and ended when the
system printed "Done" (after the user accepted an item or quit).
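As a rough illustration only (not part of the original system), these two efficiency measures could be computed from logged sessions along the following lines; the data structures and field names are hypothetical:

    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class Cycle:
        # One prompt/response cycle; 'rejected' marks cycles in which the
        # recognizer returned no result or a low-confidence one and the
        # user was asked to repeat the utterance.
        rejected: bool = False

    @dataclass
    class Session:
        cycles: list          # all prompt/response cycles in the conversation
        start_time: float     # seconds; "start transaction" button pushed
        end_time: float       # seconds; system printed "Done"

    def interaction_count(session):
        # System rejections measure recognition quality, not personalization,
        # so they are excluded from the interaction count.
        return sum(1 for c in session.cycles if not c.rejected)

    def conversation_time(session):
        return session.end_time - session.start_time

    def average_efficiency(sessions):
        # Returns (mean interactions per conversation, mean conversation time).
        return (mean(interaction_count(s) for s in sessions),
                mean(conversation_time(s) for s in sessions))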
We also collected two statistics that should not have depended on whether
user modeling was in effect. The first was the number of system
rejections, that is, the number of times the system either failed to
obtain a recognition result or obtained one with too low a confidence.
In either case the system asked the user to repeat the utterance.
Since this measures recognition quality rather than the effects of
personalization, we omitted such cycles from the count of interactions.
The second statistic counted a more serious problem: speech
misrecognition errors, in which the system assigned an utterance a
meaning different from the one the user intended.
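For concreteness, the distinction between these two statistics could be expressed as a simple classification of logged recognition events; the names and confidence threshold below are illustrative, and the intended meaning would have to come from hand annotation, since the system cannot observe it directly:

    from enum import Enum, auto

    class Outcome(Enum):
        OK = auto()              # recognized with sufficient confidence and intended meaning
        REJECTION = auto()       # no result or low confidence; user asked to repeat
        MISRECOGNITION = auto()  # recognized, but not with the meaning the user intended

    def classify(result, confidence, intended_meaning, threshold=0.5):
        # Hypothetical tally rule for one recognition event.
        if result is None or confidence < threshold:
            return Outcome.REJECTION
        if result != intended_meaning:
            return Outcome.MISRECOGNITION
        return Outcome.OK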
Effectiveness, and thus the subjective quality of the results, was
somewhat more difficult to quantify. We wanted to know each user's
degree of satisfaction with the system's behavior.
One such indication was the rejection rate: the proportion of
attributes about which the system asked but the subject did not care
(REJECTs in ATTEMPT-CONSTRAIN situations). A second
measure was the hit rate: the percentage of conversations in
which the first item presented was acceptable to the user. Finally, we
also administered a questionnaire to users after the study to get more
subjective evaluations.
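The two effectiveness measures amount to simple proportions over the logged dialogues; a minimal sketch, again over hypothetical log representations, is:

    def rejection_rate(attempt_constrain_responses):
        # Proportion of ATTEMPT-CONSTRAIN questions the subject answered with
        # REJECT, i.e., attributes asked about that the user did not care about.
        rejects = sum(1 for r in attempt_constrain_responses if r == "REJECT")
        return rejects / len(attempt_constrain_responses)

    def hit_rate(conversations):
        # Percentage of conversations in which the first item presented
        # was acceptable to the user.
        hits = sum(1 for c in conversations if c["first_item_accepted"])
        return 100.0 * hits / len(conversations)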