In a previous user study [Pausch, 1990] we demonstrated that voice used in parallel with mouse input decreased task completion time for users of the popular Macintosh application MacDraw (Claris, version 1.9.6) by 21%. Our conjecture is that we reduced task completion time by reducing the amount of mouse motion required to access menu items.
Another way to reduce menu access time is to provide keyboard accelerators, sometimes called "hot keys" or "menu accelerator keys." In our previous study, we prohibited the use of keyboard accelerators. This paper presents a follow-up study where we test the hypothesis that voice input is faster than using keyboard accelerators. Keyboard accelerators reduced task completion time by 15% if the accelerators were memorized and by 10% if the accelerators were not memorized. Since adding voice input reduced task time by 21%, we conclude that voice input is a more effective input modality than keyboard accelerators, even with a relatively small set of commands.
Leggett and Williams [1984] performed one such study. Twenty-four subjects, twelve male and twelve female, used either voice or keyboard input to edit program segments. All subjects were novice voice users and experienced keyboard users. The study was broken into two types of tasks: input, where the subject would read written program text and enter it into the system, and edit, where the subject would make changes to program text already in the system. The voice system made use of forty vocabulary words which were trained five to six times each. Given the same time limit, the keyboard users were able to complete 70% of the task while voice input users were only able to complete 50%-55% of the task. Voice also had a lower error rate. The authors concluded that the main reason for the difference between the input modes was the inexperience of the users in using the voice equipment.
Martin [1989] performed a study using a VLSI chip design package. Since the VLSI design system made use of single-word input, key presses, and mouse clicks, it was chosen for this experiment so that speech input could be compared to a variety of other input modalities. This system would also test the overhead associated with voice input since the system is highly interactive, with between twenty and thirty commands typically executed per minute. Voice was expected to win because the subjects' hands would be free to perform other tasks. Seven subjects were used, but only four were able to contribute data. All had recently completed a graduate-level course in VLSI design. The voice system used was a discrete word recognizer with a head-mounted microphone. The voice system was configured so that it could be turned off and on with a voice command. The system had a vocabulary of one hundred twenty words. Two types of tasks were tested: structured and design. The structured tasks consisted of two tasks given to the subject, who had a fixed amount of time to complete each task. For the first task they were encouraged to use voice input; voice input was not available for the second task. For the design tasks, the subjects picked two tasks from a group of available tasks or came up with their own task. Once again, the subject was given a fixed amount of time to spend on each task. Speech input was found to be roughly equivalent to mouse clicks, but significantly better than keyboard input. When voice was used, the completion time was 24% faster than when single key presses were used and 108% faster than single-word typed commands. Overall, voice users were able to complete 62% of the tasks while for tasks without voice, only 38% of the tasks were completed. The error rate for subjects using voice input ranged between 8% and 12%.
Poock [1982] devised an experiment where either voice or keyboard input was used to supply commands to a computer system to perform simple tasks. Voice input was found to be 17% faster than typed input. Further, keyboard input produced significantly more errors than voice input.
Cochran, Riley, and Stewart [1980] had users enter connections between items in an electrical circuit. In this experiment, voice input took longer to perform the tasks than mouse input, but produced fewer errors.
Nye [1982] devised a baggage routing experiment. Users either entered a three-digit destination code via the keyboard or spoke the name of the destination city. Voice input was found to be faster and produced far fewer errors than the manual input. Errors for speech input were about 1% while errors for typed input ranged from 10% to 40%.
Haller, Mutschler, and Voss [1984] had subjects correct simple typing mistakes and move the cursor to various positions on the screen using either voice input or keyboard input. For cursor movement, voice was found to be worse than the other methods tested, namely light pen, graphic tablet, mouse, and cursor keys. Voice users had to speak the (x, y) coordinates of the new cursor position to move it. For error correction, voice was only tested against keyboard input. The keyboard was found to be slightly faster and less error-prone than speech input.
Visick, Johnson, and Long [1984] performed an experiment where users sorted a deck of cards that contained names of a destination city and entered them in sorted order into the computer system. The keyboard used had one key for each destination city and it was marked with the city's name. Since voice input users did not need to use their hands for input, they could sort and provide input at the same time. Voice input decreased the amount of time required for this operation by 37%.
These experiments can only determine whether voice input was faster than an alternative input mode, such as the mouse, for their particular application domain. The varying results (sometimes voice is faster and other times it is not) further demonstrate this fact. This limitation arises because the applications used were developed for the purposes of the experiment. In order to generalize the results, widely used applications need to be tested. The preceding studies also limit the user to one input mode or the other. User interface designers know that, in order to be practical, voice input will rarely be used as the only mode of input. To really test the practicality of voice input, studies need to test voice input in addition to other input modes on real systems.
A small number of studies have measured the combination of voice and other input modalities. Pausch and Leatherby [1990] used voice input to augment a graphical editor. Sixteen subjects were randomly divided into two groups. The first group used mouse input only and the second group used voice input in conjunction with mouse input. Each subject used their input mode to reproduce eight simple "line art" drawings from hard copy. The voice system used was speaker-dependent discrete word recognition. The system consisted of nineteen vocabulary words, each of which was trained five times by the subject. Voice users had an overall speedup of 21% as compared to mouse-only users. The voice recognition had an error rate of 5%.
Karl, Pettey, and Shneiderman [1992] used voice input to augment a word processor. Sixteen subjects, ten male and six female, were randomly divided into two groups. The first group used mouse input first and then repeated the trials with voice plus mouse. The second group used voice plus mouse first and then repeated the trials using mouse input only. The voice system used was speaker-dependent discrete word recognition. The system consisted of eighteen vocabulary words, each of which was trained at least three times by each of the subjects prior to beginning the study. Four tasks were selected for use with the word processor. In the first task, the subject used either the voice system or the mouse to re-format a document with six pre-defined styles (no typing was required in this task). In the second task, the subjects typed a short scientific formula which contained subscripts, superscripts, bold text, and Greek symbols. In the third task, the subjects built a table using the copy, paste, up, and down functions. In the final task, the subjects typed a short paragraph that consisted of subscripts, superscripts, italic, and bold text. Voice users were found to have a speedup of 19% over mouse-only input users. The voice recognition error rate was found to be 5%.
The subjects participated in two drawing sessions. In each session the subject first created a practice drawing and was then timed while creating four drawings. We used the same set of eight drawings from our previous study, so that we could compare the results. The drawings were chosen randomly from recent issues of Communications of the Association for Computing Machinery, Science, and the Journal of the American Institute of Chemical Engineers. We randomly selected drawings, instead of devising drawings specifically for the study, in order to avoid biasing the task. For each drawing, the subject started with a blank MacDraw screen and a printed copy of the artwork. The subject was allowed to study the artwork as long as desired before beginning the timed task.
The keyboard accelerators were constructed using Macro-Maker (Apple Computer Inc., version 1.0.2) for the Macintosh operating system (Apple Computer Inc., version 6.0.3). Following the standard Macintosh user-interface convention, all keyboard accelerators were invoked by holding down the "clover" key as a shift key, and then pressing a single keyboard key. The keyboard accelerators used in the study are shown in Table 1:
Some of the command names were modified from the earlier study in order to make them more mnemonic. In most cases, the letter used to activate the command is either the first letter of the command name or some letter that distinguishes the command from the others. The commands Cut, Paste, Select All, and Undo violate this convention, but were chosen to match the standard accelerators used by most Macintosh applications. Other commands that had accelerators provided by MacDraw were changed, if possible, to make them more mnemonic.
The average speedup per picture was 15% when the "advanced" group was compared to mouse input, and 13% when the "novice" group was compared to mouse input. This calculation ignores the fact that the individual pictures varied widely in complexity; by counting each picture's speedup equally in the average, we bias the result towards the simpler pictures. For example, a picture whose drawing time decreased from 20 seconds to 10 seconds would have a 50% reduction, and a picture whose drawing time decreased from 1000 seconds to 900 seconds would have a 10% reduction. Computing a 30% average reduction for these two drawings is technically correct, but a better measure of time reduction is obtained by dividing the sum of the reduced raw times by the sum of the original raw times. In this example, dividing 910 by 1020 yields 89.2%, or a 10.8% overall reduction in task time. When we perform this calculation, we find an overall time reduction of 15% when the "advanced" group is compared to mouse input only and 10% when the "novice" group is compared to mouse input.
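The difference between the two measures can be checked numerically. The following is a minimal sketch using only the two hypothetical drawings from the example above (20 to 10 seconds and 1000 to 900 seconds); the function names are illustrative and are not taken from the study's analysis.

```python
def mean_of_reductions(before, after):
    """Average the per-drawing percentage reductions (each drawing counts equally)."""
    reductions = [(b - a) / b * 100 for b, a in zip(before, after)]
    return sum(reductions) / len(reductions)


def aggregate_reduction(before, after):
    """Percentage reduction in summed raw time (each second counts equally)."""
    return (sum(before) - sum(after)) / sum(before) * 100


# Hypothetical drawing times in seconds, taken from the worked example in the text.
before = [20, 1000]
after = [10, 900]

per_drawing = mean_of_reductions(before, after)   # weights the short drawing heavily
overall = aggregate_reduction(before, after)      # dominated by the long drawing

print(f"mean of per-drawing reductions: {per_drawing:.1f}%")
print(f"reduction in total time:        {overall:.1f}%")
```

The first figure comes out at 30% and the second at roughly 10.8%, which illustrates why the aggregate measure is less sensitive to short, simple drawings.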
There were two drawings for which voice input did not yield a speedup when compared to one of the two groups. In both cases, there was a relatively large amount of text in the drawings, so the typing speed of the individuals became the dominant issue.
For the keyboard accelerators in this study, the user needed to hold the "clover" key down while pressing another key. For most keys, this was accomplished with one hand while the user kept his or her other hand on the mouse. For some accelerators, the user needed to use both hands, which required homing to the keyboard and then back to the mouse. While shifting can be avoided by using dedicated function keys, shifted keys are the standard mechanism for the Macintosh, so we used them. We also used a very small number of accelerator keys in this study; we expect that as the number of accelerator keys grows and the keystrokes become less obvious, the advantage that voice input provides will increase.
A final observation is that although we had expected memorization of the keyboard accelerators to be a large issue, it was not. The novice and advanced groups performed similarly. Although the command set contains seventeen distinct commands, only a small number of these were used frequently during the study. For the most part the novices learned these keys during the course of a single drawing.
Jim Leatherby is a Master's student in Computer Science at the University of Virginia, and is also a member of the technical staff at GE/Fanuc Automation, Inc. in Charlottesville, VA. He is a member of the Association for Computing Machinery, and his research interests include voice input, software engineering, and the proper construction of user studies involving human-computer interfaces.