For example, our first issue suggests that users have difficulty understanding three-dimensional space. We offer a set of strategies which may help users to better perceive a 3D virtual environment, including the use of spatial references, relative gesture, two-handed interaction, multisensory feedback, physical constraints, and head tracking. We describe interfaces which employ these strategies.
Our major contribution is the synthesis of many scattered results, observations, and examples into a common framework. This framework should serve as a guide to researchers or systems builders who may not be familiar with design issues in spatial input. Where appropriate, we also try to identify areas in free-space 3D interaction which we see as likely candidates for additional research.
An extended and annotated version of the reference list for this paper is available on-line via Mosaic at http://uvacs.cs.virginia.edu/~kph2q/.
Thus, rather than trying to identify issues which are applicable to all forms of 3D input, we restrict the present survey to interfaces that employ free-space input devices. Also, to maintain the focus of the survey, we do not discuss general techniques for graphical interaction, such as progressive refinement [4][11], nor do we describe algorithms to overcome artifacts of existing spatial input devices, such as techniques for filtering noise and lag from tracker data [1][42]. Instead we focus on issues which are specific to spatial interaction techniques.
Many results in spatial input are scattered across the literature, without an overall structure in which to view them. The interface designer is faced with numerous descriptions of applications and experiments, without order, organization, or a common nomenclature. There have been few publications which extract common themes from the available examples and studies, or which distill this information into practical suggestions. To make some headway on this problem, the present work seeks to synthesize many results into a common framework, in the form of a series of design issues.
The design issues we present are not well-formulated principles of design or ready-to-go solutions. Rather we present some issues to be aware of and some different approaches to try. Few of the design issues we present have been subjected to formal user studies, so they are supported only by possibly unrepresentative user observations. Nonetheless we believe the present survey of design issues will serve as a useful guide and starting point for the community of designers and researchers wishing to investigate spatial input.
Brooks offers many insightful observations about 3D interfaces in his 1988 SIGCHI plenary address [11]. Our hope is to supplement Brooks's observations with some additional issues which are described in the literature and which we have experienced in our research. We reference some of Brooks's observations, but the reader should be aware that many important issues presented in Brooks's paper are not covered by the present survey.
Nielsen's discussion of noncommand user interfaces [47] covers similar ground, but the scope of Nielsen's work is much broader than this survey. Nielsen's goal is to describe trends in advanced interface design, while by contrast, our goal is to discuss design issues in one class of advanced interfaces, those that employ 3D free-space input.
In general, people are good at experiencing 3D and at experimenting with spatial relationships between real-world objects, but we possess little innate comprehension of 3D space in the abstract. People do not innately understand three-dimensional reality; rather, they experience it.(1)
From a perceptual standpoint, we could argue that our difficulty in building stone walls, and in performing abstract 3D tasks in general, is a result of our sub-conscious, rather than conscious, perception of 3D reality. For example, the Shepard-Metzler mental rotation study [57] suggests that for some classes of objects, we must mentally envision a rigid body transformation on the object to understand how it will look from different viewpoints; that is, we must perceive the motion to understand the effect of the transformation.
Previous interfaces have demonstrated a number of strategies which may facilitate 3D space perception, including the use of spatial references, relative gesture, two-handed interaction, multisensory feedback, physical constraints, and head tracking. We now explain these strategies using examples drawn from existing 3D interfaces.
Badler [2] describes an interface where a stylus is used to control the position of a virtual camera. One version of the interface allows the user to indicate the desired view of an imaginary object using the stylus. Badler reports that "the lack of spatial feedback [makes] positioning the view a very consciously calculated activity."
Badler repeated the experiment with a real (as opposed to imaginary) object. He digitized a plastic spaceship and allowed the user to specify the virtual camera view of the corresponding wireframe spaceship by positioning and orienting the stylus relative to the real-world plastic spaceship. With this single change, Badler's "consciously calculated activity" suddenly became "natural and effortless" for the operator.
In general, to perform a task, the user's perceptual system needs something to refer to, something to experience. In 3D, using a spatial reference (such as Badler's plastic spaceship) is one way to provide this perceptual experience. More precisely, we define a spatial reference as a real-world object relative to which the user can gesture when interacting in 3D.
Ostby's system for manipulating surface patches [49] was a second early system to note the importance of spatial references. Ostby reported that "[locating] a desired point or area [is] much easier when a real object is sitting on the Polhemus's digitizing surface."
In Galyean's 3D sculpting interface [29], the user deforms a 3D model by positioning a single tracker in an absolute, fixed volume in front of a monitor. This leads to an interface which is not entirely intuitive. Galyean reports that "controlling the tool position is not easy. Even though the Polhemus pointer is held in a well-defined region, it is often difficult to correlate the position of the pointer in space with the position of the tool on the screen."
Compare this to Sachs's 3-Draw computer-aided design tool [54], which allows the user to hold a stylus in one hand and a palette in the other (both objects are tracked by the computer). These tools serve to draw and view a 3D virtual object which is seen on a desktop monitor. The palette is used to view the object, while motion of the stylus relative to the palette is used to draw and edit the curves making up the object.
3-Draw's use of the stylus for editing existing curves and Galyean's use of the "Polhemus pointer" for deforming a sculpture represent nearly identical tasks, yet the authors of 3-Draw do not report the difficulties which Galyean encountered. We attribute this difference to the palette-relative gesture employed by 3-Draw, as opposed to the abstract, absolute-space gesture required by Galyean's sculpting interface. As Sachs notes, "users require far less concentration to manipulate objects relative to each other than if one object were fixed absolutely in space while a single input sensor controlled the other" [54].
Thus, users may have trouble moving in a fixed, absolute coordinate frame. A spatial interface could instead base its interaction techniques upon relative motion, including motion relative to a spatial reference or the user's own body.
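To make the notion of relative motion concrete, the following sketch (in Python with NumPy; the pose representation and names are illustrative assumptions, not details of 3-Draw or any other cited system) expresses the stylus pose in the coordinate frame of a hand-held palette, so that drawing and editing depend only on how the two devices move relative to one another.

    import numpy as np

    def relative_pose(R_palette, t_palette, R_stylus, t_stylus):
        # Each tracker reports its pose in the fixed world frame as a
        # 3x3 rotation matrix R and a translation vector t.  The result
        # is the stylus pose expressed in the palette's frame, so an
        # editing operation driven by it is unaffected by where the
        # user happens to hold both devices in absolute space.
        R_rel = R_palette.T @ R_stylus
        t_rel = R_palette.T @ (t_stylus - t_palette)
        return R_rel, t_rel

Because the edit is a function of (R_rel, t_rel) only, moving the palette and stylus together leaves the interaction unchanged, and the user is free to work in whatever posture is comfortable.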
We have previously described an interface where users can manipulate virtual objects by moving real-world tools or "props" [35] which correspond to the virtual objects, and thus serve as spatial references. Based on our informal observations of test users at various stages of the design, we find that using any spatial reference is better than none. Even an abstract object, such as 3-Draw's palette, a rubber ball, or the user's other hand, can serve as a source for relative gesture. If the spatial reference corresponds closely to the virtual object, the user's tactile and kinesthetic feedback reinforce the visual illusion, but such correspondence is desirable rather than strictly necessary.
Two-handed input has often been viewed as a technique for improving the efficiency of human-computer interaction by enabling the user to perform two sub-tasks in parallel [15], rather than as sequentially selected modes. When interacting in three dimensions, we find that using two hands not only improves efficiency, but can also help make spatial input comprehensible to the user. For example, during informal observations of users of a virtual reality interface, we have noted that users of two-handed interaction are less likely to become disoriented than users who interact with only one hand [50].
Enabling the use of both hands can allow users to ground themselves in the interaction space; in essence the user's own body becomes a spatial reference. Regarding two-handed interaction in free space, Sachs observes that "the simultaneous use of two [spatial input] sensors takes advantage of people's innate ability--knowing precisely where their hands are relative to each other" [54]. Our informal observation of several hundred test users of a two-handed spatial interface for neurosurgical visualization [35] strengthens and reaffirms Sachs's observation: we find that most test users can operate the two-handed interface effectively within their first minute of use. This also reinforces findings by Buxton [15] and Kabbash [39] that users can transfer everyday skills for manipulating tools with two hands to the operation of a computer, with little or no training.
Even when manipulating just a single object in 3D, using two hands can be useful and natural. In a classic wizard-of-oz experiment, Hauptmann [33] observed test subjects spontaneously using two hands for single-object translation, rotation, and scaling tasks. Using two hands can also offer other practical advantages: it is often easier to grasp and rotate a spatial input device with two hands, and fatigue may be reduced since the hands can provide mutual physical support.
Guiard's analysis of human skilled bimanual action [32] provides an insightful theoretical framework for hypothesizing which classes of two-handed interfaces might improve performance without inducing additional cognitive load. Based on his observations of right-handed subjects, Guiard proposes that the hands form a kinematic chain: the non-dominant hand defines a dynamic frame of reference relative to which the dominant hand performs its finer-grained, more precise actions.
A key challenge facing spatial interaction is identifying the aspects of the proprioceptive senses that we can take advantage of when interacting in real space. Interacting with imaginary, computer-generated worlds can easily bewilder users; presumably, providing a wide range of sensory feedback might help the user to more readily perceive the virtual environment. Psychologist J. J. Gibson has long argued that information from a variety of feedback channels is crucial to our understanding of space [30].
Brooks [11] discusses interfaces which employ multisensory feedback techniques, including force feedback [12][36][46], space exclusion (collision detection), and supporting auditory feedback. To these techniques we add physical manipulation of tools with mass.
For example, we have experimented with a virtual reality interface in which the user wears a glove to grab and position a virtual flashlight. During public demo sessions, however, we found that users had inordinate difficulty grasping and manipulating the virtual flashlight with the glove. When we replaced the glove with a tracked physical flashlight, users could position the virtual flashlight with ease. For this application, physical manipulation of a real flashlight worked well, while glove-based manipulation of a virtual flashlight was a disaster.
We see several factors which can contribute to the ease-of-use of the physical manipulation paradigm, including the tactile and kinesthetic feedback of gripping a real object with mass and the tool's physical form, which suggests how it should be held and used.
For example, Schmandt describes an interface for entering multiple layers of VLSI circuit design data in a 3D stereoscopic work space [55]. The user enters the data by pressing a stylus on a stationary 2D tablet; the user can adjust the depth of the image so that the desired plane-of-depth lines up with the 2D tablet. Versions of the interface which constrained the 3D stylus position to lie on grid points via software mapping were less successful; the physical support of the tablet proved essential.
Other useful 2D constraining surfaces include the physical surface of the user's desk, the glass surface of the user's monitor, or even a hand-held palette or clipboard.
For example, we use a clipboard (held in the non-dominant hand) and a stylus (held in the dominant hand) in a virtual reality application which allows the user to edit the architectural layout of the room they are standing in [59]. The stylus is used to edit a miniature model of the room, which is seen on the virtual counterpart of the real-world clipboard. The clipboard provides a convenient work surface which can be moved out of the way when it is necessary to view the larger context, and also provides an effective metaphor for action-at-a-distance: the user can, for example, move an object on the opposite side of the room by moving its representation on the virtual clipboard. Based on our informal observations of users of this interface, we find that using a combination of physical and software constraints works well.
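The action-at-a-distance mapping can be sketched as a simple change of coordinates (hypothetical names and a uniform scale factor are assumed here; this is not the implementation of the cited system [59]): a point manipulated on the tracked clipboard is re-expressed in the miniature model's local frame and then scaled up to full room coordinates.

    import numpy as np

    def miniature_to_room(p_stylus, R_clip, t_clip, scale):
        # p_stylus: stylus tip position in room (world) coordinates.
        # R_clip, t_clip: pose of the clipboard (and the miniature model
        # attached to it) in room coordinates.
        # scale: room size divided by miniature size.
        p_model = R_clip.T @ (p_stylus - t_clip)   # point in the miniature's local frame
        return scale * p_model                     # corresponding full-scale room position

Dragging an object's miniature on the clipboard thus moves its full-scale counterpart, even when that counterpart is on the opposite side of the room.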
In a non-immersive spatial interface, desktop-based head tracking can allow the interface to "give back" some of the information lost by displaying 3D objects on a flat display, via head motion parallax depth cues. We merely note head tracking as a technique for spatial feedback; previous research [45][22][66][43] discusses the advantages of head tracking and the implementation details. An additional user study [51] shows performance improvement for a generic search task using an immersive head-tracked, head-mounted display vs. a non-head-tracked display.
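For the non-immersive (fish-tank) case, head-motion parallax is typically produced by recomputing an off-axis viewing frustum from the tracked head position each frame. The sketch below is only a sketch of that standard construction, not the method of any of the cited systems; it assumes a display modeled as a rectangle centered at the origin of the z = 0 plane and returns glFrustum-style parameters.

    def off_axis_frustum(head, screen_w, screen_h, near, far):
        # 'head' is the tracked eye position (x, y, z) with z > 0 in
        # front of a screen of size screen_w x screen_h centered at the
        # origin of the z = 0 plane.
        hx, hy, hz = head
        left   = (-screen_w / 2.0 - hx) * near / hz
        right  = ( screen_w / 2.0 - hx) * near / hz
        bottom = (-screen_h / 2.0 - hy) * near / hz
        top    = ( screen_h / 2.0 - hy) * near / hz
        return left, right, bottom, top, near, far

    # Each frame: read the head tracker, rebuild the frustum, and render
    # from an eye at 'head' looking toward the screen, so that head motion
    # produces the parallax cue otherwise lost on a flat display.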
The Jacob and Sibert study [37] compares user performance on two tasks: the first asks the user to match the (x, y, size) of two squares, while the second requires matching the (x, y, greyscale) of two squares. Both tasks require the control of three input dimensions, but Jacob reports that task performance time for the (x, y, size) task is best with a 3D position tracker, while performance for the (x, y, greyscale) task is best with a mouse (using an explicit mode to change just the greyscale).
Jacob argues that the 3D tracker works best for the (x, y, size) task since the user thinks of these as related quantities ("integral attributes"), whereas the mouse is best for the (x, y, greyscale) task because the user perceives (x, y) and (greyscale) as independent quantities ("separable attributes"). The underlying design principle, in Jacob's terminology, is that "the structure of the perceptual space of an interaction task should mirror that of the control space of its input device" [37].
This result points away from the standard notion of logical input devices. It may not be enough for the designer to know that a logical task requires the control of three input parameters (u, v, w). The designer should also know if the intended users perceive u, v, and w as related or independent quantities. In general it may not be obvious or easy to determine exactly how the user perceives a given set of input dimensions.
Most spatial input devices return six dimensions of input data, but this does not mean that all six dimensions should be used at all times. If, for example, the user's task consists only of orienting an object, it makes little sense to allow simultaneous translation, since this only makes the user's task more difficult: the user must simultaneously orient the object and keep it from moving beyond their field of view. Extraneous input dimensions should be constrained to some meaningful value.
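As a minimal sketch of this kind of dimensional constraint (hypothetical names; poses are assumed to be a rotation matrix plus a translation vector), the function below passes through only the degrees of freedom the task requires and pins the extraneous ones to meaningful defaults.

    def constrain_pose(R_tracker, t_tracker,
                       use_rotation=True, use_translation=False,
                       R_default=((1, 0, 0), (0, 1, 0), (0, 0, 1)),
                       t_default=(0.0, 0.0, 0.0)):
        # For an orientation-only task, the object follows the tracker's
        # rotation while its position stays pinned (e.g. at the center of
        # the viewing volume), so the user cannot inadvertently translate
        # it out of view while rotating it.
        R = R_tracker if use_rotation else R_default
        t = t_tracker if use_translation else t_default
        return R, t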
In general, it makes good common sense to exploit task-specific needs to reduce dimensionality. For example, the mouse-based interactive shadows technique [34] allows constrained movement in 2D planes within a 3D scene. If the user's task consists only of such constrained 2D movements, this may result in a better interface than free-space 3D positioning. Presumably this general strategy can scale to the use of spatial input devices.
Ware [65] identifies three basic control metaphors for 3D interaction: eyeball-in-hand, scene-in-hand, and flying vehicle control.
The selection of an appropriate control metaphor is very important: the user's ability to perform 3D tasks intuitively, or to perform certain 3D tasks at all, can depend heavily on the types of manipulation which the control metaphor affords. Brooks addresses this issue under the heading "metaphor matters" [11].
The term dynamic target acquisition refers to target selection tasks such as 3D point selection, object translation, object selection, and docking. As previously suggested, specifying a target based on the absolute (x, y, z) position of the tracker can be a fatiguing, consciously calculated interaction. Instead targeting can be based upon relative motion; options include movement of the user's hand relative to the user's body, relative to the user's other hand, relative to a real object, or relative to the starting point of the gesture.
We now present several issues related to dynamic target acquisition tasks.
Transparency is a good general technique to aid in dynamic target acquisition tasks, in large part because a semi-transparent tool or selection volume can overlap the target without hiding it from view.
Other example uses of transparency to aid target acquisition include use of a 3D cone for object selection [43], use of a semi-transparent plane for selecting cross-sections of a polygonal brain [35], and use of a semi-transparent tool sheet in the Toolglass interface [7].
Perhaps the most obvious way to implement point selection is to base it on the (x, y, z) position of the tracker, but in many circumstances 3D ray casting may be a superior strategy for selecting 3D points. Instead of directly specifying the 3D point, the spatial input device is used to shoot a ray into the scene, allowing the user to hold the input device in a comfortable position and rotate it to change the ray direction [43].
The 3D points selectable by casting a ray are constrained to lie on the surface of virtual objects in the scene. In many circumstances this is exactly what is desired. If it is necessary to select points on objects which are inside of or behind other objects in the scene, the ray casting can be augmented with a mechanism for cycling through the set of all ray-object intersection points.
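The following sketch illustrates ray-cast point selection under simple, illustrative assumptions (spheres stand in for scene objects, and all names are hypothetical): the tracker's position and forward axis define a ray, all ray-object intersections are gathered and sorted, and a cycle index steps through them so that points inside of or behind other objects remain reachable.

    import numpy as np

    def ray_sphere_hits(origin, direction, center, radius):
        # Return the ray parameters t (>= 0) at which the ray hits a sphere.
        d = direction / np.linalg.norm(direction)
        oc = origin - center
        b = np.dot(oc, d)
        disc = b * b - (np.dot(oc, oc) - radius * radius)
        if disc < 0.0:
            return []
        root = np.sqrt(disc)
        return [t for t in (-b - root, -b + root) if t >= 0.0]

    def select_point(tracker_pos, tracker_forward, spheres, cycle_index=0):
        # Cast a ray from the spatial input device and pick a hit point.
        # 'spheres' is a list of (center, radius) stand-ins for scene objects;
        # 'cycle_index' lets the user step through successive intersections
        # to reach occluded or interior surfaces.
        hits = []
        for center, radius in spheres:
            hits.extend(ray_sphere_hits(tracker_pos, tracker_forward, center, radius))
        if not hits:
            return None
        hits.sort()
        t = hits[cycle_index % len(hits)]
        d = tracker_forward / np.linalg.norm(tracker_forward)
        return tracker_pos + t * d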
For disconnected 3D points, 3D snap-dragging techniques [6] can be used if the disconnected points are related to existing objects in the scene. If the disconnected points are on the interior of objects, ray casting can be combined with a "cutting plane" operator, which is used to expose the interior of the objects [35][43].
Digitizing points on the surface of a real object is an instance where ray casting may not be helpful. In this case, the real object provides a spatial reference for the user as well as physical support of the hand; as a result, direct 3D point selection works well [49].
For gross object selection, ray casting may become less appropriate, especially if the object is distant. One could alternatively use a translucent 3D cone to indicate a region of interest; distance metrics can then be used to choose the closest object within the cone. Note that the "spotlighting" visual effects afforded by many graphics workstations can provide real-time feedback for this task.
We base this strategy on the implementation reported by Liang [43]. It is not presently clear if other strategies, such as using ray casting to sweep out a cone, might provide better results in some cases.
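The sketch below shows one way cone-based selection might be realized (objects are reduced to their centers and the metric is a simple one; this follows the general idea reported for JDCAD [43] but is not its actual implementation): objects falling within the cone around the device's forward axis are ranked, preferring those nearest the axis and then those nearest the apex.

    import numpy as np

    def select_in_cone(apex, axis, half_angle_rad, object_centers):
        # Return the index of the object whose center lies inside the
        # selection cone and scores best on the distance metric, or None.
        axis = axis / np.linalg.norm(axis)
        best, best_score = None, None
        for i, center in enumerate(object_centers):
            v = center - apex
            dist = np.linalg.norm(v)
            if dist == 0.0:
                continue
            angle = np.arccos(np.clip(np.dot(v / dist, axis), -1.0, 1.0))
            if angle > half_angle_rad:
                continue                  # outside the cone
            score = (angle, dist)         # prefer on-axis, then nearby
            if best_score is None or score < best_score:
                best, best_score = i, score
        return best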
At a low level, all spatial input devices provide the software with an absolute position in a global coordinate frame. The user interface should provide a recalibration mechanism for mapping this absolute position to a new logical position, which allows the user to specify a comfortable resting position in the real world as a center point for the interaction space. We are aware of three basic recalibration strategies:
Command-based: The user explicitly triggers a recalibration command, sometimes referred to as a "centering command" or a "homing command." JDCAD, for example, uses this strategy [43] to bring the 3D cursor to the center of the visible volume.
Ratcheting: Many spatial interfaces (e.g. [18], [64]) utilize the notion of ratcheting, which allows the user to perform movements in a series of grab-release cycles. (The user presses a clutch button, moves the input device, releases the clutch button, returns his or her hand to a comfortable position, and repeats the process).
Continuous: In some cases recalibration can be made invisible to the user. For example, in a virtual reality system, when the user moves his or her body or head, the local coordinate system is automatically updated to keep the interaction body-centric. Another example is provided by our desk-top system [35], where a tool held in the non-dominant hand defines a dynamic frame-of-reference relative to which other tools may be moved with the dominant hand. Based on informal observations of several hundred test users, we find that this technique is natural and intuitive.
These strategies can be composed. In a virtual reality application, for instance, the position of the hands will be continuously recalibrated to the current position of the head, but an object in the virtual environment might be moved about via ratcheting, or brought to the center of the user's field of view by a homing command.
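The sketch below illustrates how two of these strategies, a homing command and ratcheting, might be composed in a single logical-cursor mapping (hypothetical class and names; none of the cited systems necessarily works this way): while the clutch is held the cursor follows the device's relative motion, releasing the clutch lets the hand return to a comfortable position without moving the cursor, and the homing command recenters the cursor in the working volume.

    class LogicalCursor:
        # Maps absolute tracker positions to a logical 3D cursor position,
        # composing a command-based homing operation with ratcheting
        # (grab-release) relative motion.

        def __init__(self):
            self.cursor = (0.0, 0.0, 0.0)   # logical position shown to the user
            self.grab_device = None         # device position when the clutch engaged
            self.grab_cursor = None         # cursor position when the clutch engaged

        def home(self):
            # Homing command: bring the cursor back to the center of the
            # working volume, wherever the user's hand happens to be resting.
            self.cursor = (0.0, 0.0, 0.0)
            self.grab_device = self.grab_cursor = None

        def clutch_down(self, device_pos):
            self.grab_device = device_pos
            self.grab_cursor = self.cursor

        def clutch_up(self):
            # The hand may now be repositioned without moving the cursor.
            self.grab_device = None

        def update(self, device_pos):
            if self.grab_device is not None:
                # Clutch held: the cursor follows the device's relative motion.
                delta = tuple(p - g for p, g in zip(device_pos, self.grab_device))
                self.cursor = tuple(c + d for c, d in zip(self.grab_cursor, delta))
            return self.cursor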
Users may have difficulty controlling an interface which requires simultaneous, precise control of an object's position and orientation. The biomechanical constraints of the hands and arms prevent translations from being independent of rotations, so rotation will be accompanied by inadvertent translation, and vice versa. Even in the real world, we typically break down 6DoF tasks, such as docking, into two subtasks: translating to the location and then matching orientations [12].
The design hurdle is this: provide an interface which effectively integrates rapid, imprecise, multiple degree-of-freedom object placement with slower, but more precise object placement, while providing feedback that makes it all comprehensible. As Stu Card has commented, a major challenge of the post-WIMP interface is to find and characterize appropriate mappings from high degree-of-freedom input devices to high degree-of-freedom input tasks.
Applications such as 3-Draw [54] and abstractions such as Gleicher's snap-together math [31] make good initial progress toward providing constrained input in 3D, but we believe the general "spatial input constraint problem," and the issue of providing appropriate feedback in particular, is still a challenging area for future research.
Guiard's observations of subjects performing writing tasks [32] as well as observations of users of our two-handed interface [35] suggest that people tend to move their hands in a surprisingly small working volume. This volume is not only small, but also tends to move over time as the user changes body posture.
Guiard's analysis of handwriting tasks suggests that the writer tends to define an active volume relative to his or her non-dominant hand. Guiard also reports that "the writing speed of adults is reduced by some 20% when instructions prevent the nonpreferred hand from manipulating the page" [32].
This suggests that users of a spatial interface which requires movements relative to a fixed frame-of-reference in their environment may experience reduced task performance due to cognitive load, fatigue, or both. This also reinforces the possible importance of using relative gesture (section 1.2) and providing recalibration mechanisms (section 5).
It can be awkward and fatiguing to repeatedly switch between spatial input devices and traditional input devices such as mice and keyboards. Keyboards are especially problematic because they can get in the user's way. We have noted that users frequently rest their hands on the desk-top while manipulating spatial interface tools [35]; if the keyboard is present, it frequently entangles the cabling for the trackers or otherwise gets in the way.
Alternatives include mounting buttons directly on the spatial input devices themselves, using foot pedals, and using voice input for commands.
Most spatial interfaces incorporate some type of clutching mechanism, that is, a software mode which allows the spatial input device to be moved without affecting the 3D cursor. In our experience, some of the most confounding (for the user) and hard-to-fix (for the implementor) usability problems and ergonomic difficulties can arise due to poor clutch design.
For example, we have seen users struggle with many different clutch designs in our two-handed spatial interface [35]. In versions of the interface which used more than one clutch (one clutch was provided for each tool), users could operate the interface easily once the operation of the clutches was explained to them, but most users could not infer the operation of the clutches without any instruction. In versions of the interface which used an ill-placed or hard-to-press clutch button, users became fatigued in as little as five minutes of use. A clutch based on voice input also did not seem to work very well. Based on this experience, we suggest that a poor clutching interface can jeopardize the usefulness of spatial input.
As an example clutching mechanism, the University of North Carolina has constructed an input device which consists of a 3D tracker encased in a pool ball, which has a clutch button mounted on its surface [18]. When the user holds the clutch button down, the virtual object follows movements of the pool ball, and when the button is released, movement of the pool ball has no effect.
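A sketch of this style of clutch follows (hypothetical poses and names; this is not UNC's actual code): when the button goes down, the offset between the virtual object and the device is recorded; while the button is held, the object is carried along by composing the device's current pose with that offset; releasing the button leaves the object where it is.

    import numpy as np

    class GrabClutch:
        # Button-held clutch: the virtual object follows the device only
        # while the button is down, preserving the grasp offset so the
        # object does not jump when it is picked up.

        def __init__(self, R_obj, t_obj):
            self.R_obj, self.t_obj = R_obj, t_obj
            self.offset = None                      # (R_off, t_off), device -> object

        def button_down(self, R_dev, t_dev):
            # Record the object's pose expressed in the device's frame.
            self.offset = (R_dev.T @ self.R_obj, R_dev.T @ (self.t_obj - t_dev))

        def button_up(self):
            self.offset = None                      # further device motion has no effect

        def update(self, R_dev, t_dev):
            if self.offset is not None:
                R_off, t_off = self.offset
                self.R_obj = R_dev @ R_off
                self.t_obj = R_dev @ t_off + t_dev
            return self.R_obj, self.t_obj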
When a clutch button is mounted at a fixed location on a spatial input device, the user must have a fixed grip on the input device, to keep their fingers in a position to press the clutch button. Due to the kinematic constraints of the wrist, a fixed grip limits the possible rotations which can be performed. If arbitrary, large-angle rotations are required, the resulting interface can be very awkward. In such cases the clutch button should be separated from the input device. For example, one interface which requires arbitrary rotations uses a foot pedal as a clutch [35], allowing the associated spatial input device to be rotated with ease.
If the user's task seldom requires arbitrary rotation, it is preferable to mount the clutch button directly on the input device. Such a button, unlike the foot pedal, is visibly connected to the input device it controls, and its operation is therefore self-revealing.
Another alternative is to have no clutch button at all. If the interface provides a mechanism to take a snapshot of the screen, in some cases the need for clutching might be eliminated altogether.
Manipulating input devices in free space can easily fatigue the user. The designer of a spatial interface must take special pains to avoid or reduce fatigue wherever possible. A poor design risks degraded user performance, user dissatisfaction, and possibly even injury to the user. An exhaustive list of human factors requirements is beyond the scope of this paper, but we can make a few suggestions: let the user rest his or her hands on a physical work surface whenever possible, and provide recalibration mechanisms so that interaction can take place from a comfortable resting position.
Situations where the issues and strategies we have discussed work well, or where they do not work well, need to be better defined and characterized, and ultimately subjected to formal study. In contemplating formal studies of some of the observations herein, we have been struck by the apparent interdependency of it all: it is extremely difficult to devise experiments which will give insight into one specific phenomenon, without the results being confounded by other effects. Nonetheless, we welcome suggestions for formal experiments.
Multidimensional input is still a hard, unsolved problem, so we cannot hope that the present attempt to distill design issues will address every important issue; we are still learning something new every day. But we believe this paper is at least a good start, and we hope that in the future other researchers will be able to formulate more precise principles of design which will augment or supersede the preliminary results presented here.
Writing this paper has led us to ask many questions which we are currently unable to answer; these questions form an agenda for possible future research.