Proceedings of Knowledge Discovery in Databases, AAAI Press, August 1997, pp. 2-9.
Copyright 1997 AAAI.
Exploratory data analysis has been called data archaeology [Brachman et al. 1993] to emphasize that it requires a human analyst tightly in the loop who has high level goals in seeking useful information from the data. For goals such as ``find buying patterns we can use to increase sales'' it is impossible to formally characterize what constitutes an answer. A surprising observation may lead to new goals, which lead to new surprises, and in turn more goals. To facilitate this kind of iterative exploration, we would like to provide the analyst rapid, incremental, and reversible operations giving continuous visual feedback.
In some ways, the state of the art 30 years ago was closer to this goal of a tight loop between exploration, hypothesis formation, and hypothesis testing than it is today. In the era of slide rules and graph paper, the analyst's forced intimacy with the data bred strong intuitions. As computer power and data sets have grown, analysis retaining the human focus has become on-line analytical processing (OLAP). In contrast, data mining has emerged as a distinct interdisciplinary field including statistics, machine learning and AI, and database technology, but that does not include human-computer interaction (HCI). KDD recognizes the broader context in which data mining systems are used, and that the majority of the user's effort is in preparatory and evaluatory steps. HCI is thus an important part of an end-to-end KDD system[Fayyad, Piatetsky-Shapiro, & Smyth1996].
Visage provides a scripted rapid-prototyping environment for experimenting with interaction and visualization techniques. Unfortunately, the overhead of interpretation and first-class graphical objects limits the dataset size that can be manipulated rapidly. What we propose here are ways to attain initial intuitions about a large dataset from small samples; to create descriptions of the interesting subsets and attributes for input to cleaning and data mining applications; and to evaluate the discovered rules. We also offer a framework for how these distinct applications can be seamlessly integrated in a single information-centric interface.
A key innovation is the combination of extensional direct manipulation navigation with intensional query construction. The former allows fast opportunistic exploration of the sample, while the latter allows these operations to be carried out automatically on the full dataset. Below we survey the state of the art in interactive visualization systems and illustrate the use of Visage [Roth et al. 1997] on KDD tasks for a real-estate domain. We conclude by discussing the possibility of a fully integrated KDD system based on a system like Visage.
The interactive aspects of previous visualization systems that we make use of affect how datapoints in the visualization can change their appearance based on attributes of the domain objects they represent. In filtering the points corresponding to houses in a certain neighborhood could be made invisible. In brushing [Becker & Cleveland1987] the color of the points is changed. Datapoints can also be dragged and dropped to add or remove them from datasets, or to examine them in different tools. Direct manipulation interfaces are those in which operations are carried out by interacting with interface objects representing the semantic arguments of the operation. For instance, dragging an icon representing a file to a trash can is direct manipulation, while dialog boxes are not.
There are no previous systems that allow query construction by direct manipulation on domain-level data as well as schema-level data. This is largely because it is viewed as the job of the database management system (DBMS) to create a table with a given schema, and the visualization system's job to show all and only the domain-level data in the table. Visualizations are not viewed as full exploration interfaces themselves. Although some drill-down capability may be available, changing a query requires a distinct interface.
Further, each time a query is changed and re-run, a new table is sent to the visualization system, which won't know how to coordinate visualizations from the different tables even if the underlying conceptual objects are identical. For instance, if objects in one visualization are brushed/filtered it's difficult to identify the corresponding objects in other visualizations. The problem is compounded if the various visualizations originate from different database joins where the number of records differ.
Research is being done to increase the amount of exploration possible within a visualization via direct manipulation, such as brushing. Dynamic Query sliders [Ahlberg, Williamson, & Shneiderman1992] filter interface objects representing data objects whose attribute value falls outside the range selected by the slider. They are called ``queries'' because they not only allow interactive exploration, but also make explicit the range of selection. Further, by including histograms on the sliders, useful feedback occurs even in the absence of visualizations displaying every data point, so larger datasets can be handled interactively [Tanin, Beigel, & Shneiderman1996]. Related work has addressed filtering hierarchical information [Kumar, Plaisant, & Shneiderman1995] and the logical operators AND, OR, and NOT [Young & Shneiderman1992]. For KDD, where we want to reuse queries on different data, having an explicit query is very important. However these queries don't involve arithmetic functions of attributes and, more fundamentally, don't involve multiple objects and quantification, nor do they involve aggregation and aggregate functions.
There are some visual query languages, such as GQL [Papantonakis & King1995], that
can express these more complicated queries, but they are not
integrated in a drag and drop manner with visualization systems. Thus
the exploration loop requires the analyst to recall her sequence of
exploration operations and reproduce them in the query language, rerun
the query, and recreate any visualizations. The problem is much the
same as with any other type of disconnected application used in the
KDD process. With very few exceptions, the user is forced to manage
the transmission of data from one application to another,
increasing the overhead of exploration.
In addition to capabilities of previous interactive visualization systems, Visage also includes the Sage expert system [Roth et al. 1994] for automated design of sophisticated visualizations integrating many attributes. Visage is information centric [Roth et al. 1996], which means that data objects are represented as first class interface objects, which can be manipulated using a common set of basic operations, such as drill-down and roll-up, applicable regardless of where they appear: in a hierarchical table, a slide show, a map, or a query. Furthermore these graphical objects can be dragged across application boundaries. Allowing the visualization system direct access to the underlying database, rather than just isolated tables derived from it, is key to interface unification of visualization application with the other components of a KDD system.
A user can navigate from the visual representation of any database object to other related objects. For instance from a graphical object representing a real estate agent, all the houses listed by the agent could be found. It is also possible to aggregate database objects into a new object, which will have properties derived from its elements. For instance, we could aggregate the houses listed by John, and look up their average size. However these operations are only specified procedurally, that is by the sequence of direct manipulation operations - there is no explicit query. Once the user has navigated from John to his houses and aggregated them, he must repeat the operations to do the same for Jill's houses, or to repeat for John's houses next month when the data may have changed. Just as sliders visually indicate the current filtration range for parameterized queries, we would like a declarative and manipulable query representation that also applies to these structural operations. VQE is a visual query environment within Visage where operations are represented explicitly, allowing the analyst to construct complex queries during the analysis process for later reuse.
To illustrate the system we use a fictitious database of house sales based on data collected by a group of Pittsburgh real-estate agents describing clients and sales information for three neighborhoods for 1989. Figure 1 shows an entity-relationship (ER) diagram of the database. Data types are shown as rectangular nodes, and attributes of objects are listed inside the node. Relations between objects are shown as links. The substructure of the links is used to show functional dependencies. Links with three diverging lines indicate one-many relations; those with five lines diverging in both directions indicate many-many relations. A link with a single line would indicate a one-one relation. For instance, the way we have modeled the domain, only one person can be the buyer agent in a given sale, although many sales can have the same person as the buyer agent. The role served by a relation for an object is labeled next to the object. For instance, the relation between sales and houses serves the house-sold role of the sale, while it is the sold-in role of the house.
The remainder of the example, the query proper,
illustrates binding multiple objects so that their properties may be
visualized together, selecting those properties to be visualized and
the method by which they are visualized, filtering objects based on
their properties or on the properties of other objects, coordination
among visualizations derived from distinct queries, and definition of
new properties on the fly.
We begin by randomly selecting 100 houses from the database to scan
for obvious problems. Figure 1 shows a
copy of the House data type being dragged from the ER diagram into
the VQE frame, where it becomes a dynamic aggregate, so
called for reasons to be explained below.
The dynamic aggregate represents a set of houses, and
serves as a locus for browsing and querying operations on the elements
of the set.
Figure 1:Entity-Relationship diagram for the Real Estate domain. Right: The House data type has been dragged to QueryEnvironment (VQE), which creates a dynamic aggregate
containing all (100) houses in the dataset being explored. Histogram sliders have also been dropped on most attributes, but no filtering has been done, so all 100/100 houses are
selected.
First the analyst examines the attributes of the houses. Histograms
of attribute values are so useful that Visage includes a pre-built
visualization for this purpose that also includes a slider to control
selection. In the figure, histogram sliders have been dropped on all
the attributes except address. Often these displays show peaks
corresponding to bad data, for instance a peak at zero indicating
missing data. Here, however, the only surprise is the uneven
distribution of latitude and longitude. Given the analyst's domain
knowledge that the three neighborhoods are adjacent and compact, a
nearly uniform distribution is expected. To examine the house
locations in more detail, the analyst wants to visualize them on a
map, color coding by neighborhood. She uses the SageBrush
tool [Roth et al.
1994] to sketch the desired visualization
(figure 2, upper left).
The Create Picture button sends a
design request to the Sage expert system, which returns an
instantiated picture (figure 2). (In the absence of
a sketch, or with a partial sketch, Sage uses heuristics to design
pictures to facilitate specified analysis tasks on specified
attributes.) Except for a few outliers, the agent recognizes that the
colors of the graphical representation of the houses match her
knowledge of the location of the neighborhoods. In figure 2 (lower left) the analyst has dragged a copy of one
of the outliers to a table, to drill down textually to see what's
wrong. The house's address, 7516 Carriage, is recognized to be a
Point Breeze address, so why is the graphical mark misplaced? Zooming
into the map shows that the house is indeed on Carriage street. It
turns out that there are two Carriage streets in Pittsburgh, and the
address-to-location server randomly chose the wrong one. The other
outliers have the same problem.
Figure 2:Upper Left: SageBrush sketch for creating a map. The big square icon representing a map was dragged from the Map icon at left; the round mark
icon was dragged from the Mark icon at top. Attributes from the House dynamic aggregate were dropped on the mark's property icons. The Create
Picture button then instantiated the SagePicture in the frame on the right. Lower Left: One of the outliers on the map was copy-dragged to an
outliner to examine its address.
The analyst adjusts the latitude and longitude sliders to filter them
out. Figure 3 shows that the outliers have
disappeared from the map, the cardinality of the filtered set (96) has
been updated in the header of the House dynamic aggregate, and the
filtered (conditional) distribution of the other attributes is shown with
the dark histograms. The unconditioned histogram remains as a light
backdrop. All these updates happen continuously as
the slider is moved, making it easy to find good cutoff points.
Figure 3:By moving the latitude and longitude sliders, the outliers have been deselected.
Similar examination of sales and people would be done next, but for
the sake of brevity we skip them in this paper.
In our scenario, the analyst is interested in the effects of market
timing on buyers' decisions. That is, she is interested in seeing the
relationships among attributes of sales (eg selling-price), attributes
of people (eg age), and attributes of houses (eg neighborhoods). To make
this possible, she must first specify which houses go with which
sales, and which people go with which sales. In the navigation
paradigm, where exploration is done by following links between related
objects, she would select a particular house, use a menu to traverse
to a particular sale, and then traverse to a particular buyer. VQE
represents this schematically (figure 4) with a directed graph
structure, where the nodes are dynamic aggregates representing each
prototypical object and the links are the traversals. The direction
of the arrow indicates the direction of navigation, and the
direction-appropriate label is chosen (eg ``sold-in'' rather than
``house-sold''). Navigation operations on dynamic aggregates use
interface actions similar to ordinary navigation in Visage, and we
call it ``parallel navigation,'' because the result node is the
aggregate of the nodes reached by navigation from each
constituent of the origin aggregate. Thus in figure 4 the
analyst has navigated in parallel from 96 houses to 96 sales, and is
about to navigate to the (126) buyers in those 96 sales.
Figure 4:The analyst has already performed parallel navigation from the House dynamic aggregate to a Sale
dynamic aggregate, and is now dragging a navigation arrow to a spot where a third dynamic aggregate will be
placed.
Note that the map, which was designed before the House dynamic aggregate was connected to the Sale and Person dynamic aggregates, is coordinated with the charts. Coordination happens with sliders, drag and drop, and brushing. Figure 5 illustrates brushing, as the houses sold before March 1989 were brushed with a bounding box in the right chart, and the corresponding sales and houses are colored black on the left chart and on the map. The same interface objects can be shared across stages of the discovery process, and in fact the demarcation between stages is blurred. The extreme case is when we change the dataset. For instance the analyst could switch from looking at buyers in these sales to looking at sellers as follows: drag the Person dynamic aggregate to the trash, leaving a 2-node graph just as in figure 4, and then navigating again from sale but this time across the seller relation. Dropping a slider on the sellers birthdate attribute would then control visibility of houses on the map and sales in the existing charts, but now for houses that were sold by the person, rather than houses that were bought by the person.
Obviously, Squirrel Hill sales dominate the dataset. This is apparent from the histogram, the map, and the bar colors. However the map shows that it is also the largest neighborhood geographically, so this pattern is really not surprising. As a fraction of all houses in the neighborhood, the number sold in Squirrel Hill may not differ from that in the other neighborhoods.
Another interesting feature is the shape of the curve formed by the right ends of the date bars. It is steep from about May 1989 to October 1989, and flatter during the winters (January-March 1989 and November-December 1990). Again, this is a very familiar pattern to real estate agents.
Potentially of more interest are the bottom few and top few rows of
the price interval chart. While low price houses dominate overall,
and are pretty evenly distributed vertically, there are fewer at the
extremes, during the winter slowdowns. Moving the selling-price
slider to show only expensive houses (not shown), this impression is
confirmed, as more bars disappear from the middle of the charts, while
the top and bottom extremes remain more dense with bars. By moving
sliders to focus on various subsets of the data, the analyst finds
that many variables are correlated with selling price; expensive
houses take longer to sell, they seem to drop more in price (even as a
percentage), they are exclusively in Squirrel Hill (in the northern
half of Squirrel Hill in fact), and they are bought by older buyers.
In fact, while looking only at older buyers, there is an even more
pronounced increase in density at the date extremes than with
high-priced houses. This is a likely deeper explanation for the
price/date correlation. These patterns are difficult to convey in the
static pictures in this paper. Figure 6 shows the reuse of this exploration session on a subset of the data
including only older buyers. The age-dependence of the distribution
of sale dates is apparent - it is almost linear for these buyers,
with winters being, if anything, more active.
Figure 6:Noticing interesting patterns for older buyers, the analyst wants to reuse the previous visualizations,
but for only the 47 (/126) older buyers she has currently selected with the Birthdate slider. There are two ways
to remove the younger buyers from the QueryEnvironment, illustrating how negation is done in Visage. The
first option is to move the slider to invert the selection, and then drag the newly selected younger buyers to the
trash. Alternatively, the analyst can temporarily store the currently selected older buyers by dragging them to
the desktop, leaving only 79 younger buyers. These are then dragged to the trash, leaving an empty query and
visualizations. Finally the stored object representing the older buyers is dropped back onto the Person dynamic
aggregate. The visualizations automatically zoom in to the new ranges of attribute values as shown.
Back to figure 5, another interesting pattern is the
distribution of date bar lengths. One might expect that this
distribution would tail off for longer bars, but the long bars are
quite salient and it looks like the distribution is almost uniform.
There is no attribute corresponding to this interval, however, so
there is no histogram to look at. So the analyst defines one, using a
spreadsheet-like interface. The resulting histogram has a rather long
tail (figure 6), but nothing as surprising
as first appeared.
The primary point we want to illustrate is that interactive visualization systems have the potential to make it easy to explore data for tentative patterns. Here two visualizations immediately generated several hypotheses for deeper analysis, as well as patterns to steer a data mining algorithm away from. Coordination across multiple applications allows exploration to unfold continuously, without burdening the analyst with rerunning commands, regenerating visualizations at various stages, and still having them uncoordinated. These capabilities were enabled by a visual query language that forges an appropriate compromise between ease-of-use and expressive power, and by the dual extensional/intensional nature of dynamic aggregates. The queries required more expressive power than is available in previous visualization systems, yet were simple to construct by virtue of the analogy to direct manipulation operations in Visage. Queries involving multiple objects mirrors navigation, selecting properties to be visualized is done with a sketch, filtering is done with sliders and brushing, and arithmetic operations are entered as in spreadsheet formulas.
IDEA Selfridge96 is an interactive tool for exploring and analyzing business marketing data. It uses a random subset of the data for exploration, and retains a history of user queries connected by derivational and subsumption relations. Each query is specified relative to a root relational table, and can thus be reused on the full dataset. IDEA emphasizes range selection and aggregation operations (termed segmentation), giving a careful taxonomy and semantics for the latter. However it can only explore a single relational table, and it merely imports and exports data to external tools, rather than offering a common front end for them.
Visage/VQE uses this random subset idea, and has a
direct-manipulation interface (not described here) for specifying aggregation operations. It does not yet have a
history mechanism. Visage relies more heavily on a shared object
database optionally including metadata than does IDEA, although less
so than IDEA's predecessor, IMACS [Brachman et al.
1993].