Problem Assumption 1: To What Extent Is Performance of General Purpose Planners Biased Toward Particular Problems/Domains?

Next: Problem Assumption 2: How Up: Problem Assumptions Previous: Problem Assumptions

Problem Assumption 1: To What Extent Is Performance of General Purpose Planners Biased Toward Particular Problems/Domains?

Although most planners are developed as general purpose, the competitions and previous studies have shown that planners excel on different domains/problems. Unfortunately, the community does not yet have a good understanding of why a planner does well on a particular domain. We studied the impact of problem selection on performance in two ways.

First, we assessed whether performance might be positively biased toward problems tested during development. Each developer⁶ was asked to indicate which domains they used during development. We then compared each planner's performance on their development problems (i.e., the development set) to the problems remaining in the complete test set (rest). We ran 2x2 $\chi ^2$ tests comparing number of problems solved versus failed in the development and test sets. We included only the number solved and failed in the analysis as timed-out problems made no difference to the results⁷.

The results of this analysis are summarized in Table 3; Figure 1 graphically displays the ratio of successes to failures for the development and other problems. All of the planners except C performed significantly better on their development problems. This suggests that these planners have been tailored (intentionally or not) for particular types of problems and that they will tend to do better on test sets biased accordingly. For example, one of the planners in our set, STAN, was designed with an emphasis on logistics problems [Fox Long1999].

Table 3: $\chi ^2$ results comparing outcome on development versus other problems.

	Development		Rest
Planner	Sol.	Fail	Sol.	Fail	$\chi ^2$	P
A	48	56	207	1026	51.70	0.001
B	42	34	226	929	51.27	0.001
C	30	0	549	16	0.13	0.722
G	43	35	233	924	49.56	0.001
H	52	9	234	655	91.41	0.001
I	113	20	328	920	187.72	0.001
J	114	24	388	949	157.62	0.001
K	37	56	203	987	27.82	0.001
L	63	32	358	846	52.13	0.001

**Figure:** Histogram of ratios of success/failures for development and other problems for each of the planners.
$\begin{figure} \setlength{\epsfxsize}{4.5in} \centerline{\epsffile{development-rest.ps}} \end{figure}$

The above analysis introduces a variety of biases. The developers tended to give us short lists that probably were not really representative of what they actually used. The set used is a moving target, rather than stationary as this suggests. The set of problems included in experimentation for publication may be different still. Consequently, for the second part, we broadened the question to determine the effect of different subsets of problems on performance. For each of 10 trials, we randomly selected n domains (and their companion problems) to form the problem set. We counted how many of these problems could be solved by each planner and then ranked the relative performance of each planner. Thus, for each value of n, we obtained 10 planner rankings. We focused on rankings of problems solved for two reasons: First, each domain includes a different number of problems, making the count of problems variable across each of the trials. Second, relative ranking gets to the heart of whether one planner might be considered to be an improvement over another.

We tested values of 5, 10, 20 and 30 for n (30 is half of the domains at our disposal). To give a sense of the variability in size, at , the most problems solved in a trial varied from 11 to 64. To assess the changes in rankings across the trials, we computed rank dominance for all pairs of planners; rank dominance is defined as the number of trials in which planner x's rank was lower than planner y's (note: ties would count toward neither planner). The 13 planners in our study resulted in 78 dominance pairings. If the relative ranking between two planners is stable, then one would expect one to always dominate the other, i.e., have rank dominance of 10.

Table 4 shows the number of pairs having each value (0-10) of rank dominance for the four values for n. For a given pair, we used the highest number as the rank dominance for the pair, e.g., if one always has a lower rank, then the pair's rank dominance is 10 or if both have five, then it is five. Because of ties, the maximum can be less than five. The data suggest that even when picking half of the domains, the rankings are not completely stable: in 56% of the pairings, one always dominates, but 22% have a 0.3 or greater chance of switching relative ranking. The values degrade as decreases with only 27% always dominating for .

Table 4: Rank dominance counts for 10 samples of domains with domain sizes (n) of five through 30.

	Rank Dominance
n	0	1	2	3	4	5	6	7	8	9	10	Total Pairs
5	0	1	2	0	5	7	10	4	10	18	21	78
10	0	3	0	0	4	10	6	7	5	23	20	78
20	0	0	0	0	1	3	8	7	11	8	40	78
30	0	0	0	0	1	1	9	6	9	8	44	78

Next: Problem Assumption 2: How Up: Problem Assumptions Previous: Problem Assumptions