Comparative evaluation in planning has been significantly influenced and expedited by the Artificial Intelligence Planning and Scheduling (AIPS) conference competitions. These competitions have had the dual effect of highlighting progress in the field and providing a relatively unbiased comparison of state-of-the-art planners. When individual researchers compare their planners to others, time constraints typically force them to include fewer planners and fewer test problems than the competitions do.
To support the first competition in 1998 [McDermott 2000], Drew McDermott, with contributions from the organizing committee, defined a shared problem/domain definition language, PDDL (the Planning Domain Definition Language) [McDermott et al. 1998]. Using a common language means that planners' performance can be compared directly, without requiring hand translation or adjusting for different representational capabilities.
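For illustration, a minimal STRIPS-level domain and problem written in PDDL might look as follows; this toy "simple-move" domain is our own illustrative sketch, not one of the competition domains.

    ;; Illustrative PDDL domain: an object can move between connected locations.
    (define (domain simple-move)
      (:requirements :strips)
      (:predicates (at ?obj ?loc) (connected ?from ?to))
      (:action move
        :parameters (?obj ?from ?to)
        :precondition (and (at ?obj ?from) (connected ?from ?to))
        :effect (and (at ?obj ?to) (not (at ?obj ?from)))))

    ;; A problem instance stated in the same language.
    (define (problem simple-move-1)
      (:domain simple-move)
      (:objects robot room-a room-b)
      (:init (at robot room-a) (connected room-a room-b))
      (:goal (at robot room-b)))

Because every entrant parses the same syntax, such files can be handed to each planner unchanged.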
As a second benefit, the lack of translation (or at least of human-performed translation) meant that performance could be compared on a large number of problems and domains. In fact, the five competition planners were given a large number of problems (170 for the ADL track and 165 for the STRIPS track) within seven domains, including one domain that the planner developers had never seen prior to the competition. The first competition thus generated a large collection of benchmarks: the seven domains used in the competition plus 21 more that were considered for use. All 28 domains are available at ftp://ftp.cs.yale.edu/pub/mcdermott/domains/. The second competition added three novel domains to that set.
A third major benefit of the competitions is that they appear to have motivated researchers to develop systems that others can use. The number of entrants grew from five in the first competition to sixteen in the second. Additionally, all of the 1998 competitors and six of the sixteen 2000 competitors made their code available on web sites. Thus, others can perform their own comparisons.
In this paper, we describe the current practice of comparative evaluation as it has evolved since the AIPS competitions and critically examine some of the underlying assumptions of that practice. We summarize existing evidence about some of these assumptions and describe experimental tests of others that had not previously been considered. The assumptions are organized into three groups, corresponding to critical decisions in experiment design: the problems tested, the planners included, and the performance metrics collected.
Comparisons (as part of competitions or by individual researchers) have proven enormously useful in motivating progress in the field. Our goal is to understand the assumptions so that readers know how far the comparative results can be generalized. In contrast to the competitions, the community cannot legislate fairness in individual researchers' comparative evaluations, but readers may be able to identify cases in which results should be viewed either skeptically or with confidence. Thus, we conclude the paper with some observations and a call for considerably more research into new problems, metrics, and methodologies to support planner evaluation.
Also in contrast to the competitions, our goal is neither to declare a winner nor to critique individual studies. To draw attention away from such interpretations, we report results, whenever possible, using letter designators that were randomly assigned to the planners.