The 3rd International Planning Competition focussed on the issues of temporal planning and numeric resource handling. The competition was structured around two categories: fully-automated and hand-coded planning systems, and four main problem levels: STRIPS, NUMERIC, SIMPLETIME and TIME. There were eight domains, one of which was intended for the hand-coded planners only (the UM-TRANSLOG domain), and two solely for the fully-automated planners (the FreeCell and Settlers domains). Fourteen competitors took part, eleven in the fully-automated track and three in the hand-coded track. The domain description language used was PDDL2.1, an extension of the PDDL standard designed for modelling temporal and resource-intensive domains. PDDL2.1 is described in another paper in this issue (Fox & Long, 2003).
We collected a data set of some five and a half thousand data points distributed over the domains and levels. An initial plot of these points, in terms of the relative time and quality performances of the planners in the different domains, revealed a number of interesting patterns suggesting characteristics of the relative performances of the competitors within the competition domains. These patterns were presented and discussed at the AIPS conference with which the final competition events were co-located. However, other patterns, such as those indicating relative performances across domains and those showing the perceived difficulty of the competition domain/level combinations, were invisible in the data presented in this way. This paper presents the results of some detailed statistical analyses of the competition data, aimed at identifying some of these deeper patterns.
This paper explores three experimental themes. The first theme is aimed at answering the question: which planner should I buy? This question is concerned with which planner emerges as the strongest overall performer, rather than which produced the best results in a particular domain/level combination, and it can be asked from the point of view of both speed and quality criteria. To answer it we performed pairwise comparisons, based on the Wilcoxon matched-pairs (signed-rank) test, enabling the construction of partial orders on the competing planners in terms of their time and quality performances in the four levels of the competition. From these partial orders it can be concluded that, if a potential user is interested in good behaviour across a broad range of temporal and numeric problems, then LPG, amongst the fully-automated planners, and TLPLAN, amongst the hand-coded planners, are the best choices. Of course, if more specialised coverage is required and speed of solution is paramount then other choices might be made.
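To make the pairwise comparison concrete, the following is a minimal sketch of a Wilcoxon matched-pairs signed-rank test over the solution times of two planners on a common problem set. The planner timings are invented for illustration, and the normal approximation to the signed-rank statistic is a standard textbook simplification, not necessarily the exact procedure used in the paper's analysis. A significant z-score in one direction would contribute one edge to the partial order over planners.

```python
# Sketch: Wilcoxon matched-pairs signed-rank test with a normal
# approximation. Timings below are hypothetical, for illustration only.
import math

def wilcoxon_signed_rank(xs, ys):
    """Return (W+, z): the positive signed-rank statistic for the paired
    samples xs, ys, and its z-score under the normal approximation.
    Zero differences are dropped; tied |differences| share average ranks."""
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, (w_plus - mean) / sd

# Hypothetical solution times (seconds) of two planners on eight problems.
planner_a = [1.2, 3.4, 0.8, 9.1, 4.4, 2.0, 7.5, 5.1]
planner_b = [1.9, 3.1, 1.5, 12.0, 6.2, 2.8, 7.9, 6.0]
w, z = wilcoxon_signed_rank(planner_a, planner_b)
# |z| beyond roughly 1.96 indicates a significant difference at the 5% level,
# here favouring planner_a (it is faster on most instances).
print(w, round(z, 2))
```

In a full analysis one such test would be run for every pair of planners on the problems both solved, with an edge added to the partial order only when the difference is significant; planners left unordered by every test remain incomparable.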
The second theme considers the dependence of planner performance on domain structure. We were interested in exploring the extent to which the competing planners agreed about which domain/level combinations were hard and which were easy. The analysis we performed in addressing this question is a statistical complement to the theoretical analysis of domain topologies being carried out by Hoffmann. We considered the competition domains at all four levels used in the competition, whilst Hoffmann considers only the STRIPS subset of the competition domains (he also considers ADL domains, but we did not use any of these in the competition). It is interesting to note that our findings were broadly consistent with his conclusions.
The third theme considered the scaling behaviour of the competing planners. We considered two related issues: the extent to which the competing planners agreed on the relative difficulty of the problem instances within domain/level combinations, and the extent to which the planners scaled similarly in domain/level combinations where there was agreement. Our intention in pursuing the first issue was to provide an objective scale that would support our efforts to investigate the relative scaling behaviours of the planners. Because we found relatively little agreement over the perceived difficulty of problems within domain/level combinations, we were able to perform only a restricted comparison of relative scaling behaviour. However, we consider that the results we obtained make an interesting contribution to a deeper comparison of planner performances than is available from the raw domain-specific data.
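One natural way to quantify how far two planners agree on the relative difficulty of the instances in a domain/level combination is a rank correlation, such as Spearman's rho, computed over their solution times on the commonly solved problems. The sketch below illustrates this idea; it is not the paper's exact procedure, and the timings are invented.

```python
# Sketch: Spearman rank correlation between two planners' solution times,
# as one possible measure of agreement on relative problem difficulty.
# All data below are hypothetical.
def rank(values):
    """Average 1-based ranks of values; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation of the rank vectors of xs and ys."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical times of two planners on the same five problem instances:
# they agree on the three easiest problems but swap the two hardest.
times_a = [0.5, 1.1, 2.3, 8.0, 30.0]
times_b = [0.7, 0.9, 3.0, 40.0, 12.0]
rho = spearman_rho(times_a, times_b)
print(round(rho, 2))  # close to 1 means strong agreement on difficulty order
```

A rho near 1 across most planner pairs would justify treating problem indices as an objective difficulty scale; the low agreement reported above is what restricted the scaling comparison.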
There are many other questions that it would be interesting to be able to answer. Amongst these are questions about the extent to which hand-coding of domain knowledge really benefits a planner and the amount (and value) of effort involved in encoding such knowledge. This is a pressing question for the community and one that the competition series might be well-suited to answering. However, in order to pursue these questions in future competitions it will be necessary to design controlled experiments carefully, aimed at testing precise hypotheses. We have been restricted in this paper to the post hoc analysis of the competition data, and this has clearly restricted the kinds of questions we have been able to ask and answer. However, we hope that both the results and the methodologies presented here will be of interest to the planning community and that they will help to encourage further scientific evaluation of performance in the field.