As researchers tired of Blocksworld, many called for additional benchmark problems and environments. Mark Drummond, Leslie Kaelbling and Stanley Rosenschein organized a workshop on benchmarks and metrics [Drummond et al. 1990]. Testbed environments, such as Martha Pollack's TileWorld [Pollack & Ringuette 1990] or Steve Hanks's TruckWorld [Hanks et al. 1993], were used for comparing algorithms within planners. By 1992, UCPOP [Penberthy & Weld 1992] was distributed with a large set of problems (117 problems in 21 domains) for demonstration purposes. In 1995, Barry Fox and Mark Ringer set up a planning and scheduling benchmarks web page (http://www.newosoft.com/~benchmrx/) to collect problem definitions, with an emphasis on manufacturing applications. More recently, PLANET (a coordinating organization for European planning and scheduling researchers) has proposed a planning benchmark collection initiative (http://planet.dfki.de).
Clearly, benchmark problems have become a well-established means of demonstrating planner performance. However, the practice has known benefits and pitfalls; Hanks, Pollack and Cohen (1994) discuss them in some detail in the context of agent architecture design. The benefits include providing metrics for comparison and supporting experimental control. The pitfalls include a lack of generality in the results and the potential for the benchmarks to unduly influence the next generation of solutions. In other words, researchers will construct solutions that excel on the benchmarks, regardless of whether the benchmarks accurately represent the real applications of interest.
To obtain the benefits just listed, benchmark problems often are idealized or simplified versions of real problems. As Cohen (1991) points out, most research papers in AI, or at least at an AAAI conference, exploit benchmark problems; yet few of them relate the benchmarks to target tasks. This may be a significant problem; for example, in a study of flowshop scheduling benchmarks, we found that performance on the standard benchmark set did not generalize to performance on problems with realistic structure [Watson et al. 1999]. A study of Blocksworld problems found that the best-known Blocksworld benchmark problems are atypical in that they require only short plans and their optimal solutions are easy to find [Slaney & Thiebaux 2001].
In spite of these difficulties, benchmark problems and the AIPS competitions have considerably influenced comparative planner evaluations. For example, in the AIPS 2000 conference proceedings [Chien et al. 2000], all of the papers on improvements to classical planning (12 of the 44 papers at the conference) relied heavily on comparative evaluation using benchmark problems; the remaining papers concerned scheduling, specific applications, theoretical analyses or special extensions to the standard paradigm (e.g., POMDPs, sensing). Of the 12 classical planning papers, six used problems from the AIPS98 competition benchmark set, six used problems from Kautz and Selman's distribution of problems with blackbox [Kautz 2002], and three added some problems of their own as well. Each paper showed results on a subset of problems from the benchmark distributions (e.g., Drew McDermott's from the first competition), with the logistics, blocksworld, rocket and gripper domains being the most popular (used in 11, 7, 5 and 5 papers, respectively). The availability of planners from the competition was also exploited: eight of the papers compared their systems to other AIPS98 planners, namely blackbox, STAN, IPP and HSP (in 5, 3, 3 and 1 of the papers, respectively).