A fair planner comparison must account for likely biases in the problem set. Good performance on a certain class of problems does not imply good performance in general. A large performance differential for a planner with a targeted problem domain (i.e., one that does well on its focus problems and poorly on others) may well indicate that the developers have succeeded in optimizing the performance of their planner.
Recommendation 4: Problem sets should be constructed to highlight the designers' expectations about superior performance for their planner, and the designers should be specific about these selection criteria.
On the other hand, if the goal is to demonstrate across-the-board performance, then our results from randomly selecting domains suggest that biases can be mitigated.
Recommendation 5: If highlighting performance on ``general'' problems is the goal, then the problem set should be selected randomly from the benchmark domains.
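One possible reading of Recommendation 5 can be sketched in Python as follows. This is a minimal illustration, not the procedure used in the study: the function name `select_problem_set`, the two-stage scheme (draw domains at random, then draw problems within each chosen domain), and the domain and problem names are all hypothetical. Fixing and reporting the seed makes the selection reproducible by other researchers.

```python
import random


def select_problem_set(benchmark, n_domains, per_domain, seed):
    """Randomly pick domains, then randomly pick problems within each.

    `benchmark` maps a domain name to its list of problem names.
    All names here are placeholders, not real benchmark contents.
    """
    rng = random.Random(seed)  # private generator, so the seed fully fixes the draw
    # Sort first so the result does not depend on dictionary insertion order.
    chosen_domains = rng.sample(sorted(benchmark), n_domains)
    problem_set = {}
    for domain in sorted(chosen_domains):
        problems = sorted(benchmark[domain])
        k = min(per_domain, len(problems))  # small domains contribute all they have
        problem_set[domain] = sorted(rng.sample(problems, k))
    return problem_set


# Hypothetical benchmark: three placeholder domains with placeholder problems.
benchmark = {
    "domain_a": [f"a-{i:02d}" for i in range(1, 11)],
    "domain_b": [f"b-{i:02d}" for i in range(1, 7)],
    "domain_c": [f"c-{i:02d}" for i in range(1, 5)],
}

selection = select_problem_set(benchmark, n_domains=2, per_domain=3, seed=7)
for domain, problems in selection.items():
    print(domain, problems)
```

Drawing the same number of problems from each chosen domain keeps a domain with many problems from dominating the set; whether that weighting is appropriate depends on the claim being tested.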