To evaluate scaling behaviour, we first explore whether the competing planners agree on what makes a problem hard within a particular domain and level. It might seem straightforward to ensure that a problem set consists of increasingly difficult problems, for example by generating instances of increasing size, but in practice this is hard to achieve. Problem size and difficulty do not appear to be strongly correlated, whether size is measured by the number of objects, the number of relations or even the number of characters in the problem description. A coarse relationship can be observed, since very large instances take more time to parse and to ground, but small instances can sometimes present more difficult challenges than large ones. This indicates that factors other than size are important in determining whether planners can solve individual instances.
In summary, the hypotheses explored in this section are:
Null Hypothesis: The planners differ in their judgements about which individual problem instances are hard within a given domain/level combination.
Alternative Hypothesis: The planners demonstrate significant agreement about the relative difficulties of the problem instances within any given domain/level combination.
In this section we are concerned specifically with a within-domain/level analysis: whether the planners agree on the relative difficulty of the problem instances within a given domain/level combination.
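As an illustration only, agreement among several planners' difficulty rankings could be quantified with Kendall's coefficient of concordance, W, which is 1 when all rankings coincide and 0 when the rank sums show no shared ordering. The sketch below is not necessarily the test applied in this analysis. It assumes that solution time is the proxy for difficulty, that every planner solved every instance and that there are no tied times. The function names and the timing data are hypothetical.

```python
def rank(values):
    """Return 1-based ranks (smallest value gets rank 1). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def kendalls_w(times_by_planner):
    """Kendall's W for m planners over n instances (no tie correction).

    times_by_planner: one list of solution times per planner, all of the
    same length, with instances in the same order (an assumption of this
    sketch, not something stated in the text).
    """
    m = len(times_by_planner)        # number of planners (judges)
    n = len(times_by_planner[0])     # number of problem instances
    rankings = [rank(times) for times in times_by_planner]
    # Sum of the ranks each instance received across all planners.
    rank_sums = [sum(r[j] for r in rankings) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def concordance_statistic(w, m, n):
    """m(n-1)W, approximately chi-square with n-1 degrees of freedom."""
    return m * (n - 1) * w


# Hypothetical times (seconds) for three planners on four instances.
agreeing = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
    [0.5, 0.6, 0.7, 0.8],
]
# Two hypothetical planners that rank three instances in opposite orders.
opposed = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
]

w_agree = kendalls_w(agreeing)
w_opposed = kendalls_w(opposed)
```

Under these assumptions, a value of W close to 1 for a domain/level combination would count as evidence against the null hypothesis, with significance judged by comparing `concordance_statistic` against the chi-square distribution. The chi-square approximation is unreliable for very small numbers of instances, so exact tables would be preferable there.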