To complement the Wilcoxon tests, we perform additional analyses to determine whether, for a given pair of planners, the magnitude of the difference in their performance is statistically significant. We perform paired t-tests (see Appendix C) on a subset of the pairs of planners. We focus on those pairs for which consistent significant differences were identified, because we do not consider it meaningful to compare magnitudes for planners where no consistent domination is exhibited. We also restrict our attention to the planners that were, according to the Wilcoxon tests, the most impressive performers at each of the competition levels. We perform separate tests for speed and quality.
When investigating the magnitude of differences in performance, it is not meaningful to include problems that at least one of the planners being compared failed to solve. Assigning infinite time or quality measures to unsolved problems would produce a magnitude value that grossly distorts the true picture. For the magnitude tests we therefore consider only double hits, that is, problems solved by both planners. The price we pay is an undesirable emphasis on the smaller and easier problems, since these are the ones most frequently solved by both planners. This should be borne in mind when interpreting the data.
The hypotheses being considered in this section are:
Null Hypothesis: There is no consistent magnitude difference between the performances of the planners.
Alternative Hypothesis: Planners that demonstrate significant differences in consistency of performance also demonstrate corresponding magnitudes in the differences between their performances.
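As an illustration of the procedure described above, the following is a minimal sketch of a paired t-test restricted to double hits. It is not the code used for the analysis. The function name `paired_t_on_double_hits`, the representation of results as dictionaries mapping problem names to values (with `None` marking an unsolved problem), and the use of raw, unnormalised differences are all assumptions made for the example, as are the data values.

```python
import math
from statistics import mean, stdev


def paired_t_on_double_hits(results_a, results_b):
    """Paired t statistic over problems solved by both planners.

    results_a, results_b: dict mapping problem name -> measured value
    (time or quality), or None if the planner did not solve the problem.
    This data layout is hypothetical, chosen only for the sketch.
    Returns (t, degrees_of_freedom, number_of_double_hits).
    """
    # Keep only double hits: problems with a value for both planners.
    double_hits = sorted(
        p for p in results_a.keys() & results_b.keys()
        if results_a[p] is not None and results_b[p] is not None
    )
    # Per-problem differences (assumption: raw, unnormalised values).
    diffs = [results_a[p] - results_b[p] for p in double_hits]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two double hits")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1, n


# Made-up example data: p4 and p5 are each unsolved by one planner,
# so only p1, p2 and p3 count as double hits.
planner_a = {"p1": 1.0, "p2": 2.0, "p3": 4.0, "p4": None, "p5": 9.0}
planner_b = {"p1": 2.0, "p2": 4.0, "p3": 5.0, "p4": 7.0, "p5": None}

t, df, n = paired_t_on_double_hits(planner_a, planner_b)
print(f"t = {t:.3f}, df = {df}, double hits = {n}")
# prints: t = -4.000, df = 2, double hits = 3
```

The absolute value of t would then be compared against the critical value of the t distribution with the reported degrees of freedom at the chosen significance level. A statistics library such as SciPy (`scipy.stats.ttest_rel`) computes the same statistic together with a p-value.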