The goal of loop splitting is to isolate the cache miss instances while
introducing as little instruction overhead as possible.
To quantify the advantage of loop splitting, we implemented
the naive alternative for isolating cache miss instances: placing
conditional statements inside the loops. Figure
shows that for 5 of the 13 benchmarks (MXM, BTRIX, VPENTA, CG, and MG),
the performance advantage of loop splitting is greater than 25%.
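To make the contrast concrete, the sketch below shows both approaches for a loop whose references miss once per cache line. The parameters are illustrative assumptions rather than values from the benchmarks: a line holds 4 doubles, prefetches are issued 4 lines ahead, and the GCC/Clang builtin __builtin_prefetch stands in for the machine's non-binding prefetch instruction.

```c
#include <stddef.h>

/* Illustrative parameters (assumptions, not taken from the paper):
 * a cache line holds 4 doubles; prefetch 4 lines (16 elements) ahead. */
#define LINE   4
#define AHEAD 16

/* Naive isolation: test the miss predicate on every iteration.
 * Each iteration pays for the modulo, compare, and branch even though
 * only one iteration in LINE actually issues a prefetch.              */
double sum_naive(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE == 0)                      /* leading word of a line    */
            __builtin_prefetch(&a[i + AHEAD]);  /* non-binding: never faults */
        sum += a[i];
    }
    return sum;
}

/* Loop splitting, realized here by unrolling by the line size: the miss
 * instance is isolated statically, so no predicate is evaluated at run
 * time.  (Cleanup code for n not divisible by LINE is omitted.)        */
double sum_split(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i + LINE <= n; i += LINE) {
        __builtin_prefetch(&a[i + AHEAD]);
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    return sum;
}
```

In the split version the miss predicate is resolved at compile time, so the only per-prefetch overhead left is the prefetch itself and its address computation, which is precisely what the overhead numbers below measure.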
A good measure of the success of loop splitting is the instruction overhead per prefetch issued. Ideally, isolating the cache miss instances will not increase the instruction overhead. One of the advantages of having implemented the prefetching schemes in the compiler is that we can quantify this instruction overhead. Previous studies have only been able to estimate instruction overhead [9].
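Stated as a formula (our restatement of the metric, not an equation from the text), with $I_{\mathrm{pf}}$ and $I_{\mathrm{orig}}$ the dynamic instruction counts with and without prefetching and $P$ the number of prefetches issued:
\[
  \text{overhead per prefetch} \;=\; \frac{I_{\mathrm{pf}} - I_{\mathrm{orig}}}{P}.
\]
Note that the metric is sensitive to the denominator: a scheme that sharply reduces $P$ can show a high overhead per prefetch even when the absolute number of added instructions is small, a point that matters for MG below.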
Table shows the number of instructions required to
issue each prefetch. For the indiscriminate prefetching scheme, the
overhead per prefetch ranges from 1 to 4 instructions. This is well within
the bounds of what one would intuitively expect. One of the instructions is
the prefetch itself, and the rest are for address calculation. For the
selective prefetching scheme, we show overhead both with respect to the
original code and with respect to code where the loop splitting
transformations have been performed but no prefetch instructions are
inserted. Loop splitting generally increases the overhead per prefetch. In
a few cases, the overhead with respect to the original code is actually
negative, due to the savings from loop unrolling (MXM, IS, and CG). In
two cases (OCEAN and MG), the overhead becomes quite large: more than
17 instructions per prefetch. In the case of OCEAN, the loop bodies are
quite large, and the combination of loop unrolling and software pipelining
makes it necessary for the compiler to spill registers. The penalty for
register spills is averaged over just the prefetches, and this penalty can
become quite high. In the case of MG, the number of prefetches has been
drastically reduced (by a factor of 21). Averaging the loop and
transformation overheads over this much smaller number of prefetches
therefore yields a high overhead per prefetch. In most cases, however,
the overhead per prefetch remains low: 5 or fewer instructions with
respect to the original code for 9 of the 13 benchmarks.
Finally, having predicted and isolated the dynamic miss instances, the last step in our algorithm is to schedule each prefetch sufficiently far ahead of its reference using software pipelining.
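As a minimal sketch of this last step, assuming the same illustrative line size as before and a prefetch distance of two unrolled iterations (again an assumption, not a value from the paper), the split loop above can be software pipelined into a prolog that only issues prefetches, a steady-state loop that prefetches data for a future iteration while computing on data fetched earlier, and an epilog that computes without prefetching:

```c
#include <stddef.h>

#define LINE 4   /* doubles per cache line, as in the earlier sketch       */
#define DIST 2   /* prefetch distance, in unrolled iterations (assumption) */

double sum_pipelined(const double *a, size_t n)
{
    double sum = 0.0;
    size_t i;

    /* Prolog: issue prefetches for the data the first DIST steady-state
     * iterations will consume, without doing any computation yet.        */
    for (i = 0; i < DIST * LINE && i < n; i += LINE)
        __builtin_prefetch(&a[i]);

    /* Steady state: prefetch the line needed DIST iterations from now
     * while computing on data that was prefetched DIST iterations ago.   */
    for (i = 0; i + DIST * LINE + LINE <= n; i += LINE) {
        __builtin_prefetch(&a[i + DIST * LINE]);
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }

    /* Epilog: finish the remaining iterations without issuing further
     * prefetches; the final steady-state iterations have already
     * prefetched most or all of this data.                               */
    for (; i < n; i++)
        sum += a[i];

    return sum;
}
```

In practice the prefetch distance would be chosen so that the latency of a miss is covered by the work of the intervening iterations; too small a distance leaves latency exposed, while too large a distance risks the prefetched data being displaced from the cache before it is used.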