The total instruction overhead depends not only on how frequently
prefetches are issued, but also on how many instructions are necessary to
issue each prefetch. Ideally, only a single instruction would be needed for
each prefetch: the prefetch instruction itself. However,
Table shows that several additional instructions are
often needed to generate prefetch addresses. In some cases, as many as 20
extra instructions were added for each prefetch. From our experience with
the SUIF compiler, the primary cause of these large overheads is
register spilling, which occurs whenever the register allocator runs out
of registers and must ``spill'' values by saving and restoring them from
memory. Once spilling occurs, the instruction count within a loop body
can increase dramatically.
One of the keys to avoiding register spilling is to reuse addressing
registers between prefetches and loads and stores whenever possible. This
works because prefetch addresses are often separated by a constant distance
from load or store addresses. This difference can be folded into the
constant offset field in the ``base-plus-offset'' addressing mode.
Therefore the prefetch can be issued without consuming any additional
registers, as we illustrated earlier in Figure . The
importance of this optimization was demonstrated during our early
experience with the compiler, when the scalar optimizer had not yet
implemented this optimization. The instruction overhead was quite large in
this early code because of the large amount of register spilling. Once the
scalar optimizer began to exploit these common base registers, the
overheads were dramatically reduced in many cases.
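To make the base-register reuse concrete, the following sketch shows the shape of code the compiler aims to produce. The GCC/Clang __builtin_prefetch intrinsic and the prefetch distance of 16 elements are illustrative assumptions, not details taken from our implementation:

```c
#include <stddef.h>

/* Hypothetical prefetch distance, in array elements. */
#define PF_DIST 16

/* Sum an array while prefetching ahead.  The prefetch address
 * a + i + PF_DIST differs from the load address a + i by a
 * compile-time constant, so a compiler can keep a single base
 * register for both accesses and fold PF_DIST into the offset
 * field of a base-plus-offset addressing mode -- the prefetch
 * consumes no additional address register. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* GCC/Clang intrinsic; other compilers would emit the
         * target's prefetch instruction directly. */
        __builtin_prefetch(&a[i + PF_DIST]);
#endif
        s += a[i];
    }
    return s;
}
```

When the offset is folded this way, the prefetch adds one instruction per iteration rather than one instruction plus an address computation, which is exactly the single-instruction ideal described above.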
The second potential cause of register spilling is the loop unrolling that
our algorithm performs to exploit spatial locality. Once a loop is
unrolled, the compiler can optimize across the several replicated copies of
the loop body. In most cases this allows the compiler to reduce the overall
instruction count for the following reasons: (i) branch instructions can be
eliminated between unrolled iterations; (ii) it is more likely that
``hazard'' slots after multi-cycle operations can be filled with
independent instructions; and (iii) register allocation can be optimized
across the unrolled iterations, possibly eliminating loads and stores. One
such example is MXM, where the savings through loop unrolling resulted in
fewer instructions in the code with prefetching than in the original code
(see Figure ). On the other hand, the potential
downside of loop unrolling is creating too many intermediate values for the
register allocator to handle, thus resulting in register spilling. This
occurred in the uniprocessor version of OCEAN, where each prefetch ended up
costing 18 instructions. This number is so high because register spilling
occurs for a large number of values in the loop, and these additional
instructions are averaged over only the small number of prefetches. The
result of this spilling is that OCEAN's instruction overhead is noticeably
large despite the fact that misses occur infrequently (only once every 46
instructions). One way for the compiler to avoid these register-spilling
problems is to perform loop unrolling with a greater awareness of its
effect on register pressure. In our algorithm, we only suppressed loop
unrolling to avoid code explosion, and did not take register pressure into
consideration.
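The combined unrolling-plus-prefetching transformation can be sketched as follows. The prefetch distance, the four-elements-per-cache-line assumption, and the __builtin_prefetch intrinsic are all illustrative choices, not values from our experiments:

```c
#include <stddef.h>

#define PF_DIST    16  /* hypothetical prefetch distance (elements)   */
#define LINE_ELEMS 4   /* doubles per cache line, assuming 32B lines  */

/* Loop unrolled by the cache-line size so that only one prefetch is
 * issued per cache line rather than per element.  The loop-back
 * branch now executes once per four elements, and the replicated
 * bodies give the scheduler independent instructions to fill hazard
 * slots -- the benefits (i)-(iii) noted above.  The cost is four
 * live partial computations per iteration, which is how unrolling
 * raises register pressure. */
double sum_unrolled(const double *a, size_t n) {
    double s = 0.0;
    size_t i = 0;
    for (; i + LINE_ELEMS <= n; i += LINE_ELEMS) {
#if defined(__GNUC__)
        __builtin_prefetch(&a[i + PF_DIST]);
#endif
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)  /* epilogue for leftover elements */
        s += a[i];
    return s;
}
```

An unrolling-aware register allocator would weigh the savings in branches and prefetches against the extra live values created by the replicated bodies before choosing the unroll factor.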
Although these techniques for avoiding register spilling are strictly compiler-based, we will now discuss ways to further reduce instruction overhead that also require architectural support.