We implemented our prefetching algorithm in the SUIF (Stanford University Intermediate Form) compiler, which includes many of the standard optimizations and generates code competitive with the MIPS 2.10 compiler [80]. The experience of actually implementing our algorithm raised some interesting issues.
The first issue is when prefetching should occur relative to the other
optimizations performed by the compiler. In particular, should prefetch
insertion occur before or after the bulk of scalar optimization? Inserting
prefetches before scalar optimization has the disadvantage that the
code seen by the prefetching pass may change significantly once
optimization takes place; these changes may make it difficult to schedule
prefetches the proper amount of time in advance. For example, if the
optimizer eliminates a significant fraction of instructions in a loop body,
the software pipelining algorithm may underestimate the number of loop
iterations necessary to provide enough computation to hide memory latency
(i.e., its estimate of the computation available per loop iteration, which
appears in the scheduling equation sketched below, may be
overestimated). On the other hand, inserting prefetches after scalar
optimization has the following two disadvantages: (i) much of the
high-level representation of arrays and loops may be unrecognizable, making
it difficult to perform locality analysis; and (ii) the compiler must be
extremely careful only to insert highly efficient prefetching code, since
scalar optimization will not be run again.
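For concreteness, the scheduling computation referred to above can be
sketched as follows (the symbols here are our illustration, not
necessarily the notation used elsewhere): if $l$ is the memory latency to
be hidden, in cycles, and $s$ is the estimated number of cycles along the
shortest path through one loop iteration, then each prefetch is issued

\[ d = \left\lceil \frac{l}{s} \right\rceil \]

iterations ahead of the corresponding reference. If scalar optimization
later shrinks the loop body, the true value of $s$ is smaller than the
estimate, so $d$ comes out too small and prefetches arrive too late to
hide the latency.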
In our experience, the advantages of inserting prefetches before
scalar optimization far outweigh the disadvantages. First, the ability to
recognize high-level structures such as arrays and ``for'' loops is
essential for our analysis, and becomes prohibitively difficult once scalar
optimization has radically altered the code. Second, the fact that scalar
optimization is performed on the prefetching code itself has two strong
advantages: (i) it greatly simplifies the task of inserting prefetches,
since code transformations such as loop unrolling and software pipelining
can be done in a straightforward manner without undue concern for minimizing
overhead; and (ii) scalar optimization appears to do a very good job of
minimizing prefetching instruction overhead. The overhead is often less
than 2 or 3 instructions per prefetch (as we will see in a later
chapter), and we believe these results are much
better than if the entire burden of minimizing overhead were placed on the
prefetch code-generating pass itself. Regarding the disadvantage of
inserting prefetches before scalar optimization, we found that in practice
the impact of reduced loop body sizes could be taken into account by simply
increasing the target memory latency parameter (i.e., the latency term in
the scheduling equation sketched earlier) by a small factor (less than
two). Although this
may result in prefetches being issued unnecessarily early (e.g., if scalar
optimization does not eliminate any instructions from the loop body), it
does not appear to have a noticeable negative impact on performance.
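To make this concrete, the following minimal sketch (in C) shows the sort
of software-pipelined loop such a prefetch insertion pass might emit
before scalar optimization runs. The function, the prefetch distance of
16 elements, and the four-doubles-per-cache-line assumption are all
illustrative, and __builtin_prefetch merely stands in for the MIPS
prefetch operation:

    /* Sum a[0..n-1], prefetching PF_DIST elements ahead; one prefetch
     * per cache line (4 doubles per line assumed). */
    #define PF_DIST 16

    double sum_array(const double *a, int n)
    {
        double sum = 0.0;
        int i;

        /* Prologue: prefetch the lines used by the first PF_DIST
         * iterations. */
        for (i = 0; i < PF_DIST && i < n; i += 4)
            __builtin_prefetch(&a[i]);

        /* Steady state: unrolled by the line size so that only one
         * prefetch is issued per cache line rather than per reference. */
        for (i = 0; i + PF_DIST < n; i += 4) {
            __builtin_prefetch(&a[i + PF_DIST]);
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }

        /* Epilogue: the remaining iterations issue no prefetches. */
        for (; i < n; i++)
            sum += a[i];
        return sum;
    }

Note how freely such code can be written: the cleanup of redundant
address arithmetic and induction variables is deliberately left to the
scalar optimizer that runs afterwards.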
There are some optimizations, however, that should be performed before prefetching, since they greatly improve the analysis. One such
example is interprocedural constant propagation, which helps eliminate
symbolic loop bounds and therefore makes it easier to compute the localized
iteration spaces. Our compiler performs its optimizations in a fixed
order. The optimizations performed before prefetch insertion tend to
improve the
information content of the loop and array constructs. After prefetches are
inserted, these high-level SUIF constructs (such as hierarchical abstract
representations of ``for'' loops) are deconstructed into simpler low-level
constructs (such as branch instructions) which more closely match
traditional RISC instruction sets. The bulk of scalar optimization (e.g.,
common subexpression elimination, etc.) is then performed on this
low-level representation. In general, we were quite happy with this
approach.
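As a contrived illustration of why interprocedural constant propagation
helps (the function names and the bound here are ours, not taken from any
benchmark), consider a loop whose bound arrives through a procedure
parameter:

    /* The bound n is symbolic inside scale(), so the localized
     * iteration space cannot be computed from this function alone. */
    void scale(double *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] *= 0.5;
    }

    void driver(double *a)
    {
        /* Interprocedural constant propagation can substitute n = 100
         * here, giving locality analysis an exact iteration space when
         * it decides which references need prefetching. */
        scale(a, 100);
    }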
We did experience one surprising problem with the scalar optimizer,
however, during our early experiments with the compiler. Our initial
prefetching code had very large instruction overheads: often more than 6
instructions per prefetch. We discovered that the problem was that our
scalar optimizer did not take full advantage of the ``base-plus-offset''
addressing mode in the MIPS instruction set: when two addresses differed
by a constant amount, it did not reuse the same base register with
different constant offsets. Instead, the optimizer used a separate base
register to compute each address. While this optimization may not have
been a high priority for normal code, it was crucial for our prefetching
code, since each prefetch tends to differ by a constant amount from a
load or store, and loop unrolling creates many copies of array references
that differ by constant amounts. Without this optimization, the compiler
quickly ran out of registers and suffered the large overheads of spilling
values to memory. Once this optimization was implemented, we saw dramatic
reductions in prefetch instruction overhead.
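To illustrate the addressing-mode issue (the loop and the MIPS sequences
in the comments are schematic sketches, not actual compiler output),
consider an unrolled loop in which several addresses differ from one
another by small constants:

    void kernel(double *a, int n)
    {
        for (int i = 0; i + 4 < n; i += 2) {
            __builtin_prefetch(&a[i + 4]);  /* &a[i] plus a constant 32 bytes */
            a[i]     += 1.0;
            a[i + 1] += 1.0;                /* &a[i] plus a constant 8 bytes */
        }
    }

    /* With base-plus-offset reuse, all three addresses can share one
     * base register, schematically:
     *     pref  0, 32($t0)      # prefetch a[i+4]
     *     l.d   $f0, 0($t0)     # load a[i]
     *     l.d   $f2, 8($t0)     # load a[i+1]
     * Without it, our optimizer materialized a separate base register
     * for each address, exhausting the register file and forcing
     * spills. */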