Figure summarizes the major steps of our prefetching
algorithm. The example code and resulting data locality are shown in
Figures
(a) and
(b), respectively.
Figure
(c) shows the result of the analysis phase of our
algorithm, where we have constructed predicates indicating when all three
array references should be prefetched. Finally, Figure
(d) shows
the resulting code with prefetching, which is generated by
the scheduling phase of our algorithm as follows.
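For concreteness, the loop nest discussed in this example has roughly the following shape. This is only an illustrative sketch: the array bounds, the double element type, the function name, and the B[j][0] reference are assumptions; only the A[i][j] and B[j+1][0] reference patterns and the locality described below are taken from the text.

    /* Illustrative loop nest (not the exact code of Figure (a)).
     * B[j+1][0] is reused across iterations of i (temporal locality),
     * while A[i][j] walks sequentially through cache lines (spatial
     * locality). */
    double A[3][100];
    double B[101][8];

    void example(void)
    {
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 100; j++)
                A[i][j] = B[j][0] + B[j+1][0];
    }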
First, our algorithm performs loop splitting based on the prefetch
predicates. The ``i = 0'' predicate resulting from the temporal
locality of the B[j+1][0] reference causes the compiler to peel the
i loop. As we see in Figure (d), the prefetches for the
B matrix occur only in the peel of i. Also, to isolate the
spatial locality of the A[i][j] reference, the ``(j mod 2) = 0'' predicate causes the j loop to be unrolled by a
factor of 2, both in the peel and in the main iterations of the i
loop. As a result, Figure
(d) shows that there is only one
prefetch of the A matrix for every two copies of the loop body.
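A sketch of this loop-split structure (before the prefetches are software pipelined) is given below. Here prefetch() stands in for a non-binding prefetch instruction (mapped, for illustration, to GCC's __builtin_prefetch), the declarations of A and B are the illustrative ones from the sketch above, and the prefetches are issued at distance zero purely to show where loop splitting places them.

    extern double A[3][100];      /* illustrative declarations, as above */
    extern double B[101][8];

    #define prefetch(addr) __builtin_prefetch((addr))  /* stand-in prefetch op */

    void example_split(void)
    {
        /* Peel of i == 0: B[j+1][0] has temporal locality across i,
         * so the B matrix is prefetched only in this peel. */
        for (int j = 0; j < 100; j += 2) {    /* j unrolled by a factor of 2 */
            prefetch(&B[j+1][0]);
            prefetch(&B[j+2][0]);
            prefetch(&A[0][j]);               /* one A prefetch per two bodies */
            A[0][j]   = B[j][0]   + B[j+1][0];
            A[0][j+1] = B[j+1][0] + B[j+2][0];
        }

        /* Main iterations (i >= 1): only the A matrix is prefetched. */
        for (int i = 1; i < 3; i++)
            for (int j = 0; j < 100; j += 2) {
                prefetch(&A[i][j]);
                A[i][j]   = B[j][0]   + B[j+1][0];
                A[i][j+1] = B[j+1][0] + B[j+2][0];
            }
    }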
Once the miss instances have been isolated through loop splitting, the
compiler then software pipelines the prefetches as follows. Using the
algorithm in Figure , our compiler determines
that the shortest path through the loop body is 36 instructions long. Given
that the memory latency is 100 cycles for this example, equation ( ) yields that $\lceil 100/36 \rceil = 3$ iterations are sufficient to hide the latency. Once this
iteration count is determined, the code transformation is mechanical. In
Figure
(d) we see the resulting prolog, steady
state, and epilog loops in both the peel and the main iterations of
the i loop. Thus the code in Figure
(d) issues all of the
useful prefetches early enough to overlap the memory accesses with
computation on other data. (This is a source-level representation of the
actual code generated by our compiler for this case.)
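As a rough illustration of the final shape of the main iterations, the sketch below software pipelines the A prefetches three split-loop iterations (six elements) ahead, with an explicit prolog, steady-state loop, and epilog. It reuses the illustrative declarations and prefetch() stand-in from the sketches above; the exact prefetch offsets, loop bounds, and the handling of the i == 0 peel in Figure (d) may differ.

    extern double A[3][100];      /* illustrative declarations, as above */
    extern double B[101][8];

    #define prefetch(addr) __builtin_prefetch((addr))  /* stand-in prefetch op */

    void example_pipelined_main(void)
    {
        /* Main iterations (i >= 1) only; the i == 0 peel with the B
         * prefetches would be pipelined in the same way. */
        for (int i = 1; i < 3; i++) {
            /* Prolog: issue the first three prefetches before any computation. */
            prefetch(&A[i][0]);
            prefetch(&A[i][2]);
            prefetch(&A[i][4]);

            /* Steady state: each body prefetches three iterations ahead. */
            int j;
            for (j = 0; j < 94; j += 2) {
                prefetch(&A[i][j+6]);
                A[i][j]   = B[j][0]   + B[j+1][0];
                A[i][j+1] = B[j+1][0] + B[j+2][0];
            }

            /* Epilog: the last three iterations need no further prefetches. */
            for (; j < 100; j += 2) {
                A[i][j]   = B[j][0]   + B[j+1][0];
                A[i][j+1] = B[j+1][0] + B[j+2][0];
            }
        }
    }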