To schedule the prefetches for indirect references, a minor
modification of the software pipelining algorithm is needed. Figure
shows how software pipelining would work for the
example in Figure
. The important thing to focus on is
the steady-state code. Assuming that five iterations are needed to hide the
memory latency, you can see that in the steady state, the index array is
prefetched ten iterations ahead. Five iterations after that, the
index is dereferenced to compute the indirect array address. This leaves
another five iterations to hide the latency of fetching this reference.
Therefore, no cache misses will occur. To generate this code, extra prolog
and epilog loops are needed. Note that this same technique can be generalized
to prefetch an arbitrary number of indirections ahead of time.
One complication with prefetching indirect references is that if the index
array is modified within the loop, then the index value might not be valid
at the time of prefetch, which can potentially result in a prefetch being
issued for an illegal address (e.g., unallocated memory in the virtual
address space of the process). With only a single level of indirection, as
in Figure , this will not cause a problem since at
worst only the prefetch address may be invalid, and prefetches are defined
not to take memory exceptions. However, with two or more levels of
indirection, not only may prefetch addresses be invalid, but load
addresses during dereferences may be invalid as well. Unlike prefetches,
load instructions do take memory exceptions on invalid addresses,
which will prove either costly or fatal for the application.
For example, consider the loop in
Figure (a), where the value of index1[i] is only set within the bounds of the index2 array during
the same iteration A[index2[index1[i]]] is referenced.
Figure
(b) shows the steady state loop for
code that prefetches the index1, index2, and A arrays far
enough in advance to hide memory latency (a straightforward derivation from
the code in Figure
(b)). However, since these
prefetches are issued before index1[i] is valid, the prefetch of &index2[index1[i+10]] may be to an invalid address (which simply means
that the prefetch will be dropped), and the subsequent load of index2[index1[i+5]] when computing &A[index2[index1[i+5]]] may also
be invalid, which would result in a load exception.
To avoid suffering load exceptions with two or more levels of indirection,
there are two choices. First, if the compiler can determine that there is
no possibility that the index values will be modified within the loop nest,
then it is safe to schedule the prefetches as normal. Factors that will
make this difficult include memory aliasing and procedure calls. If the
compiler is uncertain about whether the index values may be modified, then
it must be conservative and prefetch no more than the first level of
indirection (e.g., it is safe to prefetch &index2[index1[i+10]] but
not &A[index2[index1[i+5]]] in
Figure (b)). Second, rather than using
normal load instructions when computing the indirect prefetch addresses,
special non-excepting loads could be used instead. For example,
rather than suffering a costly memory exception on an invalid address, a
non-excepting load might simply return a value of zero. This way, although
the prefetch addresses would still be incorrect, at least the only overhead
would be wasted instructions, rather than costly (or potentially fatal)
exceptions.
In our experiments, we avoid these memory exception problems by only scheduling prefetches for up to a single level of indirection. For array-based codes, a single level of indirection appears to be the most common case (and is the only case we observe in our benchmarks). With linked lists, however, one could imagine arbitrarily large numbers of indirections.