To schedule the prefetches for indirect references, a minor modification of the software pipelining algorithm is needed. Figure shows how software pipelining would work for the example in Figure . The important thing to focus on is the steady-state code. Assuming that five iterations are needed to hide the memory latency, you can see that in the steady state, the index array is prefetched ten iterations ahead. Five iterations after that, the index is dereferenced to compute the indirect array address. This leaves another five iterations to hide the latency of fetching this reference. Therefore, no cache misses will occur. To generate this code, extra prolog and epilog loops are needed. Note that this same technique can be generalized to prefetch an arbitrary number of indirections ahead of time.
One complication with prefetching indirect references is that if the index array is modified within the loop, then the index value might not be valid at the time of prefetch, which can potentially result in a prefetch being issued for an illegal address (e.g., unallocated memory in the virtual address space of the process). With only a single level of indirection, as in Figure , this will not cause a problem since at worst only the prefetch address may be invalid, and prefetches are defined not to take memory exceptions. However, with two or more levels of indirection, not only may prefetch addresses be invalid, but load addresses during dereferences may be invalid as well. Unlike prefetches, load instructions do take memory exceptions on invalid addresses, which will prove either costly or fatal for the application.
For example, consider the loop in Figure (a), where the value of index1[i] is only set within the bounds of the index2 array during the same iteration A[index2[index1[i]]] is referenced. Figure (b) shows the steady state loop for code that prefetches the index1, index2, and A arrays far enough in advance to hide memory latency (a straightforward derivation from the code in Figure (b)). However, since these prefetches are issued before index1[i] is valid, the prefetch of &index2[index1[i+10]] may be to an invalid address (which simply means that the prefetch will be dropped), and the subsequent load of index2[index1[i+5]] when computing &A[index2[index1[i+5]]] may also be invalid, which would result in a load exception.
To avoid suffering load exceptions with two or more levels of indirection, there are two choices. First, if the compiler can determine that there is no possibility that the index values will be modified within the loop nest, then it is safe to schedule the prefetches as normal. Factors that will make this difficult include memory aliasing and procedure calls. If the compiler is uncertain about whether the index values may be modified, then it must be conservative and prefetch no more than the first level of indirection (e.g., it is safe to prefetch &index2[index1[i+10]] but not &A[index2[index1[i+5]]] in Figure (b)). Second, rather than using normal load instructions when computing the indirect prefetch addresses, special non-excepting loads could be used instead. For example, rather than suffering a costly memory exception on an invalid address, a non-excepting load might simply return a value of zero. This way, although the prefetch addresses would still be incorrect, at least the only overhead would be wasted instructions, rather than costly (or potentially fatal) exceptions.
In our experiments, we avoid these memory exception problems by only scheduling prefetches for up to a single level of indirection. For array-based codes, a single level of indirection appears to be the most common case (and is the only case we observe in our benchmarks). With linked lists, however, one could imagine arbitrarily large numbers of indirections.