To make our approach more concrete, consider the example code in Figure . Matrix A is partitioned such that each processor modifies its own row based on a value (myVal) that is computed over the entire A matrix. This computation is repeated over several timesteps. Barriers are used to synchronize the processors between computing their copies of myVal and applying them to the rows they own.
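
Since the figure itself is not reproduced here, the following is a minimal C sketch of the structure we have in mind, using pthreads barriers. The constants NUM_PROCS, N, and NUM_TIMESTEPS, the thread setup, and the particular reduction applied to A are illustrative assumptions rather than details taken from the original figure.

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>

    #define NUM_PROCS     4    /* illustrative constants, not from the figure */
    #define N             64
    #define NUM_TIMESTEPS 10

    static double A[NUM_PROCS][N];
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        int myProc = (int)(long)arg;          /* the row this processor owns */
        for (int t = 0; t < NUM_TIMESTEPS; t++) {
            /* First inner loop nest (the p loop): read the entire A matrix. */
            double myVal = 0.0;
            for (int p = 0; p < NUM_PROCS; p++)
                for (int j = 0; j < N; j++)
                    myVal += A[p][j];
            pthread_barrier_wait(&barrier);   /* all reads of A complete */
            /* Second inner loop: modify only the row this processor owns. */
            for (int j = 0; j < N; j++)
                A[myProc][j] += myVal;
            pthread_barrier_wait(&barrier);   /* all row updates complete */
        }
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_PROCS];
        pthread_barrier_init(&barrier, NULL, NUM_PROCS);
        for (long i = 0; i < NUM_PROCS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < NUM_PROCS; i++)
            pthread_join(threads[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }
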
If the compiler did not take communication into account, and if NumProcs were a small enough constant that the entire A matrix fit in the cache, then locality analysis would predict that the A matrix references have temporal locality along the outer t loop. This prediction is incorrect, however: when each processor modifies its own row in the second inner loop, that row is invalidated in the other processors' caches, so the temporal reuse of the entire A matrix in the first inner loop does not result in temporal locality.
Our compiler algorithm takes this communication into account by placing the t loop outside the localized iteration space: since it contains a barrier statement, it is unlikely to carry locality. As a result, the compiler would decide to prefetch the entire A matrix each time it enters the first inner loop nest (i.e., the p loop).
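
The effect of this decision can be illustrated with a minimal sketch of the transformed first loop nest. We model the compiler's prefetch instructions with the GCC __builtin_prefetch intrinsic; LINE_DOUBLES (the cache line size in doubles) and PREFETCH_DIST (the prefetch distance in array elements) are assumed parameters, and the loop splitting and software pipelining that the algorithm would normally perform are omitted for clarity.

    #define LINE_DOUBLES  8                   /* 64-byte line: an assumption */
    #define PREFETCH_DIST (4 * LINE_DOUBLES)  /* assumed prefetch distance */

    double myVal = 0.0;
    for (int p = 0; p < NUM_PROCS; p++)
        for (int j = 0; j < N; j++) {
            /* Issue one read-only prefetch per cache line, PREFETCH_DIST
             * elements ahead of the current reference. */
            if (j % LINE_DOUBLES == 0 && j + PREFETCH_DIST < N)
                __builtin_prefetch(&A[p][j + PREFETCH_DIST], 0, 3);
            myVal += A[p][j];
        }
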
Note that this is somewhat conservative, since the row owned by each processor does remain in the cache: it is not modified by any other processor, so prefetching it again is unnecessary. The compiler could take advantage of this if it had a better understanding of the actual communication patterns, as sketched below.
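
For instance, if the analysis could prove that row myProc is written only by its owner, the prefetches for that row could be suppressed. The following hypothetical refinement illustrates this, reusing the assumed names from the earlier sketches.

    for (int p = 0; p < NUM_PROCS; p++)
        for (int j = 0; j < N; j++) {
            /* Skip prefetches for the owned row: it is never invalidated
             * by other processors, so it stays resident across timesteps. */
            if (p != myProc && j % LINE_DOUBLES == 0 && j + PREFETCH_DIST < N)
                __builtin_prefetch(&A[p][j + PREFETCH_DIST], 0, 3);
            myVal += A[p][j];
        }
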
However, given that the latency of coherence misses is likely to be large (since the data may reside far away in another processor's cache), the instruction overhead of conservatively issuing some unnecessary prefetches may be acceptable in exchange for the potential improvement in the coverage of coherence misses. We could also improve upon this further through the use of profiling feedback, as we will discuss later in Section .