To make our approach more concrete, consider the example code in Figure . Matrix A is partitioned such that each processor modifies its own row based on a value (myVal) that is computed over the entire A matrix. This computation is repeated over several timesteps. Barriers are used to synchronize the processors between computing their copy of myVal and applying it to the row they own.
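Since the figure itself is not reproduced here, the sketch below illustrates the kind of SPMD code described above; the loop bounds, the array shape, the function name, and the barrier() primitive are illustrative assumptions rather than the figure's actual code.

    #define NPROCS 8     /* assumed: small enough that all of A fits in cache */
    #define N      256   /* assumed row length */
    #define NSTEPS 100   /* assumed number of timesteps */

    extern void barrier(void);   /* assumed runtime barrier primitive */

    double A[NPROCS][N];         /* shared matrix; processor me owns row A[me] */

    void timestep_loop(int me)
    {
        for (int t = 0; t < NSTEPS; t++) {
            /* First inner loop nest (the p loop): read the ENTIRE A matrix. */
            double myVal = 0.0;
            for (int p = 0; p < NPROCS; p++)
                for (int j = 0; j < N; j++)
                    myVal += A[p][j];

            barrier();   /* all reads of A complete before any row is written */

            /* Second inner loop: apply myVal to the one row this processor owns. */
            for (int j = 0; j < N; j++)
                A[me][j] += myVal;

            barrier();   /* writes visible before the next timestep's reads */
        }
    }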
If the compiler did not take communication into account, and if NumProcs were a small enough constant that the entire A matrix fit in the cache, locality analysis would predict that the A matrix references have temporal locality along the outer t loop. This prediction is incorrect, however: when each processor modifies its owned row in the second inner loop, that row is invalidated in all the other processors' caches, so the temporal reuse of the entire A matrix in the first inner loop does not translate into temporal locality.
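To see why, the following trace (phrased in terms of the sketch above, and assuming a standard invalidation-based coherence protocol) follows one processor's cached copy of A across consecutive timesteps:

    /* Cache behavior of processor me's copy of A, per timestep:
     *
     *   timestep t,   p loop:     me loads all NPROCS rows of A into its cache
     *   timestep t,   row update: every other processor q writes A[q][*],
     *                             invalidating those NPROCS-1 rows in me's cache
     *   timestep t+1, p loop:     me therefore misses on NPROCS-1 of NPROCS rows
     *
     * Only A[me][*] survives from one timestep to the next, so the reuse
     * along t yields actual locality for just 1/NPROCS of the references. */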
Our compiler algorithm takes this communication into account by placing the t loop outside the localized iteration space, since it contains a barrier statement and is therefore unlikely to carry locality. As a result, the compiler prefetches the entire A matrix each time it enters the first inner loop nest (i.e., the p loop). Note that this is somewhat conservative: the row owned by each processor does remain in its cache, since that row is not modified by any other processor. The compiler could exploit this if it had a better understanding of the actual communication patterns. However, since the latency of coherence misses is likely to be large (the data may reside far away in another processor's cache), the instruction overhead of conservatively issuing some unnecessary prefetches may be an acceptable price for the potential improvement in coverage of coherence misses. We could improve upon this further through the use of profiling feedback, as we discuss later in Section .
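A rough sketch of the resulting transformation is shown below, reusing the declarations from the earlier sketch. Here prefetch() stands in for whatever non-binding prefetch instruction the target machine provides, LINE_DOUBLES is an assumed cache-line size, and the one-row-ahead schedule is merely one plausible way to pipeline the prefetches; the scheduling details of the actual algorithm are omitted.

    #define LINE_DOUBLES 8   /* assumed: 64-byte lines hold 8 doubles */

    extern void prefetch(const void *addr);   /* assumed non-binding prefetch */

    void timestep_loop_prefetched(int me)
    {
        for (int t = 0; t < NSTEPS; t++) {
            /* t is outside the localized iteration space, so no locality is
             * assumed across timesteps: every line of A is prefetched on each
             * entry to the p loop.  This includes row A[me], which is
             * conservative, since the owned row was never invalidated. */
            for (int j = 0; j < N; j += LINE_DOUBLES)
                prefetch(&A[0][j]);                  /* prologue: first row */

            double myVal = 0.0;
            for (int p = 0; p < NPROCS; p++) {
                if (p + 1 < NPROCS)                  /* fetch the next row   */
                    for (int j = 0; j < N; j += LINE_DOUBLES)
                        prefetch(&A[p + 1][j]);      /* while summing this one */
                for (int j = 0; j < N; j++)
                    myVal += A[p][j];
            }

            barrier();
            for (int j = 0; j < N; j++)              /* owned-row update, as before */
                A[me][j] += myVal;
            barrier();
        }
    }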