Since prefetching hides rather than reduces latency, it can only improve performance if additional memory bandwidth is available. This is because prefetching does not decrease the number of memory accesses; it simply tries to perform them over a shorter period of time. Consequently, if a program is already memory-bandwidth limited, prefetching cannot increase performance. Locality optimizations such as cache blocking, however, actually decrease the number of accesses to main memory, thereby reducing both latency and required bandwidth. Therefore, the best approach for coping with memory latency is to first reduce it as much as possible, and then hide whatever latency remains. Our compiler can do both things automatically by first applying locality optimizations and then inserting prefetches.
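To make the "reduce first, then hide" ordering concrete, the following C sketch applies both steps by hand to a matrix multiplication. It is only an illustration under stated assumptions, not the output of the compiler described here: the function names, the tile size `BLOCK`, and the `PREFETCH_W` macro are hypothetical, and the prefetch relies on the GCC/Clang builtin `__builtin_prefetch` (on other compilers the macro expands to nothing).

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* hypothetical tile size; in practice tuned to the cache */

/* Assumption: GCC/Clang builtin; a no-op elsewhere. rw=1 means "for write". */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_W(p) __builtin_prefetch((p), 1, 3)
#else
#define PREFETCH_W(p) ((void)0)
#endif

/* Unblocked reference: c += a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Step 1 (reduce): tile the k and j loops so that a BLOCK x BLOCK tile of b
 * is reused for every row i before it is evicted.
 * Step 2 (hide): prefetch the start of the next row segment of c, which is
 * not reused across i and would otherwise miss. */
void matmul_blocked_prefetch(size_t n, const double *a, const double *b,
                             double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t kk = 0; kk < n; kk += BLOCK) {
        size_t k_end = (kk + BLOCK < n) ? kk + BLOCK : n;
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
            for (size_t i = 0; i < n; i++) {
                if (i + 1 < n)
                    PREFETCH_W(&c[(i + 1) * n + jj]); /* one line only */
                for (size_t k = kk; k < k_end; k++) {
                    double aik = a[i * n + k];
                    for (size_t j = jj; j < j_end; j++)
                        c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
    }
}
```

Both routines compute the same product; the blocked version fetches each tile of `b` from main memory far fewer times, and the prefetch then overlaps part of the remaining misses on `c` with computation. The single prefetch per row is deliberately simplistic; a real prefetching pass would schedule prefetches a computed distance ahead and cover every cache line that is expected to miss.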
We compiled each of the benchmarks with the locality optimizer enabled [87]. In two of the cases (GMTRY and VPENTA), there was a significant improvement in locality, and thus in performance. Both of these cases are presented in Figure . For each case, we show two sets of three performance bars: the three on the left are without locality optimizations, and the three on the right are with locality optimizations enabled. The latter three bars show locality optimization by itself (N) and in combination with the two prefetching schemes (I and S).