For VPENTA, the locality optimizer introduces spatial locality for every reference in the inner loop by interchanging two of the surrounding loops. In other words, rather than iterating along the columns of the matrices, which results in a miss on every iteration since the data are stored in row-major order, the code has been restructured to iterate along the rows of the matrices instead. Therefore, references miss only when they cross a cache line boundary, which in this case happens once every four iterations.
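The transformation can be sketched as follows. This is a minimal, illustrative C fragment rather than the actual VPENTA code; the array size and the assumption of four 8-byte doubles per 32-byte cache line are ours.

```c
/* Hypothetical sketch of loop interchange for spatial locality.
 * Assumes a matrix a[N][N] stored in row-major order, as in C. */
#define N 128
double a[N][N];

/* Before interchange: the inner loop walks down a column.  Consecutive
 * accesses are N*8 bytes apart, so nearly every iteration misses. */
void column_order(void)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] += 1.0;
}

/* After interchange: the inner loop walks along a row.  A miss occurs
 * only when a cache line boundary is crossed -- once every four
 * iterations for 8-byte doubles in 32-byte lines. */
void row_order(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] += 1.0;
}
```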
With this locality optimization alone, the performance improves significantly. However, the selective prefetching scheme without this optimization performs better, since it manages to eliminate almost all memory stall cycles. Comparing the prefetching schemes before and after the loop interchange, we see that the indiscriminate prefetching scheme improves by only 11%, while the selective prefetching scheme improves by 25%. The selective scheme improves more because it recognizes that after loop interchange it only has to issue one-fourth as many prefetches, and it is therefore able to reduce its instruction overhead accordingly. The indiscriminate scheme, in contrast, does not recognize that many of its prefetches are now unnecessary, and therefore continues to suffer from large instruction overhead.
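The difference in overhead between the two schemes can be illustrated with the following sketch. It is not the compiler's actual output; the prefetch distance, the use of GCC's __builtin_prefetch, and the padded array bound are assumptions made for the example.

```c
/* Illustrative comparison of indiscriminate vs. selective prefetching
 * after loop interchange.  DIST is an assumed prefetch distance; the
 * row is padded by DIST so prefetch addresses stay within the array. */
#define N    128
#define DIST 16
double a[N][N + DIST];

/* Indiscriminate: one prefetch per iteration, even though only one
 * access in four can miss -- most prefetches are wasted overhead. */
void indiscriminate(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            __builtin_prefetch(&a[i][j + DIST]);
            a[i][j] += 1.0;
        }
}

/* Selective: the loop is unrolled by four (one cache line of doubles),
 * and a single prefetch covers the whole line, cutting the number of
 * prefetch instructions to one-fourth. */
void selective(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j += 4) {
            __builtin_prefetch(&a[i][j + DIST]);
            a[i][j]     += 1.0;
            a[i][j + 1] += 1.0;
            a[i][j + 2] += 1.0;
            a[i][j + 3] += 1.0;
        }
}
```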
The best overall performance, by a substantial margin, is achieved only by combining the locality optimization with selective prefetching.