As we saw earlier in Figure , WATER suffers the
least from memory latency of all the SPLASH applications, spending only 7% of its time stalled for memory. Although there is little need for
prefetching in this case, we discovered nonetheless that our algorithm is
unable to cover the misses. The reason is that the key loop body is
not in the same file as its surrounding loop. Since our prefetching
algorithm does not perform interprocedural analysis (particularly not
across separate files, which becomes very tricky given separate
compilation), it fails to recognize the affine access patterns, and
therefore does not insert any prefetches at all. With either
interprocedural analysis or inlining across separate files, the compiler
could easily prefetch the references and hide the memory latency. Since the
solution to this problem is well understood, and since there is little
performance gain to be had, we did not bother to insert the prefetches by
hand for this case. WATER is an example of a case where strengthening the
implementation of the existing algorithm would solve the problem.
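
The situation can be illustrated with a minimal sketch. This is not the WATER source: the names body, sum_with_call, sum_inlined and PF_DIST are hypothetical, the prefetch distance is an arbitrary assumption, and __builtin_prefetch is the GCC/Clang builtin standing in for a compiler-inserted prefetch. The first loop shows what the compiler sees under separate compilation, namely an opaque call; the second shows the same loop after inlining, where the reference is visibly affine in the loop index and can be prefetched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a loop body that lives in another file.
   Under separate compilation, the compiler processing the caller
   sees only this prototype, not the access x[i] inside the body. */
double body(const double *x, size_t i);

/* In the scenario described above, this definition would sit in a
   different file; it is placed here only so the sketch compiles
   on its own. */
double body(const double *x, size_t i) {
    return x[i] * x[i];
}

/* The caller as written: the loop contains nothing but a call, so
   an analysis confined to this file finds no affine array reference
   to prefetch. */
double sum_with_call(const double *x, size_t n) {
    assert(x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += body(x, i);
    return s;
}

/* Assumed prefetch distance in iterations (illustrative value). */
#define PF_DIST 16

/* The same loop after inlining the body: x[i] is now visibly affine
   in i, so a prefetch can be issued PF_DIST iterations ahead. For
   simplicity this prefetches on every iteration rather than once
   per cache line. */
double sum_inlined(const double *x, size_t n) {
    assert(x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[i + PF_DIST], 0, 3); /* read access, keep in cache */
        s += x[i] * x[i];
    }
    return s;
}
```

Both functions compute the same result; the only difference is whether the access pattern is visible at the point where prefetches would be inserted.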