In this chapter, we evaluate the performance benefits of prefetching for array-based uniprocessor applications. We begin by describing the experimental framework used throughout the chapter, including our architectural assumptions, benchmarks, compile-time parameters, and simulation environment. The results of these experiments are then presented in four major subsections.
First, we present a detailed evaluation of the algorithm described in the previous chapter for prefetching affine array references. We observe that each component of this core compiler algorithm is effective at achieving its goal, improving overall performance by as much as a factor of two.
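For concreteness, the sketch below shows the style of code the algorithm produces for a simple affine reference. The specific values (32-byte cache lines, 8-byte elements, a 16-iteration prefetch distance) and the prefetch() primitive are illustrative assumptions, not the exact output of our compiler:

    #include <stddef.h>

    /* prefetch() stands in for a machine-specific non-binding
     * prefetch instruction (e.g., __builtin_prefetch on GCC). */
    #define prefetch(addr) __builtin_prefetch(addr)

    /* Sum the elements of A, prefetching 16 iterations ahead.
     * Assumes 8-byte elements, 32-byte lines (one miss per four
     * iterations), and n a multiple of 4 with n >= 16. */
    double sum_with_prefetch(const double *A, size_t n)
    {
        double sum = 0.0;
        size_t i;

        /* prolog: cover the data for the first 16 iterations */
        for (i = 0; i < 16; i += 4)
            prefetch(&A[i]);

        /* steady state: unrolled by the number of elements per
         * line, so one prefetch is issued per cache line rather
         * than per element */
        for (i = 0; i + 16 < n; i += 4) {
            prefetch(&A[i + 16]);
            sum += A[i] + A[i+1] + A[i+2] + A[i+3];
        }

        /* epilog: last iterations, whose data is already fetched */
        for (; i < n; i++)
            sum += A[i];

        return sum;
    }

The prolog/steady-state/epilog structure corresponds to the software-pipelining step of the algorithm, and issuing one prefetch per cache line rather than one per reference is its selectivity at work.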
Second, we evaluate the robustness of this algorithm by varying the compile-time parameters that are determined heuristically rather than tuned precisely to a specific architecture (i.e., the effective cache size, the target memory latency, and the policy on loops with unknown bounds). The results show that these parameter variations affect only a small subset of the applications, and the performance impact in those cases is generally small; the algorithm therefore appears to be robust.
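To illustrate how one of these parameters enters the algorithm, consider the target memory latency: it is converted into a prefetch distance in loop iterations by dividing by the estimated work per iteration and rounding up. The helper below is a hedged sketch with names of our own choosing:

    /* Prefetch distance in iterations: the target memory latency
     * divided by the estimated cycles per loop iteration, rounded
     * up so that prefetched data always arrives in time. */
    unsigned prefetch_distance(unsigned latency_cycles,
                               unsigned cycles_per_iter)
    {
        return (latency_cycles + cycles_per_iter - 1)
               / cycles_per_iter;
    }

For example, a 100-cycle target latency and a 12-cycle loop body give a distance of ceil(100/12) = 9 iterations. Overestimating the latency simply issues prefetches a few iterations earlier than necessary, which suggests why the algorithm tolerates imprecise parameter values.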
Third, having examined prefetching in isolation, we evaluate the interaction between prefetching and another powerful technique for improving the memory performance of dense-matrix codes: locality optimizations, which restructure loops (e.g., through cache tiling) to eliminate misses rather than hide their latency. The results illustrate that prefetching and locality optimizations are complementary, and should therefore be combined.
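The intuition behind this complementarity can be seen in a small example: tiling reduces the number of misses by restructuring loops for reuse, while prefetching hides the latency of the cold misses that tiling cannot eliminate. The sketch below (the tile size, names, and four-doubles-per-line assumption are ours) combines the two on a blocked transpose:

    #define TILE 32   /* illustrative tile size */

    /* Transpose an n x n matrix one TILE x TILE block at a time,
     * so each block of src and dst stays cache-resident (the
     * locality optimization), while prefetching the next row of
     * the current block overlaps the remaining cold misses.
     * Assumes n is a multiple of TILE and 4 doubles per line. */
    void transpose_tiled(int n, const double *src, double *dst)
    {
        for (int ii = 0; ii < n; ii += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE; i++) {
                    if (i + 1 < n)   /* stay within the array */
                        for (int jp = jj; jp < jj + TILE; jp += 4)
                            __builtin_prefetch(&src[(i + 1) * n + jp]);
                    for (int j = jj; j < jj + TILE; j++)
                        dst[j * n + i] = src[i * n + j];
                }
    }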
Fourth, having focused thus far only on affine array references, we extend our core algorithm to handle indirect references, which allows us to prefetch sparse-matrix codes. The results demonstrate that this relatively straightforward extension improves performance by as much as an additional 20%.
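The flavor of the extension is that an indirect reference such as x[col[j]] requires two stages of prefetching: the index vector col[] is an ordinary affine reference and is prefetched furthest ahead, and once an index value is in cache, the dependent reference can itself be prefetched. A hedged sketch on the inner loop of a sparse matrix-vector product, with distances (16 and 32) and names of our own choosing:

    /* One row of y = A*x in compressed-sparse-row form.
     * col[] is prefetched 32 iterations ahead (it is affine);
     * x[col[...]] is prefetched 16 iterations ahead, by which
     * time the index it needs has itself already been
     * prefetched (in steady state). */
    double spmv_row(const double *val, const int *col,
                    const double *x, int lo, int hi)
    {
        double sum = 0.0;
        for (int j = lo; j < hi; j++) {
            if (j + 32 < hi)
                __builtin_prefetch(&col[j + 32]);
            if (j + 16 < hi)
                __builtin_prefetch(&x[col[j + 16]]);
            sum += val[j] * x[col[j]];
        }
        return sum;
    }

(val[j] is also affine and would be prefetched by the core algorithm; it is omitted here to keep the two-stage indirect pattern visible.)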
Finally, we conclude the chapter with a summary of the important results.