In this section we evaluate the effectiveness of our core prefetching
algorithm, which was described in an earlier chapter.
We start with a brief, high-level evaluation of the overall performance of
the compiler algorithm. We then focus in greater detail on each of the
three key aspects of the compiler algorithm (locality analysis, loop
splitting, and software pipelining) in the three subsections that follow.
The results of our first set of experiments are presented in the
accompanying figure and table. The figure shows the overall performance
improvement achieved through our selective prefetching algorithm. For
each benchmark, the two bars correspond to the cases with no
prefetching (N) and with selective prefetching (S). In
each bar, the bottom section is the amount of time spent executing
instructions (including instruction overhead of prefetching), and the
section above that is the memory stall time. For the prefetching
cases, there is also a third component: stall time due to memory
overheads caused by prefetching. Specifically, the stall
time corresponds to two situations: (1) when the processor attempts to
issue a prefetch but the prefetch issue buffer is already full, and
(2) when the processor attempts to execute a load or store when the
cache tags are already busy with a prefetch fill.
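To make these two sources of overhead concrete, the sketch below shows one way this per-cycle accounting could be expressed. The structure, field names, and buffer capacity are our own illustrative assumptions, not the actual simulator code.

    /* Illustrative sketch of how prefetch-related memory overhead
       could be attributed; all names and sizes are hypothetical. */
    typedef struct {
        int issue_buffer_entries;   /* prefetches currently outstanding     */
        int issue_buffer_capacity;  /* e.g., 16 entries (assumed)           */
        int tags_busy_with_fill;    /* nonzero while a prefetch fill holds
                                       the primary-cache tags               */
    } MemSystem;

    long pf_overhead_stalls = 0;    /* cycles charged to prefetching        */

    /* called once per simulated cycle */
    void account_cycle(MemSystem *m, int wants_prefetch, int wants_load_store)
    {
        /* case 1: prefetch cannot issue because the issue buffer is full */
        if (wants_prefetch && m->issue_buffer_entries == m->issue_buffer_capacity)
            pf_overhead_stalls++;
        /* case 2: load/store blocked because a prefetch fill holds the tags */
        else if (wants_load_store && m->tags_busy_with_fill)
            pf_overhead_stalls++;
    }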
As the figure shows, the overall speedup ranges
from 5% to 100%, with 6 of the 13 benchmarks improving by over 45%. The
memory stall time is significantly reduced in all cases. The table
indicates that this is accomplished by
reducing both the primary miss-rate and the average primary-miss penalty.
The miss penalty is reduced because even if a prefetched line is replaced
from the primary cache before it can be referenced, it is still likely to
be present in the secondary cache. Also, the miss latency may be partially
hidden if the miss occurs while the prefetch access is still in progress.
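The second effect can be stated precisely; the notation below is introduced only for illustration. If a reference arrives $t_{\mathrm{ref}} - t_{\mathrm{pf}}$ cycles after its prefetch was issued, and the full miss latency is $L$ cycles, then the visible stall is only

\[ \mathrm{stall} \;=\; \max\bigl(0,\; L - (t_{\mathrm{ref}} - t_{\mathrm{pf}})\bigr), \]

so a prefetch issued even a fraction of the memory latency ahead of the reference hides that fraction of the penalty.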
Overall, 33% to 90% of the original memory stall cycles are eliminated.
Having established the benefits of prefetching, we now focus on the
costs. The figure shows that the instruction
overhead of prefetching causes less than a 15% increase in instruction
count in over half of the benchmarks. In fact, in two of those cases
(MXM and IS) the number of instructions actually decreased due to
savings through loop unrolling. In other cases (CHOLSKY, BTRIX,
VPENTA, TOMCATV, OCEAN), the number of instructions increased by 25% to
50%. Finally, the stalls due to prefetching memory overhead are
typically small, never more than 15% of original execution time. In
each case, we observe that the overheads of prefetching are low enough
compared to the gains that the net improvement remains large.
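As a rough illustration of where this instruction overhead comes from, the sketch below shows a simple loop before and after prefetches are inserted. The prefetch() call, the four-element cache line, and the prefetch distance of 16 elements are assumptions made for the example; this is not output of our compiler.

    /* original loop */
    for (i = 0; i < n; i++)
        sum += a[i];

    /* after inserting prefetches: the loop is unrolled by 4 so that only
       one prefetch is issued per (assumed) 4-element cache line, keeping
       the added overhead to roughly one instruction per 4 iterations */
    for (i = 0; i + 3 < n; i += 4) {
        prefetch(&a[i + 16]);      /* fetch the line 16 elements ahead */
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++)             /* epilogue for leftover iterations */
        sum += a[i];

The same unrolling also explains why the instruction count can occasionally decrease, as observed for MXM and IS: the loop bookkeeping is amortized over several iterations.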
These high-level results suggest that our prefetching algorithm is successful at improving performance. To evaluate the algorithm in greater depth, the following three subsections focus specifically on each key aspect of the selective prefetching algorithm. We begin with the first step in our algorithm, which is using locality analysis to determine which references should be prefetched.