Since software-controlled prefetching has a cost as well as a benefit, care must be taken when inserting prefetches so that the cost does not offset much of the latency-hiding benefit. The first step toward minimizing cost is prefetching selectively, to avoid the pure overhead of unnecessary prefetches. Our results in earlier sections demonstrate that selective prefetching can eliminate much of the prefetching overhead, and we discussed ways to improve this analysis further in an earlier section. While the overhead remaining after selective prefetching is typically quite small in comparison with the reduction in memory stall time, there are still a few cases where additional speedups of at least 10% could be achieved if it were possible to eliminate the remaining instruction overhead. In this section we address the second step toward reducing prefetching cost: minimizing the instruction overhead of the prefetches that are issued.
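To make the per-prefetch cost concrete, the following is a minimal sketch in C of selective prefetching for a simple strided loop. The names (sum_array, PREFETCH_DIST), the prefetch distance, the assumption of 8 doubles per cache line, and the use of the GCC/Clang __builtin_prefetch intrinsic are all illustrative assumptions rather than details of our compiler implementation; the point is that each issued prefetch carries instruction overhead beyond the prefetch itself.

    #include <stdio.h>
    #include <stddef.h>

    #define PREFETCH_DIST 16  /* assumed tuning parameter: iterations ahead */

    /* Sum an array with selective software prefetching: one prefetch per
       cache line (assumed to hold 8 doubles) rather than one per access. */
    static double sum_array(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Each issued prefetch still costs instructions: the
               selectivity test, the address computation, and the
               prefetch issue itself. */
            if ((i % 8) == 0 && i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST],
                                   0 /* read */, 3 /* keep in cache */);
            sum += a[i];
        }
        return sum;
    }

    int main(void)
    {
        double a[1024];
        for (size_t i = 0; i < 1024; i++)
            a[i] = (double)i;
        printf("%f\n", sum_array(a, 1024));
        return 0;
    }

Even with the selectivity test folded away by loop unrolling, the address computation and prefetch issue remain as per-prefetch instruction overhead, which is the cost this section seeks to minimize.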
Before we begin this discussion, let us consider how future trends are likely to affect the relative importance of prefetching instruction overhead. The first relevant trend is that the gap between processor and memory speeds will continue to grow. As this occurs, the cost of even the current level of instruction overhead will diminish relative to the latency-hiding benefit of each useful prefetch. The second important trend is continued improvement in the ability of processors to exploit instruction-level parallelism through techniques such as superscalar processing [75]. Since prefetch instructions can always be executed in parallel with other operations (because no other operations depend upon their completion), they stand to benefit substantially from the exploitation of instruction-level parallelism. Therefore, the absolute overhead of processing prefetch instructions is likely to decrease. The combined effect of these two trends is that prefetch instruction overhead should become less significant in the future.
Given these trends, why do we care about prefetch instruction overhead at
all? The first reason is that although prefetch instructions can
theoretically be executed in parallel with other operations, this
parallelism eliminates overhead only if the resources needed to execute
the prefetches would otherwise be idle. In practice, the functional units
that compute prefetch addresses and issue prefetches are also busy
handling normal loads and stores. Given this competition for critical
resources, it is unlikely that prefetch instruction overhead will be
completely hidden. The second reason is that prefetch instruction overhead
is an inherent problem in applications where there are few instructions
between cache misses. For such applications, a difference of only a single
instruction per prefetch can produce a large fractional increase in total
instruction count. For example, consider CHOLSKY in an earlier figure,
where selective prefetching increases the instruction count by roughly
50%. In this case the analysis is nearly perfect: only 9% of the
prefetches are unnecessary, and the miss coverage is 97%. The large
instruction overhead arises because cache misses occur quite frequently
(once every 11 instructions) and issuing each prefetch requires several
instructions (5, on average). Eliminating just a single instruction per
prefetch would decrease the instruction count by roughly 10% in this case,
as the following back-of-the-envelope calculation shows.
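As a rough sanity check on these figures (assuming, as the high miss coverage and low rate of unnecessary prefetches suggest, approximately one prefetch issued per original cache miss), the added instructions per miss relative to the original instructions per miss are
\[
\frac{5 \text{ instructions per prefetch}}{11 \text{ instructions per miss}} \approx 45\%,
\]
which is consistent with the observed increase in instruction count of roughly 50%. Likewise, removing one instruction per prefetch saves
\[
\frac{1 \text{ instruction per prefetch}}{11 \text{ instructions per miss}} \approx 9\%,
\]
consistent with the estimated reduction of roughly 10%.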
Therefore, since the instruction overhead of useful prefetches may be a concern in some cases but is probably not a major hindrance in general, we will discuss techniques for reducing this overhead only briefly in this section.