For our experiments in Chapter , we have assumed that if the processor attempts to issue a prefetch while the prefetch issue buffer is full, the processor stalls until an entry becomes available. We now consider what happens if prefetches are instead dropped when the prefetch issue buffer is full.
In the architectural model presented so far, the memory subsystem has a finite (16-entry) prefetch issue buffer to hold outstanding prefetch requests. A prefetch is inserted into the buffer only if it misses in the primary cache and there is not already an outstanding prefetch for the same cache line. A prefetch is removed from the issue buffer as soon as it completes (i.e. the buffer is not a FIFO queue), so entries may retire out of order when miss latencies vary. Despite these optimizations, the buffer may still fill up if the processor issues prefetches faster than the memory subsystem can service them.
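To make these insertion and completion rules concrete, the following C sketch models the buffer; the structure and function names (PrefetchIssueBuffer, pib_insert, and so on) are illustrative rather than taken from the simulator itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define PIB_SIZE 16          /* 16-entry prefetch issue buffer */

typedef struct {
    uint64_t line_addr[PIB_SIZE];   /* cache-line address of each request */
    bool     valid[PIB_SIZE];       /* slot holds an outstanding prefetch */
} PrefetchIssueBuffer;

/* True if a prefetch for this line is already outstanding. */
static bool pib_contains(const PrefetchIssueBuffer *b, uint64_t line) {
    for (int i = 0; i < PIB_SIZE; i++)
        if (b->valid[i] && b->line_addr[i] == line)
            return true;
    return false;
}

/* Insert a prefetch that missed in the primary cache.  Returns false
 * only if the buffer is full; the stall-vs-drop decision is left to
 * the caller.  The buffer is not a FIFO: any free slot may be used,
 * and completions may retire entries out of insertion order. */
static bool pib_insert(PrefetchIssueBuffer *b, uint64_t line) {
    if (pib_contains(b, line))
        return true;        /* already outstanding: nothing to insert */
    for (int i = 0; i < PIB_SIZE; i++) {
        if (!b->valid[i]) {
            b->valid[i] = true;
            b->line_addr[i] = line;
            return true;
        }
    }
    return false;           /* buffer full */
}

/* Remove an entry as soon as its miss is serviced. */
static void pib_complete(PrefetchIssueBuffer *b, uint64_t line) {
    for (int i = 0; i < PIB_SIZE; i++)
        if (b->valid[i] && b->line_addr[i] == line)
            b->valid[i] = false;
}
```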
Once the prefetch issue buffer is full, the processor is unable to issue further prefetches. At that point the choices are either to stall the processor until a buffer entry becomes available, or else to drop the prefetch and continue executing. Intuitive arguments can be made for either approach. On one hand, since only prefetches that miss go into the buffer, the data is not presently in the cache, so it may appear cheaper to stall now until a single entry is free than to suffer an entire cache miss sometime in the future. On the other hand, since a prefetch is only a performance hint, perhaps it is better to continue executing useful instructions.
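The two policies differ only at the point where insertion fails. A minimal sketch of the issue path, building on the buffer sketch above and assuming hypothetical helpers cache_hit() and advance_one_cycle() for the primary-cache lookup and the cycle-level stall:

```c
typedef enum { POLICY_STALL, POLICY_DROP } FullBufferPolicy;

extern bool cache_hit(uint64_t line);      /* assumed helper */
extern void advance_one_cycle(void);       /* assumed helper */

void issue_prefetch(PrefetchIssueBuffer *b, uint64_t line,
                    FullBufferPolicy policy)
{
    if (cache_hit(line))
        return;                      /* only misses enter the buffer */

    if (policy == POLICY_STALL) {
        /* Stall-on-full: block until an entry becomes available. */
        while (!pib_insert(b, line))
            advance_one_cycle();
    } else {
        /* Drop-on-full: discard the prefetch and keep executing. */
        (void)pib_insert(b, line);
    }
}
```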
To understand this issue, we ran each of our uniprocessor benchmarks again using a model in which prefetches are dropped, rather than the processor stalled, when the prefetch issue buffer is full. We ran this model for both the indiscriminate and selective prefetching algorithms. Figure shows the cases where this change affected performance.
We begin by focusing on the indiscriminate algorithm, and then turn to the selective algorithm.
For all seven cases where the performance of the indiscriminate algorithm changed (shown in Figure (a)), performance improved when prefetches were dropped. The improvement is dramatic in the two cases that had previously stalled the most due to full buffers (CFFT2D and CG). There are two reasons why performance improves substantially for the indiscriminate prefetching algorithm. The first is that dropping prefetches increases the chances that future prefetches will find open slots in the prefetch issue buffer. The second is that since the indiscriminate algorithm issues many more redundant (i.e. unnecessary) prefetches, dropping a prefetch does not necessarily lead to a cache miss: the algorithm may issue another prefetch of the same line before the line is referenced. Dropping prefetches thus sacrifices some amount of coverage (and therefore memory stall reduction) for the sake of reducing prefetch issue overhead. This effect is most clearly illustrated in the case of CG (see Figure (a)), where memory stall time doubles for the indiscriminate algorithm once prefetches are dropped.
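To see why redundant prefetches make dropping relatively harmless, consider a loop that indiscriminately prefetches ahead on every iteration. The C sketch below uses GCC's __builtin_prefetch intrinsic as a stand-in for an explicit prefetch instruction, and assumes, purely for illustration, 8-byte elements, 32-byte cache lines, and a prefetch distance of eight iterations:

```c
/* Illustrative only: __builtin_prefetch stands in for a
 * compiler-inserted prefetch instruction. */
double sum_indiscriminate(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        /* Prefetch a fixed distance ahead on EVERY iteration.  With
         * 8-byte doubles and 32-byte lines (illustrative values),
         * four consecutive prefetches target the same line, so three
         * of every four are redundant.  A dropped prefetch can thus
         * be covered by a later redundant one before a[i] is used. */
        __builtin_prefetch(&a[i + 8]);
        sum += a[i];
    }
    return sum;
}
```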
The selective prefetch algorithm, in contrast, did not improve when prefetches were dropped, since it suffered very little from full prefetch issue buffers in the first place. In fact, in the three cases shown in Figure (b), the selective algorithm performed slightly worse when prefetches were dropped. The reason is that since selective prefetching eliminates many of the redundant prefetches, a dropped prefetch is more likely to translate into a subsequent cache miss. However, as we have already seen in Figure , the selective algorithm tends to suffer very little from full issue buffers, and therefore performs well under either policy.
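For contrast, here is a sketch of the selective version of the same loop under the same illustrative assumptions: the loop is unrolled by the four elements per cache line so that each line is prefetched exactly once, leaving no redundant prefetch to cover for one that is dropped.

```c
/* Selective: one prefetch per cache line (four doubles per line under
 * the same illustrative assumptions).  If this single prefetch is
 * dropped, the next access to the line is a full cache miss.
 * Remainder iterations are omitted for brevity. */
double sum_selective(const double *a, int n)
{
    double sum = 0.0;
    for (int i = 0; i + 3 < n; i += 4) {
        __builtin_prefetch(&a[i + 8]);   /* one prefetch per line */
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    return sum;
}
```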