The goal of the analysis phase of prefetching is to decide which references should be prefetched. With the software-based approach, this analysis is performed by considering all references in the code and deciding whether or not they need to be prefetched. If so, the prefetches are scheduled by moving them back in time within the code. The hardware, however, does not have the luxury of examining the entire program and then moving prefetches back in time. The main challenge of deciding what to prefetch in hardware is predicting which locations will be referenced in the future. In the various hardware-based prefetching schemes that have been proposed, this prediction is achieved in one of three ways: (i) assume that there is abundant spatial locality (as in the schemes Porterfield studied [64]); (ii) decode ahead in the instruction stream (as in Lee's proposal [53]); or (iii) maintain a history of past access patterns (as in Baer and Chen's proposal [7]).

The idea behind the first technique is that whenever a cache line is accessed, neighboring cache lines will be accessed in the near future, so they should also be brought into the cache. The problem, of course, is that this is not always true, and fetching lines unnecessarily can hurt performance both by displacing useful data and by causing memory queueing delays. The second technique attempts to predict access patterns by simply ``getting ahead'' in the instruction stream. However, the limitations of buffering capacity and branch prediction accuracy make it difficult to decode far enough ahead to hide large latencies.

Finally, by maintaining history information, the hardware attempts to recognize constant-stride access patterns and prefetches ahead whenever those same instructions are executed. This last scheme appears to be the best of the three since it can prefetch constant-stride access patterns with arbitrary stride lengths (unlike the ``long cache line'' schemes, which only exploit unit-stride patterns) without the expense of decoding ahead in the instruction stream (as in Lee's scheme). However, the analysis capability of a history-based scheme has the following limitations: (i) it can only recognize constant-stride accesses (unlike our compiler, which can prefetch indirect references); (ii) it can only react to historical behavior (unlike the compiler, which can anticipate future misses); and (iii) the detection scheme depends upon the history table being large enough to contain the entire working set of references, which may not be possible (particularly given that this structure is content-addressable in order to look up the address of the lookahead PC).
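To make the history-based technique concrete, the following fragment sketches in C how a stride-detecting history table might operate. The table size, the direct-mapped indexing by load PC, and the repeat-once confirmation policy are illustrative assumptions of ours, not the details of Baer and Chen's design [7].

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

#define RPT_ENTRIES 64        /* assumed table size */

typedef struct {
    uint64_t pc;              /* load instruction that owns this entry */
    uint64_t last_addr;       /* last address it referenced */
    int64_t  stride;          /* most recently observed stride */
    bool     confirmed;       /* has the stride repeated? */
} RPTEntry;

static RPTEntry rpt[RPT_ENTRIES];

/* Called once per executed load; returns the address to prefetch,
 * or 0 when no confident prediction exists. */
uint64_t rpt_lookup(uint64_t pc, uint64_t addr)
{
    RPTEntry *e = &rpt[(pc >> 2) % RPT_ENTRIES];  /* direct-mapped by PC */

    if (e->pc != pc) {            /* new load: begin tracking its history */
        e->pc = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->confirmed = false;
        return 0;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confirmed = (stride != 0 && stride == e->stride);
    e->stride    = stride;
    e->last_addr = addr;

    /* Prefetch one stride ahead only after the stride repeats, so an
     * isolated jump in the address stream does not trigger a prefetch. */
    return e->confirmed ? addr + (uint64_t)stride : 0;
}
\end{verbatim}

Note that the table tracks one entry per load instruction, so its capacity bounds the number of distinct references whose strides can be detected at once; this is the source of the third limitation listed above.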
So far we have described how the hardware predicts which locations will be
referenced in the future. In contrast with the software-based
approaches, the hardware typically does not bother to predict whether or
not the data is already in the cache. Instead, for each location it
predicts will be referenced in the future, the hardware simply checks the
primary cache to see whether the data is already present. The obvious
advantage of always probing the cache is that the hardware will never
mistakenly conclude that data is in the cache (and therefore suppress a
prefetch) when in fact it is not.
For example, if the operating system swaps out the process and later
restarts it, a software-based scheme would have no idea that the entire
working set had been flushed from the cache, but a hardware-based scheme
would still prefetch the data correctly. Part of the reason why the
hardware always probes the cache is that doing so involves no instruction
overhead. However, it may cause cache contention overheads, as we will
discuss later in this subsection. In general, the primary disadvantage of
performing the analysis in the hardware is that it cannot detect complex
access patterns such as indirect references.
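As a concrete illustration of this probe-and-filter step, the fragment below issues a prefetch only when the predicted line misses in the cache. It is again only a sketch under assumed parameters, with a trivial direct-mapped tag array standing in for the primary cache.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_SIZE 32          /* assumed cache line size in bytes */
#define NUM_SETS  1024        /* assumed number of cache sets */

static uint64_t tags[NUM_SETS];   /* stand-in for the primary-cache tags */

static bool cache_probe(uint64_t line_addr)
{
    return tags[(line_addr / LINE_SIZE) % NUM_SETS] == line_addr;
}

/* Filter step: issue a prefetch only when the predicted line misses. */
void maybe_prefetch(uint64_t predicted_addr)
{
    uint64_t line = predicted_addr & ~(uint64_t)(LINE_SIZE - 1);
    if (!cache_probe(line))
        printf("prefetch line 0x%llx\n", (unsigned long long)line);
}
\end{verbatim}

Since the probe occurs in hardware on every prediction, it adds no instructions to the program; however, each probe competes with the processor for the cache tags, which is one source of the contention overhead mentioned above.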