The first step in executing a prefetch is translating the virtual data address to a physical address. Address translation is accelerated in modern RISC processors through a ``translation lookaside buffer'' (TLB), which is simply a cache of recent virtual-to-physical address mappings. Hence the first question is whether a prefetch should be dropped if its virtual address does not match an entry in the TLB; otherwise a TLB fault handler must be run, which is a relatively expensive operation.
The answer to this question is complicated by two conflicting goals. On the
one hand, we would like to hide the latency in situations where we are
legitimately suffering frequent TLB misses, and this cannot occur if the
prefetch is dropped. An example
would be code that iterates across the outer dimensions of large matrices,
in which case each reference may be to a unique page. On the other hand,
one of the desirable properties of prefetch instructions (as mentioned
earlier in Section
) is that they are free to reference
invalid addresses, in which case we would like to drop the prefetch with
minimal performance loss. Since TLBs do not contain invalid address
mappings, an invalid address can only be detected by performing full
address translation, hence suffering the cost of a TLB miss (which can
potentially be hundreds of cycles). This second scenario may occur
frequently in code containing pointers and other indirect references, in
which case this TLB miss overhead may be prohibitively expensive.
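To make the two conflicting scenarios concrete, the sketch below uses the GCC/Clang `__builtin_prefetch` intrinsic (the matrix and list structures are hypothetical examples, not taken from any benchmark): a large-stride matrix loop, where TLB misses on prefetch addresses are legitimate latency worth hiding, and a pointer-chasing loop, where the prefetched address may be invalid and the prefetch must be droppable without faulting.

```c
#include <stddef.h>

#define ROWS 4
#define COLS 1024

/* Scenario 1: iterating down a column of a large matrix.  With a
 * large enough matrix, each prefetched element may live on a distinct
 * page, so TLB misses on these prefetches are "legitimate" latency
 * that we would like to hide. */
static long sum_column(long (*m)[COLS], int col) {
    long sum = 0;
    for (int i = 0; i < ROWS; i++) {
        if (i + 1 < ROWS)
            __builtin_prefetch(&m[i + 1][col]);  /* may miss in the TLB */
        sum += m[i][col];
    }
    return sum;
}

/* Scenario 2: pointer chasing.  The next pointer is prefetched before
 * we know whether it refers to a valid node, so the prefetched address
 * may be NULL or garbage; a non-faulting prefetch must be droppable. */
struct node { long val; struct node *next; };

static long sum_list(struct node *p) {
    long sum = 0;
    while (p) {
        __builtin_prefetch(p->next);  /* address may be invalid; must not fault */
        sum += p->val;
        p = p->next;
    }
    return sum;
}
```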
Although choosing between these two goals is difficult, since each is
important given its own scenario, we can start by comparing their expected
frequencies. The case where legitimate TLB misses are occurring frequently
is somewhat unlikely for the following reasons. First, it can only be a
sustained problem for applications having both very large data sizes
and very large (at least a page) strides. Although both of these may occur
in some scientific codes, it is far more common to see smaller strides as
the code iterates through inner dimensions of matrices. Smaller strides are
advantageous since they can exploit spatial locality by reusing cache
lines, and we would expect locality optimizations such as loop
interchange (as demonstrated in Section ) to continue
enhancing this in the future. Second, since legitimate TLB misses would
occur even without prefetching, presumably processor designers have
already dealt with this problem by making the TLB sufficiently large. In
contrast, invalid prefetch addresses may occur in any code
containing indirect references, and potentially quite often.
This is independent of both data size and the number of TLB entries. Also,
given the inherent difficulty in prefetching code containing
recursive data structures (as we encountered with PTHOR and BARNES in
Section
), the additional burden of TLB
miss penalties on invalid addresses is likely to make the task hopelessly
frustrating. Therefore if a default policy must be chosen, it is probably
better to drop prefetches on TLB misses.
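The stride argument above can be illustrated with a loop interchange. In this sketch (a hypothetical matrix traversal, not one of the section's benchmarks), interchanging the loops converts a stride of an entire row into a unit stride, so consecutive inner-loop references reuse the same cache line and TLB entry:

```c
#define N 256

/* Before interchange: the inner loop walks down a column, so
 * consecutive references are N*sizeof(double) bytes apart -- for a
 * large N, each reference may touch a new cache line or even page. */
static double sum_column_major(double (*a)[N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];       /* stride of N elements per iteration */
    return sum;
}

/* After interchange: the inner loop walks along a row, so consecutive
 * references are adjacent and reuse the same cache line (and the same
 * TLB entry until a page boundary is crossed). */
static double sum_row_major(double (*a)[N]) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];       /* unit stride per iteration */
    return sum;
}
```

Both versions compute the same sum; only the memory-access pattern differs.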
An alternative to choosing a fixed policy is to allow the software to
select the more appropriate policy by making use of the prefetch hint bits
described earlier in Section . For example, there could be
two types of prefetches: ``speculative'' prefetches, which should be
dropped on TLB misses since the address may be invalid, and
``non-speculative'' prefetches, where it is better to suffer the TLB miss
for the sake of hiding the latency. This approach satisfies both goals,
and may lead to the best overall performance.
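A minimal software model of this two-flavor policy might look as follows; the hint encoding, the toy TLB, and all names here are hypothetical, purely to illustrate the drop-versus-translate decision:

```c
#include <stdbool.h>

enum prefetch_hint { PF_SPECULATIVE, PF_NONSPECULATIVE };

#define TLB_ENTRIES 8
#define PAGE_SHIFT  12

/* Toy fully-associative TLB: a list of recently translated page numbers. */
static unsigned long tlb[TLB_ENTRIES];
static int tlb_valid[TLB_ENTRIES];

static bool tlb_hit(unsigned long vaddr) {
    unsigned long vpn = vaddr >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb_valid[i] && tlb[i] == vpn)
            return true;
    return false;
}

/* Returns true if the prefetch is issued, false if it is dropped.
 * A speculative prefetch is dropped on a TLB miss, since its address
 * may be invalid; a non-speculative prefetch pays for full translation
 * so that the memory latency can still be hidden. */
static bool issue_prefetch(unsigned long vaddr, enum prefetch_hint hint) {
    if (tlb_hit(vaddr))
        return true;                 /* translation is cheap: always issue */
    if (hint == PF_SPECULATIVE)
        return false;                /* drop rather than risk a long miss */
    /* Non-speculative: run the (expensive) TLB fill, then issue. */
    static int next;
    tlb[next] = vaddr >> PAGE_SHIFT;
    tlb_valid[next] = 1;
    next = (next + 1) % TLB_ENTRIES;
    return true;
}
```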
Once a valid physical address has been computed for a prefetch, it is
ready to be issued to the memory subsystem. The mechanics of how a prefetch
normally proceeds through the memory subsystem will be discussed later in
Section . However, even after a prefetch has
been issued to the memory subsystem, it is still possible to abort it
before it completes. The scenario where this might make sense is when the
memory subsystem queues are already full, and the prefetch cannot proceed
without stalling the processor, as we will discuss next.
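As a preview of that discussion, the drop-on-full-queue policy can be sketched as a simple guard; the buffer size and function names here are hypothetical, standing in for whatever outstanding-miss bookkeeping the memory subsystem actually uses:

```c
#include <stdbool.h>

#define MSHR_ENTRIES 4   /* outstanding-request buffer size (hypothetical) */

static int outstanding;  /* prefetches currently in the memory subsystem */

/* A prefetch is dropped rather than stalling the processor when the
 * memory subsystem's queues are already full; a demand reference, by
 * contrast, would have to stall and wait for a free entry. */
static bool try_issue_prefetch(void) {
    if (outstanding >= MSHR_ENTRIES)
        return false;       /* queues full: abort the prefetch, do not stall */
    outstanding++;          /* entry reserved; freed when the fill returns */
    return true;
}

static void prefetch_completed(void) {
    if (outstanding > 0)
        outstanding--;
}
```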