To experiment with prefetching, we extend our base architecture as follows. We augment the instruction set to include a prefetch instruction that uses a base-plus-offset addressing format and is defined not to take any memory exceptions. The advantages of these properties will be discussed in more detail later in Section , but the basic idea is that base-plus-offset addressing minimizes register usage to avoid spilling, and the non-excepting property allows considerable flexibility in scheduling prefetches (e.g., it is acceptable to prefetch off the end of an array if a proper epilog cannot be constructed). Both levels of the cache are lockup-free [45] in the sense that multiple prefetches can be outstanding along with a single load or store miss. The primary cache is checked in the cycle the prefetch instruction is executed. If the line is already in the cache, the prefetch is discarded. Otherwise, the prefetch is sent to a prefetch issue buffer, which is a structure that maintains the state of outstanding prefetches. For our study, we assume a rather aggressive design of a prefetch issue buffer that contains 16 entries. If the prefetch issue buffer is already full, the processor is stalled until there is an available entry. (Later, in Section , we compare this with an architecture where prefetches are simply dropped if the buffer is full.) The secondary cache is also checked before the prefetch goes to memory. We model contention for the memory bus by assuming a maximum pipelining rate of one access every 20 cycles. Once the prefetched line returns, it is placed in both levels of the cache hierarchy. Filling the primary cache requires 4 cycles of exclusive access to the cache tags; during this time, the processor cannot execute any loads or stores, and it is stalled if it attempts to do so.
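The path a prefetch takes through this model can be illustrated with a minimal sketch. It is not the simulator used in our study. The names `PrefetchIssueBuffer`, `execute_prefetch`, `issue_oldest`, and `fill` are hypothetical, caches are represented as plain sets of line addresses with no capacity limit, and two details the text does not specify are assumed: the buffer is drained in FIFO order, and an entry is released when its prefetch is issued.

```python
class PrefetchIssueBuffer:
    """Holds prefetches that missed in the primary cache and await issue."""

    def __init__(self, entries=16):      # 16 entries, as assumed in the text
        self.entries = entries
        self.pending = []                # FIFO order is an assumption

    def is_full(self):
        return len(self.pending) >= self.entries


def execute_prefetch(line, primary, buf):
    """Model the cycle in which a prefetch instruction executes."""
    if line in primary:
        return "discarded"               # line already in the primary cache
    if buf.is_full():
        return "stall"                   # processor waits for a free entry
    buf.pending.append(line)
    return "buffered"


def issue_oldest(buf, secondary):
    """Issue the oldest buffered prefetch; the secondary cache is checked
    before the request goes to memory. Releasing the entry here is an
    assumption of this sketch."""
    line = buf.pending.pop(0)
    source = "secondary" if line in secondary else "memory"
    return line, source


def fill(line, primary, secondary):
    """A returning line is placed in both levels of the hierarchy."""
    primary.add(line)
    secondary.add(line)
```

In this sketch, a caller that receives `"stall"` would retry the same prefetch in a later cycle, after `issue_oldest` has freed an entry. The alternative design, in which prefetches are dropped when the buffer is full, would correspond to returning without retrying.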
Since regular cache misses stall the processor, they are given priority over prefetch accesses both for the memory bus and the cache tags. We assume, however, that an ongoing prefetch access cannot be interrupted. As a result, a secondary cache miss may be delayed by as many as 20 cycles (memory pipeline occupancy time) when it tries to access memory. Similarly, the processor may be stalled for up to 4 cycles (cache-tag busy time) when it executes a load or store. If a cache miss occurs for a line for which there is an outstanding prefetch waiting in the issue buffer, the miss is given immediate priority and the prefetch request is removed from the buffer. If the prefetch has already been issued to the memory system, any partial latency hiding that might have occurred is taken into account.
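The interaction between demand misses and prefetches can be sketched as follows. This is an illustration under stated assumptions rather than the model's actual implementation: the function names are hypothetical, `memory_latency` is a hypothetical parameter (this section does not give the miss latency), and "partial latency hiding" is interpreted as subtracting the cycles that have elapsed since the prefetch was issued.

```python
BUS_OCCUPANCY = 20   # cycles a memory access occupies the bus (from the text)
TAG_BUSY = 4         # cycles a primary-cache fill holds the tags (from the text)


def wait_for_resource(now, prefetch_started_at, occupancy):
    """Cycles a demand access arriving at `now` waits behind an ongoing,
    non-interruptible prefetch access that began at `prefetch_started_at`.
    Pass None if no prefetch access is in progress."""
    if prefetch_started_at is None:
        return 0
    busy_until = prefetch_started_at + occupancy
    return max(0, busy_until - now)


def demand_miss(line, now, waiting, in_flight, memory_latency):
    """Latency seen by a demand miss on `line` at cycle `now`.

    waiting:   list of lines still waiting in the prefetch issue buffer
    in_flight: dict mapping line -> cycle its prefetch was issued to memory
    """
    if line in waiting:
        # The miss takes priority; the buffered prefetch is removed.
        waiting.remove(line)
        return memory_latency
    if line in in_flight:
        # Assumed interpretation: the miss pays only the remaining latency.
        elapsed = now - in_flight[line]
        return max(0, memory_latency - elapsed)
    return memory_latency
```

For example, a secondary cache miss that arrives just as a prefetch has claimed the bus waits the full `BUS_OCCUPANCY` of 20 cycles, which is the worst case described above, while a load that arrives two cycles into a primary-cache fill waits the remaining 2 of the 4 tag-busy cycles.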