A lockup-free cache is a common requirement for most latency-hiding techniques, including prefetching, relaxed consistency models, non-blocking loads, and multithreading. The complexity of implementing a lockup-free cache depends on which of these techniques it is intended to support (as described in detail by Laudon [52]). For example, if the goal is simply to support multiple outstanding prefetches, then it is not strictly necessary for the processor to maintain state on outstanding transactions, as long as the cache is prepared to receive prefetch responses from outside the processor while the processor may be simultaneously issuing new requests. In contrast, supporting multiple outstanding stores (as with relaxed consistency models) or loads (if they are non-blocking) does require that special state be maintained for each outstanding access. For stores, the stored data must be merged into the cache line when it returns. For loads, the requested data must be forwarded directly to a register, which requires state to associate each outstanding access with the register(s) waiting for the value, and any future uses of that register must interlock if the value has not yet returned.
Kroft [45] presented the original lockup-free cache design, which adds structures called ``miss information/status holding registers'' (MSHRs) to keep track of outstanding misses. Each MSHR contains enough state to handle one or more accesses of any type to a single memory line. Due to the generality of the MSHR mechanism, the amount of state involved is non-trivial, including the address, pointers to the cache entry and destination register, written data, and various other fields. The majority of subsequent lockup-free cache proposals have been variations of this original MSHR scheme [46][79][70][63]. An alternative approach is to maintain the state of outstanding misses in the cache tag array itself [52][17], thereby permitting a larger number of outstanding misses.
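As a rough, hypothetical illustration (not drawn from Kroft's actual design), the following C sketch shows the kind of per-miss state an MSHR entry might hold; the field names, widths, and the assumption of a single destination register per access are ours.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 32                /* assumed cache line size (bytes)     */

typedef struct {
    bool     valid;                 /* entry tracks an outstanding miss    */
    uint32_t line_addr;             /* address of the missing memory line  */
    int      cache_index;           /* where to place the line on return   */
    int      dest_reg;              /* register awaiting a load's data     */
    uint8_t  write_data[LINE_SIZE]; /* store data to merge into the line   */
    uint32_t write_mask;            /* which bytes of write_data are valid */
} mshr_entry_t;
\end{verbatim}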
For the uniprocessor architecture used in Chapter , the lockup-free cache supports a single load or store miss (since loads and stores directly stall the processor) and up to sixteen prefetch misses. The state of outstanding prefetch misses is maintained in a sixteen-entry prefetch issue buffer. This structure is simpler than a normal MSHR because it contains only the prefetch addresses and any state necessary to track or control outstanding misses. It does not contain any data, since the prefetched data is placed directly into the cache hierarchy.
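Continuing the same C sketch, a prefetch issue buffer entry might look as follows; the fields are again illustrative. Unlike an MSHR, it holds neither data nor a destination register.

\begin{verbatim}
#define PIB_ENTRIES 16

typedef struct {
    bool     valid;      /* entry holds an outstanding prefetch        */
    uint32_t line_addr;  /* address being prefetched                   */
    bool     issued;     /* handed to an MSHR / the memory system yet? */
} pib_entry_t;

static pib_entry_t pib[PIB_ENTRIES];   /* the prefetch issue buffer */
\end{verbatim}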
To illustrate how the prefetch issue buffer is used, we will walk through the steps of processing a prefetch. When the processor executes a prefetch instruction, the primary data cache is checked during that cycle. If the data is already present in the cache, then the prefetch is considered completed at that point. Otherwise, the prefetch address is compared against the addresses already in the prefetch issue buffer. If there is a match, then the prefetch is dropped. Otherwise, a new entry is allocated in the prefetch issue buffer for the new prefetch. At this point, the prefetch will proceed through the memory hierarchy to first find the data and then place it in the primary cache, as was discussed in Section . An entry in the prefetch issue buffer is deallocated as soon as the prefetch completes (as opposed to waiting for the data to actually be referenced).
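The C sketch below summarizes these steps, building on the pib_entry_t sketch above; cache_contains() and issue_to_memory() are hypothetical stand-ins for the primary-cache lookup and the memory-system interface.

\begin{verbatim}
bool cache_contains(uint32_t line_addr);   /* primary-cache tag check   */
void issue_to_memory(pib_entry_t *entry);  /* hand off to the hierarchy */

void handle_prefetch(uint32_t line_addr)
{
    /* 1. A hit in the primary data cache completes the prefetch. */
    if (cache_contains(line_addr))
        return;

    /* 2. A match in the prefetch issue buffer means the line is
       already being fetched, so the new prefetch is dropped.     */
    for (int i = 0; i < PIB_ENTRIES; i++)
        if (pib[i].valid && pib[i].line_addr == line_addr)
            return;

    /* 3. Otherwise allocate an entry and send the prefetch down
       the memory hierarchy; the entry is freed when the prefetch
       completes, not when the data is eventually referenced.     */
    for (int i = 0; i < PIB_ENTRIES; i++) {
        if (!pib[i].valid) {
            pib[i].valid = true;
            pib[i].line_addr = line_addr;
            pib[i].issued = false;
            issue_to_memory(&pib[i]);
            return;
        }
    }
    /* Buffer full: how this is handled (dropping or stalling the
       processor) is outside the scope of this sketch.            */
}
\end{verbatim}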
Although a prefetch issue buffer is not strictly necessary, it offers the following performance advantages. First, in cases where the prefetch miss cannot be serviced immediately (perhaps because of a limited number of MSHRs), it allows the processor to continue executing by buffering the prefetch request. Second, by keeping track of the prefetches that are already outstanding, it provides a mechanism for merging subsequent prefetch and memory references to matching cache lines. Merging a prefetch with a previous prefetch simply means dropping the subsequent prefetch. This helps to avoid unnecessary bandwidth consumption further down the memory hierarchy, similar to the benefit of checking the primary cache when searching for the prefetched data (as discussed earlier in Section ). However, with selective prefetching, the compiler is generally good at avoiding back-to-back redundant prefetches (at least for these scientific codes). The more important benefit occurs when a load is merged with a previous prefetch that was partially completed, since this allows the load to benefit from whatever partial latency-hiding the prefetch has been able to accomplish. This is especially important for cases where data or control-flow dependencies make it impossible to issue prefetches early enough to hide the entire memory latency, which can occur frequently in pointer-based codes such as PTHOR (e.g., when attempting to prefetch a linked list).
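To make the load-merging case concrete, the sketch below shows how a load miss that matches an outstanding prefetch would stall only for the uncovered portion of the miss latency; remaining_latency() is a hypothetical helper that reports how many cycles remain before that prefetch completes.

\begin{verbatim}
int remaining_latency(const pib_entry_t *entry);  /* cycles left for this prefetch */

int load_miss_stall(uint32_t line_addr, int full_miss_latency)
{
    for (int i = 0; i < PIB_ENTRIES; i++)
        if (pib[i].valid && pib[i].line_addr == line_addr)
            return remaining_latency(&pib[i]);  /* only the uncovered latency */
    return full_miss_latency;                   /* no prefetch in flight */
}
\end{verbatim}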
Figure shows how the prefetch issue buffer is incorporated into our uniprocessor architecture. In this figure, we have shown distinct MSHRs to allow for the possibility that misses are handled separately from the buffering of prefetch requests. Note that in our model, a prefetch remains in the prefetch issue buffer until it is completed by its MSHR. In our experiments so far, we have assumed a sixteen-entry prefetch issue buffer and seventeen MSHRs (one for each prefetch and one for either a load or a store). We chose these generous sizes to minimize their effect on performance. We will now examine how many prefetch issue buffer entries and MSHRs are actually needed for our benchmarks.
Before running these experiments with reduced prefetch buffering and MSHR capacities, we can gain insight into what to expect by examining the number of outstanding prefetches in the original ``full-capacity'' architecture. Figure plots, for the original architecture, the distribution of the number of prefetch misses already outstanding at the time of each prefetch miss. As we see in Figure , the distributions vary considerably across applications. At one extreme is EP, where there is almost never more than one outstanding prefetch. At the other extreme are programs like CHOLSKY, where the distribution is spread out over larger numbers of previously outstanding prefetches (i.e., the distribution has a significant tail). We would expect these latter types of applications to benefit from having multiple MSHRs and deep prefetch buffers.
To measure the actual impact on performance, we first varied the number of MSHRs while holding the depth of the prefetch issue buffer fixed at sixteen entries. Figure shows the results of these experiments. A single MSHR corresponds to a ``blocking'' cache (i.e., one that is not lockup-free), while seventeen MSHRs corresponds to the original architecture. Note that with fewer than seventeen MSHRs, there may potentially be more outstanding miss requests (up to seventeen) than can be serviced at a given time, in which case prefetch misses must compete with load and store misses for the MSHRs. We resolved such cases by giving load and store misses (which directly stall the processor) priority for the next available MSHR; otherwise, the oldest unserviced prefetch is awarded the next available MSHR.
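The sketch below captures this arbitration policy; the miss_request_t representation and its fields are illustrative rather than part of the simulated hardware.

\begin{verbatim}
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     pending;     /* waiting for an MSHR                       */
    bool     is_demand;   /* load or store miss (stalls the processor) */
    long     issue_time;  /* cycle the request was made (for ordering) */
    uint32_t line_addr;
} miss_request_t;

/* Returns the index of the request to receive the next free MSHR,
   or -1 if nothing is pending.  Demand misses win; among prefetches,
   the oldest unserviced one is chosen.                                */
int pick_next_for_mshr(const miss_request_t *req, int n)
{
    int oldest_prefetch = -1;
    for (int i = 0; i < n; i++) {
        if (!req[i].pending)
            continue;
        if (req[i].is_demand)
            return i;
        if (oldest_prefetch < 0 ||
            req[i].issue_time < req[oldest_prefetch].issue_time)
            oldest_prefetch = i;
    }
    return oldest_prefetch;
}
\end{verbatim}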
As we see in Figure , the benefits of a lockup-free cache are often substantial. Seven of the thirteen applications showed at least a 15% performance improvement by going from one to two MSHRs. In six of those seven cases, there was at least a 5% additional improvement from having four MSHRs. The benefits of moving from four to seventeen MSHRs were typically quite small. These performance improvements correspond to what we would expect from the distributions shown in Figure . In particular, the largest improvements occur in the benchmarks with the largest tails in their distributions (e.g., CHOLSKY, GMTRY, VPENTA, and CG). Therefore we see that it is important to have a lockup-free cache, and that support for four outstanding misses is sufficient to capture most of the benefits for these applications.
Given that four MSHRs are sufficient, the next question is whether buffering additional prefetch requests beyond that is advantageous, or whether the prefetch issue buffer should also be reduced to four entries. To evaluate this, we simulated the performance when the number of prefetch issue buffer entries is reduced, while holding the number of MSHRs fixed at four. Figure shows the two cases (CHOLSKY and VPENTA) where this caused a noticeable difference in performance (in six of the other cases, there was less than a 2% difference between four and eight buffer entries, and no difference between eight and sixteen buffer entries). CHOLSKY showed the largest improvement, with a 6% speedup when going from four to eight buffer entries, and a further 5% speedup with sixteen entries. VPENTA showed a 5% speedup between four and eight entries, and an additional 1% speedup with sixteen entries. Therefore we see that additional prefetch buffering is beneficial in a small number of cases. However, given that this benefit is relatively small and infrequent, the hardware cost would also have to be quite small for it to be justified.