In this section, we evaluate the benefits of exclusive-mode prefetching, which helps to reduce both the miss latencies and the message traffic associated with writes. Unlike read misses, which directly stall the processor for their entire duration, write misses affect performance more indirectly, since writes can be buffered. A processor stalls while waiting for writes to complete in two situations: (i) when executing a write instruction if the write buffer is full, and (ii) during a read miss if previous writes must complete before the read miss can proceed. The impact of the former effect can be reduced through larger write buffers. Throughout this study, we use 16-entry write buffers, which we have found to be large enough to make the full-buffer effect negligible. The impact of the latter effect depends on whether reads are permitted to bypass writes (as allowed by the release consistency model), and whether the cache permits multiple outstanding accesses (as allowed by a lockup-free cache).
As we described earlier in Section , our compiler
uses an exclusive-mode (rather than a shared-mode) prefetch whenever any member
of an equivalence class (i.e., a set of references that share group
locality) is a write. This catches the important read-modify-write cases,
and potentially eliminates as much as half of the message traffic. Table
shows the fraction of prefetches that were
exclusive-mode for each of the applications. To evaluate the case where
exclusive-mode prefetches are not available, we replace each exclusive-mode
prefetch with a normal ``shared-mode'' prefetch of the same address. Since the
multiprocessor architecture we have used so far includes both release
consistency (which allows writes to be buffered and reads to bypass pending
writes) and lockup-free caches, write latency has no direct impact on
performance. Consequently, exclusive-mode prefetching has a negligible
performance impact on this architecture. It does, however, reduce the amount of
message traffic, as shown in Table
. If the
architecture were bandwidth-limited (which in our case it is not), then this
reduction in message traffic could have a direct payoff in improved
performance.
To evaluate the benefit of exclusive-mode prefetching in an architecture
where write latency is not already completely hidden, we performed the
same experiment on an architecture that uses sequential consistency
rather than release consistency. With this stricter consistency model, the
processor must stall after every shared access until that access completes.
(We will discuss consistency models in greater detail later in Section
.) The results of this experiment are shown in
Figure
. Notice that the memory stall time in
Figure
has been broken down further into write
stall time and read stall time (under the release consistency model
assumed so far in this chapter, nearly all of the memory stall time is read
stall time).
Figure shows that exclusive-mode prefetching can result
in dramatic performance improvements in an architecture using sequential
consistency: OCEAN and MP3D achieved speedups of 73% and 37%,
respectively. The speedups for CHOLESKY and LOCUS were understandably
smaller (10% and 3%) since they make less use of exclusive-mode
prefetching, as shown in Table
. In the case of
LU, the write latency is small to begin with since the processors only
write to their local columns, which tend to fit in the secondary caches.
In summary, exclusive-mode prefetching can provide significant performance benefits in architectures that have not already eliminated write stall times through aggressive implementations of weaker consistency models with lockup-free caches. Even if write stall times cannot be further reduced, exclusive-mode prefetching can improve performance somewhat by reducing the traffic associated with cache coherency.