Although relaxed consistency models are effective at eliminating write
latency, they do not address the problem of read latency. While prefetching
is one technique for hiding read latency, another is for the processor to
support multiple hardware contexts
[85][73][39][36][3] (also known as multithreading). As we mentioned earlier in
Section , multithreading has two
advantages over prefetching. First, it can handle arbitrarily complex
access patterns, even cases where it is impossible to predict the accesses
ahead of time (and therefore prefetching cannot succeed). This is because
multithreading simply reacts to misses once they occur, rather than
attempting to predict them. Multithreading tolerates latency by attempting
to overlap the latency of one context with the computation of other
concurrent contexts. The second advantage of multithreading is that it
requires no software support (assuming the code is already parallelized),
which, as we mentioned in the previous section, is an advantage only if the
user is unwilling or unable to recompile old code. Multithreading has three
limitations: (i) it relies on additional concurrency within an application,
which may not exist; (ii) some amount of time is lost when switching
between contexts; and (iii) to minimize context-switching overheads, a
significant amount of hardware support is necessary. In this section, we
will evaluate multithreading and explore its interactions with
software-controlled prefetching.
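To make this switch-on-miss behavior concrete, the short sketch below steps two
contexts through a sequence of misses, paying a switch penalty at each miss and
overlapping one context's miss latency with the other context's computation. The
run length, miss latency, and switch cost are illustrative assumptions rather
than the parameters of the architecture we simulate.

    /* Minimal sketch of switch-on-miss multithreading: when the running
     * context misses, the processor pays a switch penalty and resumes the
     * other context, overlapping one context's miss latency with the other
     * context's computation.  All parameters are illustrative assumptions. */
    #include <stdio.h>

    #define RUN_LEN      50   /* cycles of computation between misses (assumed) */
    #define MISS_LAT    100   /* miss latency in cycles (assumed)               */
    #define SWITCH_COST   4   /* context switch overhead in cycles (assumed)    */

    int main(void)
    {
        long cycle = 0;
        long miss_ready_at[2] = { 0, 0 };  /* cycle at which each miss resolves */
        int  cur = 0;

        for (int step = 0; step < 6; step++) {
            /* the current context computes until its next cache miss */
            printf("cycle %4ld: context %d runs for %d cycles, then misses\n",
                   cycle, cur, RUN_LEN);
            cycle += RUN_LEN;
            miss_ready_at[cur] = cycle + MISS_LAT;

            /* switch to the other context instead of stalling on the miss */
            int other = 1 - cur;
            cycle += SWITCH_COST;
            if (miss_ready_at[other] > cycle) {
                /* the other context's miss has not returned yet: the latency
                 * is only partially hidden and the processor idles */
                printf("cycle %4ld: context %d still stalled; idling %ld cycles\n",
                       cycle, other, miss_ready_at[other] - cycle);
                cycle = miss_ready_at[other];
            }
            cur = other;
        }
        return 0;
    }
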
The performance improvement offered by multithreading depends on several factors. The first is the number of contexts. With more contexts available, the processor is less likely to run out of ready-to-run contexts. However, the number of contexts is constrained by hardware cost and by the parallelism available in the application. Previous studies have shown that, given processor caches, the interval between long-latency operations (i.e., cache misses) becomes fairly large, allowing just a handful of contexts to hide most of the latency [85]. The second factor is the context switch overhead. If the overhead is a sizable fraction of the typical run length (the time between misses), a significant fraction of time may be wasted switching contexts. Shorter context switch times, however, require a more complex processor. The third factor is application behavior. Applications with clustered misses and irregular miss latencies make it difficult to completely overlap the computation of one context with the memory accesses of other contexts. A multithreaded processor will thus achieve lower utilization on such programs than on applications with more regular miss behavior. Finally, the multiple contexts themselves affect the performance of the memory subsystem. The different contexts share a single processor cache and can interfere with each other, both constructively (by effectively prefetching another context's working set) and destructively (by displacing another context's working set). Also, as is the case with release consistency and prefetching, the memory system is more heavily loaded under multithreading, and thus latencies may increase.
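To illustrate how the first two factors interact, the sketch below evaluates a
common first-order bound on processor utilization: with N contexts, run length
R, switch overhead C, and miss latency L, utilization is roughly
min(N*R/(R+C+L), R/(R+C)), growing with N until the remaining contexts can cover
the full miss latency and limited thereafter only by the switch overhead. This
back-of-the-envelope model deliberately ignores the third and fourth factors
(clustered misses, cache interference, and memory-system loading); the run
length and latency values are assumptions, while the switch overheads and
context counts correspond to the configurations described below.

    /* First-order bound on multithreaded processor utilization:
     *   N contexts, run length R cycles between misses, switch cost C,
     *   miss latency L.  Until the other contexts can cover the full miss
     *   latency, utilization grows roughly linearly with N; once they can,
     *   only the switch overhead is lost.  R and L are assumed values. */
    #include <stdio.h>

    static double utilization(int N, double R, double C, double L)
    {
        double unsaturated = N * R / (R + C + L);  /* too few contexts: processor idles   */
        double saturated   = R / (R + C);          /* enough contexts: only switches lost */
        return unsaturated < saturated ? unsaturated : saturated;
    }

    int main(void)
    {
        double R = 50.0, L = 100.0;      /* assumed run length and miss latency          */
        int overheads[] = { 4, 16 };     /* the switch costs evaluated in this section   */
        int contexts[]  = { 1, 2, 4 };   /* the context counts evaluated in this section */

        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 3; j++)
                printf("C = %2d cycles, N = %d: utilization <= %.2f\n",
                       overheads[i], contexts[j],
                       utilization(contexts[j], R, (double)overheads[i], L));
        return 0;
    }
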
In this study, we use processors with two and four contexts. We do not consider
more contexts per processor because sixteen 4-context processors require 64
parallel threads, and some of our applications do not achieve very good speedup
with that many threads. We use two different context switch overheads: 4 and 16
cycles. A four-cycle context switch overhead
corresponds to flushing/loading a short RISC pipeline when switching to the new
instruction stream. An overhead of sixteen cycles corresponds to a less
aggressive implementation. We also include additional buffers to avoid
thrashing and deadlock when two contexts try to read distinct memory lines that
map to the same cache line. All of these experiments assume an RC model.
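As a purely illustrative example of the conflict these buffers guard against,
the sketch below checks whether two distinct memory lines, one per context, fall
into the same set of a direct-mapped cache; if they do, each context's fill
could displace the line the other is waiting on. The cache geometry and
addresses are hypothetical and are not the parameters of our simulated memory
system.

    /* Illustration of the conflict the extra buffers guard against: two
     * contexts reference distinct memory lines that fall into the same set of
     * a direct-mapped cache, so each context's fill could displace the line
     * the other is waiting on.  Geometry and addresses are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE   64u     /* bytes per cache line (assumed)            */
    #define NUM_SETS  1024u     /* direct-mapped: one line per set (assumed) */

    static uint32_t line_addr(uint32_t addr) { return addr / LINE_SIZE; }
    static uint32_t set_index(uint32_t addr) { return line_addr(addr) % NUM_SETS; }

    int main(void)
    {
        uint32_t a = 0x00010040u;   /* line read by context 0 (hypothetical) */
        uint32_t b = 0x00020040u;   /* line read by context 1 (hypothetical) */

        if (line_addr(a) != line_addr(b) && set_index(a) == set_index(b))
            printf("distinct lines, same set %u: without extra buffering, each\n"
                   "context's fill could evict the other's, and the two contexts\n"
                   "could thrash or deadlock on this cache location\n",
                   (unsigned)set_index(a));
        return 0;
    }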