One way to tolerate memory latency is to allow references to be buffered and pipelined. In current uniprocessor systems, this technique is typically applied only to writes, in the form of write buffers. Write buffers exploit the fact that a processor does not have to wait for a write to complete as long as it properly observes the effect of the written data in the future. Therefore, the processor can perform a write simply by issuing it to the write buffer, provided that future reads check the write buffer for matching addresses. The advantage of a write buffer is not only that the processor does not stall when executing a write, but also that multiple writes can be overlapped to exploit pipelining.
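As a concrete illustration, the following C sketch models the bookkeeping a simple write buffer performs; the names (write_buffer, wb_issue_write, wb_read) and the FIFO organization are illustrative assumptions, not a description of any particular design. A write is enqueued and the processor continues immediately, and every read first checks the buffered entries for a matching address before going to memory.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define WB_ENTRIES 8  /* capacity of the (hypothetical) write buffer */

typedef struct {
    uint32_t addr[WB_ENTRIES];
    uint32_t data[WB_ENTRIES];
    size_t   head, count;      /* FIFO: oldest entry drains to memory first */
} write_buffer;

/* Issue a write: enqueue it and return immediately; the processor does
 * not wait for memory. (Draining entries to memory happens later.) */
static bool wb_issue_write(write_buffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES)
        return false;          /* buffer full: only now would the CPU stall */
    size_t tail = (wb->head + wb->count) % WB_ENTRIES;
    wb->addr[tail] = addr;
    wb->data[tail] = data;
    wb->count++;
    return true;
}

/* A read must check the buffer for the *youngest* matching write, so the
 * processor observes its own writes before they have reached memory. */
static uint32_t wb_read(write_buffer *wb, uint32_t addr,
                        uint32_t (*mem_read)(uint32_t)) {
    for (size_t i = wb->count; i-- > 0; ) {
        size_t idx = (wb->head + i) % WB_ENTRIES;
        if (wb->addr[idx] == addr)
            return wb->data[idx];   /* forwarded from the write buffer */
    }
    return mem_read(addr);          /* no match: go to memory as usual */
}
```

The full-buffer case in wb_issue_write marks the one point where this scheme stops hiding latency: the processor stalls only when writes are issued faster than the buffer drains.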
Buffering read accesses is more difficult because, unlike writes, the processor typically cannot proceed until the read access completes: it needs the data being read. With non-blocking loads and a lockup-free cache [45], however, it is possible to buffer and pipeline reads. A non-blocking load means that rather than stalling at the time the load is issued, the processor postpones stalling until the loaded data is actually used. A lockup-free cache permits multiple outstanding cache misses. Combining the two makes it possible to buffer multiple reads and to pipeline their accesses. However, very few commercial microprocessors currently support non-blocking loads, due to the complexity involved, and in practice the use of a load value typically occurs shortly after the load itself. Therefore, tolerating read latency through buffering and pipelining alone is not especially promising.
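At the source level, the benefit can be suggested by restructuring code so that several independent loads are issued before any of their values are consumed; on a machine with non-blocking loads and a lockup-free cache, the corresponding misses can then be in flight simultaneously. The following loop is a hypothetical example; the overlap itself comes from the hardware and the instruction scheduler, not from the C code as such.

```c
#include <stddef.h>

/* Naive: each load's value is used immediately, so with blocking loads
 * the processor stalls on every cache miss, one miss at a time. */
long sum_naive(const long *a, const long *b, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];   /* load a[i], load b[i], then use both */
    return s;
}

/* Restructured: several independent loads are issued before any value
 * is consumed. With non-blocking loads and a lockup-free cache, the
 * misses on a[i..i+3] and b[i..i+3] can all be outstanding at once. */
long sum_scheduled(const long *a, const long *b, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        long a0 = a[i],   b0 = b[i];      /* issue eight loads ... */
        long a1 = a[i+1], b1 = b[i+1];
        long a2 = a[i+2], b2 = b[i+2];
        long a3 = a[i+3], b3 = b[i+3];
        s += a0*b0 + a1*b1 + a2*b2 + a3*b3;  /* ... before any use */
    }
    for (; i < n; i++)      /* remainder loop */
        s += a[i] * b[i];
    return s;
}
```

The naive version serializes its misses precisely because each load's value is used by the very next operation, the behavior the paragraph above describes as typical in practice.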
Buffering and pipelining accesses in a multiprocessor is complicated by the restrictions that the memory consistency model places on the ordering of accesses issued by different processors. In the strictest case, known as sequential or strong consistency [50], all accesses to shared data must appear as though the operations of the different processes were interleaved on a sequential machine. While conceptually intuitive and elegant, sequential consistency imposes severe restrictions on the accesses a process may have outstanding, thus limiting the buffering and pipelining allowed. In contrast, relaxed consistency models [2][18][26][27] permit accesses to be buffered and pipelined, provided that explicit synchronization events are identified and ordered properly. Once again, however, the main benefit of these relaxed consistency models is hiding write latency [26]. To address read latency effectively, we must look beyond buffering and pipelining.
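The distinction can be made concrete with C11 atomics, which postdate the models cited above but capture the same contract: ordinary data accesses may be buffered and pipelined freely, and only explicitly identified synchronization accesses are ordered. In this sketch, the release store and acquire load on flag play the role of the synchronization events; under sequential consistency, every access would instead behave as if performed with memory_order_seq_cst.

```c
#include <stdatomic.h>
#include <stdbool.h>

long data[4];                    /* ordinary shared data */
atomic_bool flag = false;        /* explicitly identified synchronization */

/* Producer: the data writes carry no ordering requirement among
 * themselves, so the hardware may buffer and pipeline them. The
 * release store to flag is the one access that must be ordered: it
 * cannot be performed until the data writes have completed. */
void produce(void) {
    for (int i = 0; i < 4; i++)
        data[i] = i;                                   /* may be reordered */
    atomic_store_explicit(&flag, true, memory_order_release);
}

/* Consumer: the acquire load of flag orders the subsequent data reads,
 * so the consumer is guaranteed to observe the producer's writes. */
long consume(void) {
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                              /* spin on the flag */
    long s = 0;
    for (int i = 0; i < 4; i++)
        s += data[i];
    return s;
}
```

Note that the relaxation chiefly helps the producer, whose data writes can be buffered and overlapped; the consumer still stalls on its reads, which is exactly why read latency calls for techniques beyond buffering and pipelining.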