One way to cope with the latency of cache misses suffered by loads and stores is to buffer and pipeline their accesses. However, due to features of large-scale multiprocessors such as caches, distributed memory, and general interconnection networks, it is likely that multiple accesses issued by a processor will be performed out of order. This may lead to incorrect program behavior if the program depends upon accesses completing in a certain order. Therefore it may be necessary to restrict the types of buffering and pipelining that are permitted. These restrictions are dictated by the memory consistency model supported by the multiprocessor.
Several memory consistency models have been proposed. The strictest model is that of sequential consistency (SC) [50]. It requires the execution of a parallel program to appear as some interleaving of the execution of the parallel processes on a sequential machine. While conceptually intuitive, this model imposes severe restrictions on the buffering and pipelining of memory accesses. One of the least strict models is the release consistency model (RC) [27]. It requires that synchronization accesses in the program be identified and classified as either acquires (e.g., locks) or releases (e.g., unlocks). An acquire is a read operation (which can be part of a read-modify-write) that gains permission to access a set of data, while a release is a write operation that gives away such permission. This information is used to provide flexibility in the buffering and pipelining of accesses between synchronization points. The main advantage of the relaxed models is the potential for increased performance. The main disadvantage is increased hardware complexity and a more complex programming model. In this section we will evaluate the benefits of relaxed consistency models and explore their interaction with prefetching.
The multiprocessor architecture we have been using so far uses RC. Writes are buffered, and reads are allowed to bypass pending writes. With the lockup-free cache, a single read miss and an arbitrary number of write misses (limited only by the write buffer size) may be processed simultaneously. Therefore the latency of writes should have no direct impact on performance under our implementation of RC. The SC implementation that we evaluate enforces the model by ensuring that the memory accesses from each process complete in the order in which they appear in the program. This is achieved by delaying the issue of an access until the previous access completes. Since the processor already stalls on reads until they complete, the only modification necessary to satisfy SC is to explicitly stall after every write until the write access completes.
The figure shows the performance of the multiprocessor applications under the SC and RC models. The memory stall time in these bars has been broken down further into read stalls and write stalls. Comparing the cases without prefetching under the two models, the main performance impact of RC is to eliminate the time spent stalled on writes. In several cases (e.g., OCEAN, LU, and MP3D) this results in dramatic performance improvements (more than 40%). The pipelining of writes under RC also reduces synchronization stall times somewhat by allowing release operations (e.g., unlocks) to be propagated faster. While relaxing the consistency model effectively hides the latency of write accesses, the latency of read misses still remains.
Overall, the speedup due to prefetching under SC is typically at least as large as it is under RC. The reduction in read stall time is similar under both SC and RC. However, the reduction in write stall time varies depending on how effectively exclusive-mode prefetching is used. In three cases (OCEAN, LU, and MP3D) exclusive-mode prefetching eliminates most of the write latency, and therefore the speedup due to prefetching under SC is larger than under RC.
Comparing the cases with prefetching under the two models, the SC case approaches the absolute performance of the RC case when either (i) exclusive-mode prefetching effectively hides write latency (e.g., LU) or (ii) there is little write latency under SC to begin with (e.g., CHOLESKY). The case with the largest performance gap is PTHOR, where there is a significant amount of write latency and little of it is hidden by prefetching. However, even when prefetching is relatively successful at reducing write latency under SC, enough write latency remains that the best overall performance always comes from combining prefetching with RC.
Hence, we see that prefetching and relaxed consistency models are complementary. Relaxed consistency models eliminate write latency in shared-memory multiprocessors, and prefetching reduces the remaining read latency.