Next:
Architectural Issues
Up:
Prefetching for Multiprocessors
Previous: Summary
In this chapter, we extended our uniprocessor prefetching algorithm to
handle multiprocessor applications. Through the use of non-binding
prefetching, the algorithm is freed from concerns over violating
multiprocessor correctness, and can focus instead on improving
performance. To account for the additional cache misses due to
communication, we modified the original algorithm to conservatively
assume that shared data does not remain in the cache across
synchronization statements. Finally, to hide the latency of gaining
ownership of a line, and to reduce the bandwidth consumed by
read-modify-write sequences, we exploited exclusive-mode prefetching.
Our experiments with prefetching for multiprocessors produced the
following results:
- As in our uniprocessor experiments, the compiler algorithm is
successful at hiding memory latency while minimizing prefetching
overhead, and again improves performance by as much as twofold. These
encouraging results occur despite our algorithm's conservative
assumptions about the effect of communication on miss rates.
- Exclusive-mode prefetching eliminated as much as 27% of the
total message traffic in our architecture. While this had little direct
impact on the performance of our base architecture, we demonstrated that
it could improve performance by as much as 73% for an architecture that
has not already hidden write latency through relaxed consistency
models. In addition, the reduction in message traffic could translate
directly into improved performance on a bandwidth-limited architecture
(e.g., a bus-based multiprocessor).
- Our algorithm is generally robust with respect to variations in
the cache size. Since our algorithm is effective at prefetching
replacement misses, it does not suffer sudden drops in performance as
the cache size is decreased, as we sometimes see in code without
prefetching. With larger cache sizes, where sharing misses dominate, we
still see significant benefits from prefetching in most cases.
- When our compiler algorithm succeeds at inserting prefetches, it
lives up to the performance potential of hand-inserted prefetching.
- To further increase the coverage of our algorithm, techniques
such as interprocedural analysis and recognizing recursive data
structures may be useful.
Overall, these results are quite encouraging, and they demonstrate that
automatic compiler-inserted prefetching is an attractive technique for
tolerating latency in large-scale multiprocessors.