Next:
Architectural Issues
Up:
Prefetching for Multiprocessors
Previous: Summary
In this chapter, we extended our uniprocessor prefetching algorithm to
handle multiprocessor applications. Through the use of non-binding
prefetching, the algorithm is freed from concerns over violating
multiprocessor correctness, and can focus instead on improving
performance. To account for the additional cache misses due to
communication, we modified the original algorithm to conservatively
assume that shared data does not remain in the cache across
synchronization statements. Finally, to hide the latency of gaining
ownership of a line, and to reduce the bandwidth consumed by
read-modify-write sequences, we exploited exclusive-mode prefetching.
Our experiments with prefetching for multiprocessors produced the
following results:
- As in our uniprocessor experiments, the compiler algorithm is
successful at hiding memory latency while minimizing prefetching
overhead, and again improves performance by as much as twofold. These
encouraging results occur despite our algorithm's conservative
assumptions about the effect of communication on miss rates.
- Exclusive-mode prefetching eliminated as much as 27% of the
total message traffic in our architecture. While this had little direct
impact on the performance of our base architecture, we demonstrated that
it could improve performance by as much as 73% for an architecture that
has not already hidden write latency through relaxed consistency
models. In addition, the reduction in message traffic could translate
directly into improved performance on a bandwidth-limited architecture
(e.g., a bus-based multiprocessor).
- Our algorithm is generally robust with respect to variations in
the cache size. Since our algorithm is effective at prefetching
replacement misses, it does not suffer sudden drops in performance as
the cache size is decreased, as we sometimes see in code without
prefetching. With larger cache sizes, where sharing misses dominate, we
still see significant benefits from prefetching in most cases.
- When our compiler algorithm succeeds at inserting prefetches, it
lives up to the performance potential of hand-inserted prefetching.
- To further increase the coverage of our algorithm, techniques
such as interprocedural analysis and recognizing recursive data
structures may be useful.
Overall, these results are quite encouraging, and they demonstrate that
automatic compiler-inserted prefetching is an attractive technique for
tolerating latency in large-scale multiprocessors.