The prefetching algorithm used so far in this chapter attempts to prefetch
both dense and indirect array references. Indirect references
are prefetched as described in Section . Only
one of the multiprocessor applications contained a significant number of
indirect array references: MP3D. Figure
breaks down
how much of the prefetching benefit came from the dense versus the indirect
prefetches.
As we see in Figure , the overwhelming majority of the
benefit was from prefetching the indirect references. This contrasts
with the results we saw earlier in Figure
for the
uniprocessor version of MP3D, where prefetching the indirect references
offered very little advantage. The difference between these two cases
is that the indirect references are to objects that are very actively
shared and modified amongst the processors (the ``space cells''), whereas
the dense references are to objects that are rarely shared and reside in a
processor's local memory (the ``particles''). The miss latency therefore
tends to be substantially larger for the indirect references, since they
are often found dirty in a remote processor's cache, whereas the dense
references are found locally.
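
To make the distinction concrete, the sketch below (hypothetical code, not
the actual MP3D source) shows a dense reference to a particle array
alongside an indirect reference through a per-particle cell index, together
with the prefetches a compiler might insert for each. The
\verb|__builtin_prefetch| intrinsic and the distance \verb|PF_DIST| are
illustrative stand-ins for the non-binding prefetch instruction and the
prefetch distance assumed in this chapter.

\begin{verbatim}
/* Hypothetical sketch, not the actual MP3D source.  A dense reference
 * (particles[i]) and an indirect reference (cells[particles[i].cell])
 * appear in the same loop; __builtin_prefetch stands in for a
 * non-binding prefetch instruction, PF_DIST for the prefetch distance. */
#define PF_DIST 16

struct particle { int cell; double vel; };
struct cell     { double density; };

void move(struct particle *particles, struct cell *cells, int n)
{
    for (int i = 0; i < n; i++) {
        /* Dense prefetch: the address depends only on the loop index.
         * A non-binding prefetch past the end of the array is harmless. */
        __builtin_prefetch(&particles[i + PF_DIST]);

        /* Indirect prefetch: the cell index must be loaded first, so
         * particles[i + PF_DIST] must already be resident (or itself be
         * prefetched even further ahead).  The guard keeps the index
         * load within bounds. */
        if (i + PF_DIST < n)
            __builtin_prefetch(&cells[particles[i + PF_DIST].cell]);

        cells[particles[i].cell].density += particles[i].vel;
    }
}
\end{verbatim}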
This application illustrates several aspects of our prefetching compiler
algorithm: (i) locality analysis to reduce the overhead of prefetching
dense matrix references (as shown in Figure ),
(ii) prefetching indirect references (as shown in Figure
), and (iii) non-binding prefetching for
multiprocessors (as evidenced by the size of the ``pf-miss:
invalidated'' category in Figure
). We now
consider the final aspect of our multiprocessor prefetching algorithm:
the use of exclusive-mode prefetches.
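
As a brief preview, the sketch below (again hypothetical) uses the write
hint of the same intrinsic as a stand-in for an exclusive-mode prefetch:
the line is requested in an exclusive state ahead of the store, so the
subsequent write does not pay a separate ownership request. This is an
illustration of the idea, not the instruction set used in this study.

\begin{verbatim}
/* Hypothetical sketch: prefetching for write.  The second argument of
 * __builtin_prefetch (1 = prefetch for write) stands in for an
 * exclusive-mode prefetch, requesting the line in an exclusive state so
 * the later store does not incur a separate ownership/upgrade request. */
void scale(double *a, int n, double c)
{
    for (int i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 16], 1 /* prefetch for write */);
        a[i] *= c;   /* read-modify-write of a[i] */
    }
}
\end{verbatim}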