The prefetching algorithm used so far in this chapter attempts to prefetch both dense and indirect array references. Indirect references are prefetched as described in Section . Only one of the multiprocessor applications contained a significant number of indirect array references: MP3D. Figure breaks down how much of the prefetching benefit came from the dense versus the indirect prefetches.
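To make the distinction concrete, the fragment below sketches how the two kinds of prefetches might appear at the source level. It is only an illustration, not the compiler's actual output: the array names, the prefetch distance PF_DIST, and the use of GCC/Clang's __builtin_prefetch as a stand-in for the target's prefetch instruction are all assumptions. The point it shows is that a dense reference can be prefetched from an affine address directly, whereas an indirect reference requires the corresponding index element to be loaded first.

\begin{verbatim}
/* Illustrative sketch only, not the compiler's actual output.
 * PF_DIST and the array names are hypothetical; __builtin_prefetch
 * is a GCC/Clang builtin standing in for the target's prefetch
 * instruction. */
#define PF_DIST 16   /* prefetch this many iterations ahead */

void sum_indirect(double *a, const double *cell, const int *idx, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + PF_DIST < n) {
            /* Dense reference: a[i]'s address is an affine function
             * of i, so the future address is known directly. */
            __builtin_prefetch(&a[i + PF_DIST]);

            /* Indirect reference: cell[idx[i]]'s address is known
             * only after the index idx[i + PF_DIST] is loaded. */
            __builtin_prefetch(&cell[idx[i + PF_DIST]]);
        }
        a[i] += cell[idx[i]];
    }
}
\end{verbatim}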
As we see in Figure , the overwhelming majority of the benefit came from prefetching the indirect references. This contrasts with the results shown earlier in Figure for the uniprocessor version of MP3D, where prefetching the indirect references offered very little advantage. The difference between the two cases is that the indirect references are to objects that are actively shared and modified among the processors (the ``space cells''), whereas the dense references are to objects that are rarely shared and reside in a processor's local memory (the ``particles''). Consequently, the miss latency tends to be substantially larger for the indirect references, which are often found dirty in a remote processor's cache, than for the dense references, which are satisfied locally.
This application illustrates several aspects of our prefetching compiler algorithm: (i) locality analysis to reduce the overhead of prefetching dense matrix references (as shown in Figure ), (ii) prefetching of indirect references (as shown in Figure ), and (iii) non-binding prefetching for multiprocessors (as evidenced by the size of the ``pf-miss: invalidated'' category in Figure ). We now consider the final aspect of our multiprocessor prefetching algorithm: the use of exclusive-mode prefetches.
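Before turning to exclusive-mode prefetches, the following sketch may help make the non-binding property in aspect (iii) concrete. It is an illustrative fragment under assumed names, with __builtin_prefetch again standing in for the target's prefetch instruction: because a non-binding prefetch only moves a copy of the line into the cache rather than binding a value to a register, an invalidation that arrives between the prefetch and the load merely wastes the prefetch (the ``pf-miss: invalidated'' case) while the load itself still returns the coherent, up-to-date value.

\begin{verbatim}
/* Illustrative fragment: why a non-binding prefetch remains correct
 * when the prefetched line is invalidated.  Names are hypothetical;
 * __builtin_prefetch stands in for the target's prefetch instruction. */
double read_cell(const double *space_cell, int c)
{
    __builtin_prefetch(&space_cell[c]);  /* hint: fetch a copy into the cache */

    /* ... independent work overlaps the miss latency here.  If another
       processor writes space_cell[c] during this window, the coherence
       protocol invalidates the prefetched copy; this shows up as the
       "pf-miss: invalidated" category. */

    /* The actual reference is an ordinary load: it misses again,
       refetches the line through the coherence protocol, and returns
       the up-to-date value.  Correctness never depends on the
       prefetched copy. */
    return space_cell[c];
}
\end{verbatim}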